Context-Aware Sliding Window for Sentiment Classification

Sentiment classification is an active area of research with applications in many domains. Researchers have proposed many techniques that identify sentiments with reasonable accuracy. However, these techniques focus mainly on the syntactic and semantic features of the documents. Such features are effective, but they ignore the user’s past sentiments. In this research, we hypothesize that past sentiments help the classifier link the user’s history with the contents of the current tweet, thus allowing learning algorithms to correlate past activities when determining the current sentiment. To this end, we propose three features that accumulate past sentiments from the time series data. We propose seven variations of these Context-aware Sliding Window (CSW) features and apply them to different machine learning and deep learning algorithms. Furthermore, we present a temporal dataset of user tweets, which is manually labeled by nine human annotators. The proposed dataset consists of 4,557 tweets from 36 users. Results indicate significant improvements over six state-of-the-art baseline methods.


I. INTRODUCTION
Over the past decade, many social media platforms have gained popularity by allowing users to freely share their opinions and thoughts. These opinions and thoughts are often analyzed using automated sentiment classifiers to help various industries, such as airport service quality [1], food companies [2], and many others [3], [4]. Sentiment classification is an active area of research; numerous methods have been proposed to classify sentiments [5]-[7]. Most of the past research uses only the contents of the current tweet to identify sentiments. Although this kind of sentiment classification can be effective, it can be enhanced by utilizing sentiment clues from the past. The past experiences of a person affect his or her present sentiments. For example, if a person is angry, his or her sentiments are likely to remain the same for some period of time.
To utilize the sentiment clues from the past, we propose context-aware sliding window (CSW) features. CSW features accumulate past sentiment clues within a given time frame to classify the current tweet. We propose two categories of features in addition to the content features: (1) sliding window features and (2) ratio of sentiments features. The first category accumulates the past sentiments within a window size. The window size can be either temporal (last t minutes) or non-temporal (last k sentiments). The temporal sliding window assumes that the user's activities change over time; therefore, the temporal window captures the most recent activities. Generally, the temporal window accumulates sentiment clues for a specific time period, which is useful for obtaining information from frequent users, who tweet regularly, and from normal users. However, a limited time span may not be useful for users who do not share tweets regularly, which leads to missing sentiment clues in the temporal sliding window feature. For this reason, the last k sentiments feature extracts past sentiments regardless of the time, which reduces the number of missing values. Both the last k sentiments and temporal sliding window features gain insights from the recent past, but they ignore insights from the overall activities of a user. Therefore, the ratio of sentiments feature gathers all the past sentiments of a user. The ratio of sentiments feature aims to create a user profile, which can later be combined with other features to produce an accurate classification.
The associate editor coordinating the review of this manuscript and approving it for publication was Bilal Alatas.
In this paper, we propose seven novel variations of the features for sentiment classification. These variations learn patterns from users who have high, normal, and low tweet frequencies. We analyze the accuracy of the proposed models with different window sizes, i.e., 30 minutes to seven days for the temporal sliding window and 1 to 5 for the last k sentiments feature. In the absence of a publicly available temporally labeled user sentiment dataset, we develop our own dataset for evaluation. The dataset consists of 4,557 tweets from 36 users and is manually labeled by nine human experts. We analyze our results on a large list of window sizes to give a clear picture of the performance under different window sizes. Results suggest that the most effective time window is between 30 minutes and 24 hours; similarly, the best value of k is between 2 and 5. We compare our proposed feature-driven models with six state-of-the-art baselines from the machine learning and deep learning domains. All of our proposed models show improvements over the baselines. Following are the key contributions of this research:
1) We propose three novel features (i.e., temporal sliding window, last k sentiments, and ratio of sentiments) for sentiment classification. We also propose seven variations of these features to learn patterns regardless of the user's tweet frequency (i.e., high, normal, or low).
2) We analyze results on various temporal and non-temporal window sizes.
3) We present a new user-based temporal sentiment dataset of 4,557 tweets from 36 users.
The rest of the paper is organized as follows: Section II discusses the background and related work on sentiment classification. Sections III and IV introduce our proposed features and the research questions for this study. Section V introduces the dataset and the labeling mechanism. Section VI discusses the results. Finally, Section VII presents our conclusions.

II. BACKGROUND AND RELATED WORK
Significant research has been carried out in the area of sentiment classification. We divide sentiment classifiers into four categories, i.e., lexicon, machine learning, deep learning, and temporal models.

A. LEXICON BASED SENTIMENT CLASSIFICATION
A lexicon is a dictionary of words, where each word has a positive, negative, or neutral polarity score. A lexicon is not limited to words; it can also include emojis with their respective polarity scores. Therefore, some researchers apply polarity scores from emojis to determine the sentiments [8], [9]. Words in a tweet are matched with words in the lexicon, and the polarities of the matched words are aggregated to measure the overall polarity of the tweet [10]-[12]. This simple matching technique does not consider the semantics or emotions of a user. Besides that, the positive and negative words in a specific lexicon are limited, so researchers extend the lexicon by adding co-occurring words [13], [14]. Co-occurring words can effectively extend the lexicon, but they may add semantically unrelated words. Therefore, Saif et al. [15] proposed a model, SentiCircles, to group similar co-occurring words using the term frequency-inverse document frequency (TF-IDF) and the prior sentiments of the words. This way, the authors were able to extend the lexicon with semantically related words.
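The matching-and-aggregation step described above can be illustrated with a minimal sketch. The toy lexicon, the function names, and the whitespace tokenizer below are illustrative assumptions and are not taken from any of the cited systems; real lexicons hold thousands of scored entries.

```python
# Hypothetical toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"love": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}


def lexicon_polarity(tweet, lexicon=TOY_LEXICON):
    """Sum the polarity scores of the tokens that appear in the lexicon."""
    tokens = tweet.lower().split()  # naive whitespace tokenizer (assumption)
    return sum(lexicon.get(token, 0.0) for token in tokens)


def polarity_label(score):
    """Map an aggregated score to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["I love this great phone", "terrible service", "the sky"]:
        print(text, "->", polarity_label(lexicon_polarity(text)))
```

Because unmatched tokens contribute nothing, a tweet written in slang or with misspellings would be scored as neutral here, which is the limitation of lexicon-based techniques that this section discusses.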
This expansion might be ineffective for social media users, who use acronyms and misspelled words. Satapathy et al. [11] proposed to correct the spelling mistakes with the phonetic sounds of a word using Soundex. Moreover, they also replaced the abbreviations with their full forms using a fixed abbreviation list.
Lexicon based techniques must incorporate negation words because they can flip the overall sentiment of a tweet. Mukhtar and Khan [16] proposed a total of 154 features, some of which capture the occurrence and placement of negation words to classify sentences with negation.
In a social network, users may share tweets in multiple languages; in such a case, it might be difficult to devise a multi-lingual lexicon. Therefore, Asghar et al. [17] proposed some rules to classify the sentiment of a tweet. Rule-based techniques can be inaccurate because they rely on a fixed set of rules, which may become outdated after some time.
Lexicon based techniques are effective and easy to implement because they do not require any labeling. On the other hand, they rely on dictionaries, so they might not handle slang, acronyms, and misspelled words.

B. MACHINE LEARNING BASED SENTIMENT CLASSIFICATION
Sentiment depends on many features; some of them are taken from contents while others come from user's profile, images, URLs, etc. Together these features make a large search space for manual classification; therefore researchers use machine learning algorithms to effectively find patterns in high dimensional data.
Some researchers proposed to apply textual features like n-grams, bag of words (BOW), and term frequency-inverse document frequency (TF-IDF) in machine learning algorithms to classify sentiments [13], [18], [19]. Other researchers devised hybrid techniques, which combine lexicon and machine learning approaches to recommend sentiments [14], [16], [20]. A tweet may contain different opinions toward multiple entities; therefore, it is important to link sentiments with entities/organizations. In such a case, these lexicon features may not be effective because they cannot identify the relationship between words. For example, the tweet "I love chocolate cookies" shows a positive relationship between the person and chocolate cookies. Therefore, researchers adopt a part-of-speech (POS) tagger to first identify the entities and then find the sentiments toward those entities [10], [13], [21].
None of the above techniques considers the context or semantic similarity of words during classification. The context of words can be incorporated in a machine learning algorithm using word embedding [18]. This helps machine learning algorithms learn a dense distributed representation for each word. Similarly, some researchers devised a temporal embedding of users and products; this embedding helps in learning patterns from the history (of users and products) while recommending the current sentiment [22].
Most machine learning techniques rely on supervised learning, which requires manual or semi-automatic labeling of data. Labeling is a time-consuming activity; therefore, Pandey et al. [20] proposed a hybrid learning mechanism that creates clusters from the initial samples and supplies them to Cuckoo search [23] for classification. Cuckoo search is a mechanism that takes clusters (initial samples) and then generates a new population by pairing the best samples (using a fitness function). Such an approach is effective for large unlabeled data, but it may get stuck in local optima because of inaccurate clusters.

C. DEEP LEARNING SENTIMENT CLASSIFICATION
In the past, machine learning and lexicon-based techniques have been used for sentiment classification. Some researchers modeled features to enhance the functionality of the learning algorithms [13], [18], [19]. However, feature modeling requires careful shortlisting of the discriminating features, whereas deep learning automatically learns discriminating features from the data. The process of deep learning starts with learning a word embedding that gives similar words a similar representation. However, similar words may have opposite sentiments; in such a case, the word embedding assigns nearly the same representation to words of opposite sentiment, thus misleading the classifier. Therefore, Tang et al. [24] focus on learning a sentiment-specific embedding. The idea is to separate opposite-polarity words (for example, good and bad) that would otherwise have neighboring vectors. In simple terms, the authors emphasize the sense of the word while creating the embedding. In the same manner, Chen et al. [25] use emojis and an attention mechanism to emphasize the sense within the emojis while creating an embedding. Emoji-, word-, or sentiment-specific embeddings do not consider how the sense varies across topics. Therefore, Zhao and Mao [26] applied the LDA model to learn a topic embedding for sentence classification. Most word embedding based techniques rely on words with no or only minor spelling mistakes. Since Twitter limits the number of characters, tweets often contain acronyms and misspelled words. Therefore, the authors in [27], [28] proposed character-level embeddings to learn embeddings for multiple languages with spelling mistakes.
After learning word embeddings, researchers have applied them to various deep learning techniques like Convolutional Neural Network (CNN), Gated Neural Network (GNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), etc. A CNN is a deep neural network whose hidden layers apply filtering (convolution) and pooling functions to learn patterns in the data. The pooling function reduces the dimension of the input layer by extracting maximum and minimum values within bins (regions of interest). CNN naïvely assumes that the layers are independent of each other. In contrast, LSTM, RNN, and GNN assume that the output of the current layer depends on the computations in the previous layers.
With the same assumption, Tang et al. [29] used a Gated Neural Network (GNN) to model documents sequentially. The notion of the model is to predict the sentiment of the current sentence based on the sentiment of the last sentence. This idea differs from our approach in two ways. First, we hypothesize that sentiments depend on the recent context, which can be learned from variable window sizes rather than from the last sentiment alone. Second, we rely on Twitter users, who communicate informally without any structure, whereas a document is formally written.
Some researchers [30], [31] have emphasized the entity/topic of the tweet during classification. In the same manner, Shi et al. [32] emphasize hierarchies by proposing a hierarchical LSTM with textual and user profile features. Each deep learning method has its positive and negative aspects; therefore, researchers have proposed ensemble techniques with CNN, RNN, and LSTM to avoid the drawbacks of each model [33], [34]. In the same fashion, Yang et al. [35] interpolated the outputs of CNN and LSTM by weighting the predictions of both models. Generally, RNN suffers from the vanishing gradient problem (i.e., the contribution of distant layers vanishes after a few layers), so Ding et al. [36] proposed to directly connect all layers to reduce the effects of vanishing gradients.
Deep learning methods are effective in sentiment classification, but they require a large annotated dataset. Therefore, Yang et al. [37] proposed a sentiment classification technique that uses both a supervised deep neural network and an unsupervised probabilistic generative model. On the other hand, Xu and Tan [7] proposed a semi-supervised learning method that uses a deep generative model to transfer the structure of labeled data to a large amount of unlabeled data.

D. TEMPORAL SENTIMENT CLASSIFICATION
Since users share their opinions over different time spans, it is essential to learn the historical patterns from the temporal dimension. Temporal sentiment classification focuses on time series (temporal dimension) data to identify the sentiment of a tweet. One such study applied user and product embeddings on the temporal dimension to identify the sentiment [22]. The idea is based on the assumption that a user gives similar sentiments to similar products in the future. The user/product embedding uses the recent trends of products and users to identify the sentiment. However, the assumption of the same sentiments for similar products can be misleading because humans also give equal importance to their own experiences alongside the history of the product.
Some researchers [21], [38] have focused on analyzing users' sentiments over time. First, they identify the sentiment of the tweets; these sentiments are then used to analyze the distribution of each user's sentiments. The authors concluded that users tend to share more positive sentiments; however, on some occasions (like terrorist attacks, corruption cases, etc.) negative sentiments tend to increase [38]. Therefore, it is important to understand the context before identifying the sentiments. The authors in [39] considered the context by taking the last conversation of the current tweet. They extracted the conversation by grouping tweets using mentions and proposed to learn the context using an LSTM, which can learn long-distance sentiment context. Conversation based models are only capable of learning the context within the conversation; they ignore the overall context of a user. These techniques differ from our research because we emphasize the user's recent context, which can help recommend accurate sentiments for both conversational and non-conversational tweets.
Few research articles have focused on the user's recent context while identifying the sentiment. Therefore, we intend to use the temporal dimension by hypothesizing that the current sentiment of a user is dependent on his recent sentiments.

III. CONTEXT-AWARE SLIDING WINDOW (CSW)
In this paper, we propose a method that takes the past sentiments along with the contents to identify the sentiment of the current tweet. We hypothesize that the sentiment of the present tweet depends on the recent moods of the user. For example, a user with a high number of negative tweets (in the recent past) will have a high probability of sharing negative content in the current tweet. Therefore, we propose to utilize the past sentiments of a user. Besides that, we learn the profile of a user so that patterns of similar users can be used to identify the sentiment of the current tweet.
We propose three categories of features. The first category is the contents feature, which mainly consists of the bag of words model. The second category contains the sliding window features, which define the size of the window (either temporal or non-temporal) and accumulate the sentiments within the defined window size. The third category gathers the overall personality of a user by computing the ratio of positive, negative, and neutral tweets. These features are applied to machine learning and deep learning algorithms to identify sentiments. The details of the categories are given below:

A. CONTENTS FEATURES
Contents features are widely used for sentiment classification [13], [18], [19]. There are many ways to model the content features, such as the bag of words (BOW) model, n-grams, etc. In this research, for machine learning algorithms we use the BOW model because of its simplicity. For deep learning algorithms, we use GloVe embedding [40] to obtain a lower-dimensional vector representation for each word.
Contents features are the key indicators to determine the sentiments in textual documents. However, tweets have few words, which makes it difficult to accurately identify the sentiments. In this research, we used the following content features:
• Bag of words: This model is used for the machine learning models. The weights are given based on term frequency-inverse document frequency (TF-IDF), which gives high weight to rare words in the corpus.
• GloVe embedding: This lower-dimensional vector representation is used for all deep learning models.
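As a rough illustration of the TF-IDF weighting used for the bag of words feature, the sketch below computes one common TF-IDF variant (term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency). The exact variant, the tokenizer, and the function name `tfidf` are assumptions; the paper does not state which formulation or library was used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    corpus = ["good movie", "bad movie", "good plot"]
    for doc, weights in zip(corpus, tfidf(corpus)):
        print(doc, weights)
```

In this toy corpus, "bad" occurs in a single document and therefore receives a higher weight than "movie", which occurs in two; this is the sense in which TF-IDF favors rare words.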

B. SLIDING WINDOW FEATURES
Generally, human beings tend to share their current opinions based on their prior thoughts or opinions. These prior thoughts or opinions can be accumulated using the past sentiments of a user. Therefore, we propose a sliding window mechanism for sentiment classification. This mechanism is well known in event detection, where the aim is to analyze words or hashtags within a time window [41], [42]. Following the same approach, we base this feature on the assumption that the current tweet's sentiment is dependent on past sentiment(s). In this paper, we propose two sliding window features, i.e., the temporal sliding window and the last k sentiments. The details of each feature are given below.

1) TEMPORAL SLIDING WINDOW
The temporal sliding window aggregates the sentiments of a user within a time span. It aims to consider the most recent sentiments of a user in identifying the current sentiment. Figure 1 illustrates an example of the temporal sliding window applied to two users. The example uses a window size of one hour; therefore, the feature for tweet #2 of user #1 is Null (as shown in Figure 1) because the time difference is greater than one hour. Similarly, tweet #3 of user #1 has a value of one in its temporal window feature because only the second tweet of user #1 falls within the one-hour window. Algorithm 1 shows the process of the temporal sliding window-based feature extraction. The process takes the samples $U_i$ of the i-th user and the window size as inputs. Afterwards, the algorithm sorts all the tweets temporally. From the sorted tweets, it computes the time difference ($TD_{pc}$) between the current tweet and each previous tweet of the i-th user, keeps the tweets that satisfy $TD_{pc} \le W_{size}$ (where $W_{size}$ represents the window size), and stores the average of the sentiments satisfying the window size. The algorithm returns the temporal window features.
The temporal sliding window assumes that a user shares his opinion based on his present and past sentiments. For example, a person feeling angry after losing his job is likely to share more negative tweets that day. In such cases, the temporal sliding window will have more negative tweets in the recent past, so the current tweet is likely to be negative unless the content features have a strong recommendation towards positive sentiments. Similarly, a user may share more positive tweets if he is happy. Therefore, we hypothesize that the user's sentiment in the current tweet is dependent on his most recent sentiments. The goal of the learning algorithm is to use the recent sentiments of a user.
The temporal sliding window aims to learn patterns from recent past activities. We take an average of the sentiments in the sliding window. Our proposed temporal sliding window learns useful patterns like the polarity of the current sentiment with respect to the changes in the sentiments over the past.
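The temporal sliding window feature described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: the numeric sentiment encoding (-1, 0, 1), the use of `None` for Null, and the function name are assumptions.

```python
from datetime import datetime, timedelta


def temporal_window_feature(tweets, window):
    """Average the sentiments of a user's previous tweets inside the window.

    tweets: list of (timestamp, sentiment) pairs for ONE user, where the
            sentiment is assumed to be encoded as -1, 0, or 1.
    window: timedelta giving the window size.
    Returns one feature per tweet in chronological order; None stands for
    Null when no previous tweet falls inside the window.
    """
    ordered = sorted(tweets, key=lambda pair: pair[0])  # sort temporally
    features = []
    for index, (current_time, _) in enumerate(ordered):
        # Keep previous tweets whose time difference is within the window.
        recent = [
            sentiment
            for timestamp, sentiment in ordered[:index]
            if current_time - timestamp <= window
        ]
        features.append(sum(recent) / len(recent) if recent else None)
    return features


if __name__ == "__main__":
    start = datetime(2020, 1, 1, 9, 0)
    user_tweets = [
        (start, 1),
        (start + timedelta(hours=3), 1),
        (start + timedelta(hours=3, minutes=30), -1),
    ]
    print(temporal_window_feature(user_tweets, timedelta(hours=1)))
```

With a one-hour window, the second tweet has no predecessor inside the window and the third tweet sees only the second one, which mirrors the Figure 1 example of a Null value followed by a value of one.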

2) LAST K SENTIMENTS
Algorithm 2: Computing the last k sentiments feature. Input: all samples of a user $U_i$ and the value K.
The temporal sliding window is effective in gathering recent past activities. However, the feature of a tweet is Null when the user has no previous tweet within the specified time window. As an additional feature, we extract the last k sentiments regardless of the time. Figure 2 shows an example of the aggregated sentiment with K = 2. Algorithm 2 shows the process of extracting this feature: it iterates over the previous K tweets ($S_{i(t-1)} \rightarrow S_{i(t-K)}$, where t is the index of the current tweet) to produce the average sentiment of the previous tweets of a user. This feature adds value for users who tweet infrequently. Infrequent tweets induce many missing values in the temporal sliding window feature space. Therefore, the last k sentiments feature can help learn patterns from both the short- and long-term activities of the users. The last k sentiments feature contains fewer missing values, thus allowing the learning algorithms to find patterns within a larger search space. We hypothesize that the last k sentiments feature can help the learning algorithms learn patterns from a large array of sentiments (of a user). Moreover, we hypothesize that this feature obtains a broader picture of the user by incorporating sentiments from both long- and short-term tweets.
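A minimal sketch of the last k sentiments feature is given below. It assumes that the sentiments are already sorted chronologically and encoded as -1, 0, or 1, and that fewer than K previous tweets are simply averaged as they are; the paper does not specify this boundary behavior, so it is an assumption, as is the function name.

```python
def last_k_feature(sentiments, k):
    """Average the sentiments of the previous k tweets of one user.

    sentiments: chronologically sorted list of numeric sentiments
                (assumed encoding: -1, 0, 1).
    k:          number of previous tweets to aggregate.
    Returns one feature per tweet; None (Null) only for the first tweet,
    which has no history at all.
    """
    features = []
    for t in range(len(sentiments)):
        previous = sentiments[max(0, t - k):t]  # up to k earlier tweets
        features.append(sum(previous) / len(previous) if previous else None)
    return features


if __name__ == "__main__":
    print(last_k_feature([1, -1, 1, 1], k=2))
```

Unlike the temporal window, this feature is missing only for a user's first tweet, because it ignores how much time has passed between tweets.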

C. RATIO OF SENTIMENTS
Algorithm 3: Computing the ratio of sentiments feature. Input: all tweets $S_i$ of a user $U_i$. Output: ratio of sentiments RS.
Sliding window features are essential in learning the behavioral patterns of users over a time span; however, they do not handle frequent variations in the sentiments well. For example, a user might change his sentiments frequently and therefore have a mix of positive and negative tweets within a sliding window. In such cases, the past sentiments in the window may not give an overall picture of the user's personality or his way of thinking. Therefore, we propose to compute the ratio of positive, neutral, and negative sentiments from all the past tweets (i.e., $S_{i0} \rightarrow S_{i(t-1)}$). Algorithm 3 shows the process of extracting the sentiment of the past tweets and then computing the percentage of positive, neutral, and negative sentiments. Figure 3 shows an example of the ratio computed from all the past tweets, which gives the overall behavior of a user. We keep the shares of positive, negative, and neutral tweets separate in the ratio of sentiments feature because we use the variations in the user's past, which might be lost if we took a single average of all the sentiments. In simple terms, we apply the ratio of positive, neutral, and negative sentiments to correlate patterns within these features with the current sentiment. For example, if a user has a high share of both positive and negative sentiments in the past, then his current tweet is unlikely to be neutral. To find the correct sentiment for a tweet, Algorithm 3 specifies boundaries to consider a tweet as positive, negative, or neutral.
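The ratio of sentiments computation can be sketched as below. This illustration assumes that every past tweet already carries a discrete label, so the boundaries that Algorithm 3 uses to assign those labels are not reproduced here; the label strings and the function name are likewise assumptions.

```python
CLASSES = ("positive", "neutral", "negative")  # assumed label names


def ratio_of_sentiments(labels):
    """Compute, for each tweet, the class ratios over all EARLIER tweets.

    labels: chronologically sorted sentiment labels of one user.
    Returns one (positive, neutral, negative) tuple per tweet, or None
    for the first tweet, which has no history.
    """
    counts = {name: 0 for name in CLASSES}
    features = []
    for label in labels:
        seen = sum(counts.values())
        if seen == 0:
            features.append(None)
        else:
            features.append(tuple(counts[name] / seen for name in CLASSES))
        counts[label] += 1  # the current tweet only affects later features
    return features


if __name__ == "__main__":
    history = ["positive", "negative", "positive", "neutral"]
    for label, ratio in zip(history, ratio_of_sentiments(history)):
        print(label, ratio)
```

Keeping the three ratios as separate values, rather than a single average, preserves the case where a user is often positive and often negative but rarely neutral.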
The ratio of sentiments feature builds a user profile by accumulating the overall sentiments of a user. This user profile helps in learning the general behavior of a user. For example, a user with mostly positive tweets has a high probability of sharing more positive tweets. This feature can be used with the sliding window features to learn patterns from both the general and recent perspectives while classifying the current tweet.

As shown in Figure 4, this research question focuses on applying both the temporal sliding window and last k sentiments features in learning algorithms. The aim is to learn patterns from all kinds of users, i.e., normal, frequent, and infrequent users. These two features can complement each other in different ways; for example, a normal user may have many missing values in the temporal sliding window, in which case the last k sentiments feature gives an overview for that user. We test whether these features collectively can add value to the contents during classification.

5) Research Question 5 (RQ5): Can the temporal sliding window and ratio of sentiments features together help in classifying tweets effectively?
This research question emphasizes the use of the temporal sliding window and ratio of sentiments features, as shown in Figure 4. Together these features provide an overview of the personality (extracted from the ratio of sentiments) along with the recent perspective of a user. Moreover, these features can capture changes in user behavior, such as a person who shares positive tweets and suddenly starts posting negative tweets. The learning algorithm learns such patterns, which can later be applied to users with similar behavioral changes. Since the temporal window can have missing values, the ratio of sentiments feature can give useful clues. For example, a tweet posted after a week will be marked Null in the temporal sliding window even though the user history has more positive tweets. In this case, the learning algorithm uses patterns (from the ratio of sentiments) of similar users to predict the sentiment. This research question focuses on finding whether the recent perspective (from the temporal sliding window) and the overall user behavior (ratio of sentiments) help in discriminating tweets.

6) Research Question 6 (RQ6): Can the last k and the ratio of sentiments features together help in classifying tweets effectively?
In this research question, we use the last k and ratio of sentiments features, as shown in Figure 4. The idea is to combine the insights from both features in order to avoid any misleading information. For example, consider a user who posts 20 tweets in an hour, most of which are positive except the last few. If we set K = 5, the last k sentiments feature provides incomplete information. In this case, the overall view provided by the ratio of sentiments feature can give complete information about that user. Therefore, these features together might overcome incomplete information.

7) Research Question 7 (RQ7): Can the temporal sliding window, last k and the ratio of sentiments features together help in classifying tweets effectively?
In this research question, we provide all the perspectives (i.e., recent, last k, and the ratio of sentiments) to the learning algorithm, as shown in Figure 4. These features allow learning algorithms to learn a broad set of patterns within the user's recent, long-term, and overall behavior. Since user behavior changes over time, these features together can correlate the patterns from recent activities with those from a variable time span (last k sentiments) and from the overall sentiments (ratio of sentiments). For example, a user who has overall negative tweets may sometimes share a number of positive tweets in a short time span. We test the assumption that correlating behavior patterns in the recent, variable time span, and overall activities can help in finding the sentiment of a tweet.

V. DATASET
To evaluate the research questions, we require a labeled dataset of all the tweets in each user's timeline. In the absence of a publicly available temporally labeled sentiment dataset, we develop our own dataset using the process shown in Figure 5. The first step is to select a list of users. To gather an unbiased and diverse set of users, we use the list of the top 100 influential people published by TIME magazine in 2018.¹ The list consists of people with diverse professions, including actors, comedians, musicians, politicians, activists, and leaders. We choose influential people because they follow diplomacy and act as agenda seekers to gain more traction in the community [43], [44]. These influential people therefore use a specific vocabulary to avoid offending anyone in the community. In contrast, common people may use simple vocabulary to show their sentiments, which makes it easy for the classifier to identify sentiments. In this research, we intend to apply the proposed technique to the challenging task of learning different aspects of the diplomatic responses of influential people. Generally, the accounts of influential people tend to remain active, whereas some regular accounts get suspended. Besides that, influential people actively share their thoughts, which allows us to obtain a large number of tweets per user. We believe that the proposed method would perform much better on datasets of common users because of their diverse vocabulary, which would help the classifiers to learn separation boundaries easily.
¹ https://web.archive.org/web/20181014032646/http://time.com/collection/most-influential-people-2018
In the second step, we manually searched for the Twitter accounts of these users. We only considered the Twitter accounts that matched the person's profile in TIME magazine. Some of the influential people did not have a Twitter account, or their account was set to private; therefore, we did not add those accounts. Eventually, we were left with 71 accounts, and we used the Twitter API to download up to 3,200 of the most recent tweets of each account. In the process, we downloaded a total of 159,334 tweets.
The third step is to remove noise from the dataset. To remove noise, we applied three filters. The first filter removed all the retweets because they are not written by the users themselves. The second filter removed accounts with less than 40% English tweets. The third filter removed accounts with fewer than 100 tweets. After applying the filters, we took the last two months of tweets to reduce the size of the dataset for human annotators. Finally, our dataset consisted of 4,557 tweets generated by 36 users, as shown in Table 1. A two-month duration contains sufficient tweets per user to train a machine learning model. A suitably learned model can later be used for classifying tweets in real time on a large scale. The dataset contains 53% positive tweets, which is expected because these influential users may share more positive tweets. Furthermore, we cannot use a dataset that is balanced across all sentiments because this research uses many tweets of each user, and we do not have control over the distribution of their sentiments.
In the last step, we asked nine human annotators to label the dataset. We gave an introductory session to each annotator to briefly explain the basic concept of sentiments.
For the labeling process, we split the data into five parts; each part contained all the tweets of seven users on average. Each part contained the user IDs and temporally sorted tweets; the IDs helped the annotators understand the context of the current tweet by referring to the previous tweets of the same user. Moreover, every part was labeled by two different annotators. We computed Cohen's kappa to measure the inter-rater agreement; the kappa score was 0.62, which is considered substantial agreement. This kappa score shows that finding the correct sentiment for these diplomatic entities is difficult, even for humans. We only kept tweets on which the annotators agreed, after which the dataset consisted of 3,483 tweets. To preprocess the data, we removed the stopwords and replaced URLs, emojis, and user mentions with constant markers such as URL, EMO_POS, and USER_MENTION. We used a public emoji list 2 to map emojis to sentiments. We used these markers so that the learning algorithm could find more useful patterns in the text. For feature representation, we used the bag-of-words approach for machine learning models and 100-dimensional GloVe [40] vector representations for the deep learning algorithms.
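For readers unfamiliar with the agreement measure, the following sketch computes Cohen's kappa for two annotators from its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the label strings are illustrative assumptions; the sketch is not the script used to obtain the reported score of 0.62.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Keeping only the tweets on which both annotators agree then amounts to selecting the items where the two label lists match.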

VI. RESULTS
This section discusses the performance of the proposed CSW techniques for the sentiment classification. We propose seven novel variations of features to incorporate context alongside the contents feature. This section is structured as follows: Section VI-A discusses the experiment settings. Section VI-B introduces baselines for the comparison. Afterwards, Section VI-C presents and discusses the results.

A. EXPERIMENT SETUP
Standard methods for evaluating machine learning algorithms, such as k-fold cross-validation, are not appropriate for time series data. Therefore, we use the rolling window evaluation technique, which is commonly used to test the performance of time series predictive systems [45]. It divides the data (of T samples) into chunks of size m, where 1 ≤ m < T. For each iteration i, the rolling window fetches sub-samples using the following equation:
$$W_i = \{(i-1)m + 1, \ldots, im\}$$
where i denotes the iteration number and W_i denotes the i-th window. In simple words, the first window contains samples 1 to m, the second contains samples m + 1 to 2m, and so on. We split each window into 90% for training and 10% for testing because this leaves enough tweets per user for the learning algorithm to learn the decision boundary. We run the deep learning algorithms for 150 epochs and use a dropout of 0.1. Moreover, we use default settings for the machine learning models. We set m = T/5 and compute the F1-score, precision, and recall for each iteration. The final score is computed by taking the average of all the scores.
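The rolling window split can be sketched as below. This is an illustration under stated assumptions rather than the evaluation code behind the reported results: the sketch assumes that the earliest 90% of each window is used for training and the latest 10% for testing, and that any samples left over when T is not divisible by the number of windows are dropped. The function name `rolling_windows` is hypothetical.

```python
def rolling_windows(samples, n_windows=5, train_fraction=0.9):
    """Yield (train, test) pairs for consecutive, non-overlapping windows.

    `samples` must already be sorted by time. The window size is
    m = T // n_windows, so window i covers samples (i-1)*m+1 .. i*m
    (1-based), matching the description in the text.
    """
    total = len(samples)
    m = total // n_windows
    if m < 2:
        raise ValueError("need at least two samples per window")
    for i in range(1, n_windows + 1):
        window = samples[(i - 1) * m : i * m]
        # Assumption: train on the earliest part, test on the latest part.
        split = round(m * train_fraction)
        split = max(1, min(m - 1, split))
        yield window[:split], window[split:]
```

A model would be fitted on each `train` part and scored on the matching `test` part, and the per-window F1-score, precision, and recall would then be averaged.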
For the test data, features that depend upon past sentiments do not use manually labeled sentiments. Instead, we use sentiment clues. Sentiment clues are generated using NLTK 3 toolkit. Though sentiment clues are not accurate when compared to the manually annotated sentiments, they are useful in real-life scenarios where annotated data is unavailable.
For evaluating the temporal window, we use the window size ranging from 30 minutes to 7 days. We rely on the assumption that the sentiment remains the same for a short period of time, therefore we consider multiple window sizes within a day. For the last k sentiments feature, we evaluate the values of k between 1 and 5.
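The difference between the two windows can be illustrated with a short sketch. It assumes that a user's history is a time-sorted list of (timestamp, sentiment) pairs; the function names are hypothetical, and how the collected sentiments are encoded into a feature value is left open here because that is defined elsewhere in the paper.

```python
from datetime import datetime, timedelta


def temporal_window_sentiments(history, now, window):
    """Sentiments of the user's earlier tweets posted within `window` of `now`.

    Returns an empty list when the user posted nothing in that span,
    which corresponds to a missing value for this feature.
    """
    start = now - window
    return [s for ts, s in history if start <= ts < now]


def last_k_sentiments(history, now, k):
    """Sentiments of the user's k most recent earlier tweets, regardless of time."""
    if k <= 0:
        return []
    earlier = [s for ts, s in history if ts < now]
    return earlier[-k:]


# Example: a user with three earlier tweets.
history = [
    (datetime(2020, 1, 1, 8, 0), "negative"),
    (datetime(2020, 1, 1, 20, 0), "positive"),
    (datetime(2020, 1, 2, 9, 0), "positive"),
]
now = datetime(2020, 1, 2, 10, 0)
recent = temporal_window_sentiments(history, now, timedelta(hours=6))
last_two = last_k_sentiments(history, now, 2)
```

With a 30-minute window the same user has no earlier tweets in range, while the last k feature still returns values; this is the missing-value behavior that motivates using both features.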

B. BASELINES
We use the state-of-the-art learning algorithms such as Support Vector Machine (SVM), Random Forest (RF), Stochastic Gradient Descent (SGD), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN) and Gated Recurrent Unit (GRU) as baselines.
All these baselines are provided with the content features, i.e., the BOW approach for the machine learning algorithms and GloVe embeddings for the deep learning algorithms. Moreover, we also added an embedding layer in the deep learning models to learn new embeddings from the current dataset.
The proposed features can be applied to any kind of supervised learning algorithm. We choose algorithms that are widely used in sentiment classification research. The same algorithms are used both as baselines and with the proposed features. In addition, each proposed technique targets a research question discussed in Section IV.

C. RESULTS & DISCUSSIONS
As seen in Tables 2 to 8, the proposed features perform significantly better than the baselines. The best performance is achieved by combining all the features (i.e., temporal sliding window, last k, and ratio of sentiments). Together, these features help in learning patterns from the user's recent and overall history; moreover, they also handle all types of users, i.e., normal, frequent, and infrequent. However, the F1-score values in the results are not very high because identifying the correct sentiments for influential people is a challenging problem. The Cohen's kappa value for inter-annotator agreement supports the same conclusion, i.e., the difficulty of assigning sentiments for such individuals resulted in decreased agreement among the human annotators. We discuss the results of each research question in the sections below.

1) RESEARCH QUESTION 1 (RQ1): CAN THE TEMPORAL SLIDING WINDOW HELP TO CLASSIFY TWEETS EFFECTIVELY?
This research question emphasizes utilizing the most recent sentiments to derive the sentiment of the current tweet. As shown in Table 2, the performance of the proposed models is better than the baseline. Interestingly, almost 50% precision is achieved for window sizes between 13 and 18 hours, which suggests that the current sentiment is derived from the recent mood. In the same manner, the F1-score also indicates better performance within the range of 6 to 18 hours because a wider window captures enough information for the learning algorithms to effectively recommend a sentiment. On the other hand, a shorter time span may not be effective because of many missing (null) values. For example, a user might not share tweets regularly within six hours; therefore, a short time span may have many missing values. So, it is beneficial to take insights from other features when the temporal sliding window has missing values.
The temporal sliding window outperforms all the baselines. However, in a shorter time span, it achieves a relatively low performance because of missing values in the feature set. Table 2 shows that the temporal sliding window has a positive impact on the performance, particularly for window sizes between 6 and 18 hours. In fact, the random forest-based model achieves the highest F1-score of 46%. This performance can further be improved by adding more users to the dataset; that way, learning algorithms can learn a variety of patterns across different types of users. As per the results, the answer to this research question is yes, but we can further improve the performance by adding more data and features.

2) RESEARCH QUESTION 2 (RQ2)
This research question focuses on avoiding missing values by taking past sentiments regardless of time. This feature obtains information from the recent tweets irrespective of when they were posted. It also gives an overview of the users who do not share tweets frequently. Table 3 shows that the proposed models perform better with the last k sentiments feature. The window sizes between 2 and 4 give more discriminating results because these window sizes give a balanced view of both frequent and normal users. The performance starts deteriorating if we increase the value of k (≥ 5) because it may accumulate old sentiments for a user who rarely shares tweets. For example, the last 10 sentiments of an infrequent user may span the last 10 years. Therefore, a value between 2 and 5 past sentiments gives a balanced view of all kinds of users.
The last k sentiments feature shows significant improvements over the baseline. In fact, in some cases, the last k sentiments feature performs better than the temporal window feature shown in Table 2. This shows that the last k sentiments feature handles the missing value problem by allowing learning algorithms to learn effective patterns for any kind of user.

3) RESEARCH QUESTION 3 (RQ3): CAN THE RATIO OF SENTIMENTS HELP TO CLASSIFY TWEETS EFFECTIVELY?
The ratio of sentiments gives an overview of a user. According to Table 4, this feature did not yield a significant increase in performance because the overall view may add confusion. For example, a user who mostly posted negative tweets in the past may occasionally share positive tweets. In this case, the ratio of sentiments feature will recommend classifying the current tweet as negative unless the contents feature has a strong recommendation towards positive sentiment. Therefore, it is necessary to add more features to detect recent changes in activities.
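The example above can be made concrete with a small sketch. It assumes that the ratio of sentiments is the fraction of each sentiment label among all of a user's past tweets; the exact definition is given earlier in the paper, and the function name and label strings here are illustrative.

```python
from collections import Counter


def sentiment_ratio(past_sentiments, labels=("positive", "negative")):
    """Fraction of each sentiment label in a user's past tweets."""
    total = len(past_sentiments)
    if total == 0:
        # No history: every ratio is zero (an assumption of this sketch).
        return {label: 0.0 for label in labels}
    counts = Counter(past_sentiments)
    return {label: counts[label] / total for label in labels}


# A user who mostly posted negative tweets in the past.
ratios = sentiment_ratio(["negative"] * 4 + ["positive"])
```

For this user the negative ratio dominates, so an occasional positive tweet can only be recognized if the contents, or a feature that tracks recent sentiments, points strongly towards the positive class.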
The answer to this research question is plausibly yes because the ratio of sentiments feature yields a slight improvement; however, it can be more effective if we merge both the recent and the overall perspective of a user. For brevity, in the next research questions, we only present the F1-scores.

4) RESEARCH QUESTION 4 (RQ4)
This research question focuses on handling all kinds of users. Table 5 shows the increase in performance with the temporal sliding window and last k sentiments features together. Similar to the results in Section VI-C.2, the proposed features perform best when the value of k is between 2 and 4. In the same manner, the proposed models perform better for temporal window sizes within 24 hours. In this research question, machine learning algorithms perform better than deep learning algorithms because these two features (i.e., temporal sliding window and last k sentiments) together produce four variations, i.e., both positive, both negative, one positive and the other negative, and vice versa. These four variations combined with the contents form a large number of permutations; thus, deep learning algorithms may require more data to effectively learn the weights for each variation. The results highlight that the majority of the proposed feature-driven models obtain a significant increase in F1-score.
The answer to this research question is yes because the majority of the proposed feature-driven models show significant improvements in terms of F1-score. The deep learning models show only a slight improvement over the baseline; however, we hypothesize that more data will help learning algorithms to learn weights for different scenarios (particularly those in which the two features have different sentiments).

5) RESEARCH QUESTION 5 (RQ5): CAN THE TEMPORAL SLIDING WINDOW AND RATIO OF SENTIMENTS FEATURES HELP TO CLASSIFY TWEETS EFFECTIVELY?
The focus of this research question is to combine recent and overall sentiments to give a holistic view of a user. When the temporal sliding window feature has missing values, the ratio of sentiments allows learning algorithms to learn from the overall perspective of a user. Conversely, the ratio of sentiments feature cannot capture changes in recent activities, so the temporal sliding window complements it by adding the recent activities of a user. Table 6 validates this research question, as these features lead to significant improvements over the baselines. In this research question, the ratio of sentiments feature complements the temporal window feature. For example, when both features point to the same recommendation, the classification may have higher confidence. On the other hand, if one feature has a weak recommendation, then the other feature can add its weight to help in effective recommendations. Consider a user who has a similar ratio of positive and negative sentiments in the ratio of sentiments feature, whereas the temporal sliding window feature has more positive sentiments. In this case, the recommendation from the ratio of sentiments feature may have a low probability; thus, combining it with the recommendation from the temporal sliding window feature may increase the confidence of the recommendation. Both of these features work in conjunction with the contents, so the final sentiment depends on both features as well as the contents. Together these features create a dependency, which can help in disambiguation. Table 6 shows that the best performance is obtained for window sizes between 13 and 24 hours, which means that a user's tweets are influenced by his or her mood within a day. Besides that, the results also show a significant increase in the recall values, which means that the proposed feature-based models have more coverage.

6) RESEARCH QUESTION 6 (RQ6)
We use the last k and ratio of sentiments features to address this research question. The idea is to complement the last k sentiments feature with an overall view of a user: the last k sentiments feature gives a specific view of a user, which may not highlight the user's trends, whereas the ratio of sentiments provides a broader perspective of a user. For example, if the last 4 sentiments contain equal numbers of positive and negative sentiments, the algorithm may rely more on the ratio of sentiments and contents features to classify a tweet. Results in Table 7 confirm that the feature-driven model outperforms the baseline. The results show a significant increase at k = 5, which indicates that the overall view from the ratio of sentiments feature helps in further discriminating tweets.
It is evident that the answer to this research question is yes. Moreover, the feature-driven models attain a significantly higher F1-score than the baseline.

7) RESEARCH QUESTION 7 (RQ7)
This research question focuses on using all the features to effectively classify a tweet. It aims to create dependencies among all the features, which allow the learning algorithms to classify correctly even if any feature has missing values. Table 8 shows that all features together gain a significant improvement compared to the models in Sections VI-C1 to VI-C6. Together these features cover the user's recent and overall activities. Besides that, these features play a vital role when the temporal window feature has missing values; in such cases, the remaining two features become useful for the classifier. The results also show that window sizes between 30 minutes and 24 hours perform well even with the missing values in the temporal sliding window. Most of the feature-driven models achieve better performance regardless of the window size, which means that together these features allow learning algorithms to learn more discriminative patterns from the data. The best F1-score is close to 50%, which further shows that the proposed feature-driven models perform better than the state-of-the-art baselines.
Table 9 shows some examples of the output produced by the proposed model (with all the features); the table covers three tweets of different users. These tweets are not classified entirely on the contents; rather, the model uses the overall user profile and previous activities to predict the sentiment. For example, the first tweet is classified as negative, although the contents do not have many negative words. In this case, the person's most recent and overall sentiments were negative, so the learning algorithm correctly disambiguates the context of the current tweet with respect to the past tweets.
The second example shows a tweet of an activist with mixed sentiments. In this case, the contents have more positive words (like Timeless rock, Timeless for smart people, etc.) and the past sentiments do not portray many negative sentiments; therefore, the classifier correctly classifies it as positive. In the third example, a user often criticizes the elected officials for their actions. Although the contents have some positive words, the features based on the past sentiments may assign a higher probability to the negative sentiment.
The answer to this research question is yes, and it is evident that all features together yield significant improvements. Moreover, these features help in disambiguating the context by incorporating the previous activities of a user. Together, all these features make classification a two-step process. The first step creates patterns by correlating the values from all features. The second step ensures that the prediction relies on both the contents and the past patterns.

VII. CONCLUSION AND FUTURE WORK
In this paper, we propose seven context-aware sliding window models along with three novel features for sentiment classification. The aim is to incorporate past insights during the classification of the current tweet. The proposed models can disambiguate the context from the recent activities and history of a user. Moreover, the proposed models can effectively classify the tweets of users who post at varying frequencies. All the proposed models improve over all six state-of-the-art baselines. Among all, the best performing model uses all the features together, which emphasizes that the past sentiments (temporal or non-temporal) and overall sentiments give effective indications to the classifier. The results also show that a window size within 24 hours effectively captures recent mood(s) during classification. Moreover, the non-temporal window size of the past 2 to 5 sentiments provides a good balance between the insights of long and short time spans. Lastly, the ratio of sentiments feature may not be effective unless combined with other features. The findings of this research will help in choosing suitable window sizes for better classification.
We also present the first user-based temporally labeled dataset of 4,557 tweets labeled by nine human annotators. The labeling process achieves satisfactory agreement among the raters. The proposed work presents a new dimension by using the time series to classify sentiments; therefore, it can act as a benchmark for future researchers. In the future, we intend to use the attention mechanism to emphasize the importance of some features. Moreover, we plan to extend the dataset to include a wide variety of contents.
MUHAMMAD ALI MASOOD is currently a Ph.D. Scholar with the Department of Computer Science, Quaid-i-Azam University, Pakistan. His research interests include machine learning, sentiment classification, user classification, cybercrime, social network analysis, and big data analytics. The applications of his research include identifying and predicting outlier behavior through social media analytics and social network analysis.
RABEEH AYAZ ABBASI received the Ph.D. degree from the University of Koblenz-Landau, Germany, in 2010. Since 2011, he has been an Assistant Professor with Quaid-i-Azam University, Islamabad, Pakistan. He has extensive research experience in the fields of social media analytics and social network analysis. His research focuses on leveraging positive aspects of social media, including social media's use in saving lives, understanding events, and analyzing sentiments. He has published more than 35 articles in reputed journals, such as IEEE Computational Intelligence Magazine, Computers in Human Behavior, Telematics and Informatics, Applied Soft Computing, and Scientometrics, as well as in international conferences, such as the ACM Hyper-Text Conference, the ACM World Wide Web Conference, the Pacific Asia Conference on Knowledge Discovery and Data Mining, and the European Conference on Information Retrieval.

NG WEE KEONG is currently an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He works in machine learning, privacy-preserving techniques, query-permissible encrypted databases, enterprise blockchain systems, data security, and blockchain security. His work is motivated by the need to harness the power of data for the betterment of stakeholders, where the data has confidentiality and privacy issues; where the data may be owned and held by different stakeholders, and where the data is so large that it must be hosted in cloud servers that are never completely secured (susceptible to insider and cyber-attacks). He contributes to companies and industries as technology consultant on projects involving data analytics, artificial intelligence, data privacy and security, and blockchain. His work is supported by various research grants. He is an Associate Editor and also a member of the Editorial Boards of five journals. Over the years, he has worked with many talented students on their Ph.D. and Master theses. He writes for online magazines and newspapers and contributes to social causes.