On the Statistical and Temporal Dynamics of Sentiment Analysis

Despite the broad interest in and use of sentiment analysis nowadays, most of the conclusions in the current literature are driven by simple statistical representations of sentiment scores. On that basis, sentiment evaluation currently consists of encoding and aggregating emotional information from a number of individuals and their populational trends. We hypothesized that the stochastic processes that sentiment analysis systems aim to measure will exhibit nontrivial statistical and temporal properties. We established an experimental setup consisting of analyzing the short text messages (tweets) of six user groups of different nature (universities, politics, musicians, communication media, technological companies, and financial companies), including in each group ten high-intensity users in their regular generation of traffic on social networks. Statistical descriptors were checked to converge at about 2000 messages for each user, for which messages from the last two weeks were compiled using a custom-made tool. The messages were subsequently processed for sentiment scoring in terms of different lexicons currently available and widely used. We scrutinized not only the temporal dynamics of the resulting score time series per user, but also its statistical description as given by the score histogram, the temporal autocorrelation, the entropy, and the mutual information. Our results showed that the actual dynamic range of lexicons is in general moderate, and hence not much resolution is given near the ends of their scales. We found that seasonal patterns were more present in the time evolution of the number of tweets, but to a much lesser extent in the sentiment intensity. Additionally, we found that the presence of retweets added negligible effects over standard statistical modes, while it hindered informational and temporal patterns.
The innovative Compounded Aggregated Positivity Index developed in this work proved to be characteristic for industries and at the same time an interesting way to identify singularities among peers. We conclude that temporal properties of messages provide information about the sentiment dynamics, which differs in terms of lexicons and users, but commonalities can be exploited in this field using appropriate temporal digital processing tools.


I. INTRODUCTION
New technologies have brought in the very last few years new and efficient ways of working, as well as new and very straightforward communication methods. The benefits of these technologies as a form of communication are very well known, and instant messaging is boosting the socialization of comments, ideas, and concepts, almost in real time. Sometimes impulsive and sometimes well sustained, but always interactive and inexpensive, these social media communications have been pointed to as a potential source to identify and measure the underlying sentiment of people about tagged companies, persons, or ideas. Sentiment analysis could be emerging as a solid, interesting, and relevant tool to characterize individuals themselves, but also their moods, opinions, and feelings [1], [2]. Traditional research in this area used to rely on extensive and effort-intensive social polls and quizzes. Nowadays, new technology offers a powerful and novel way of doing so, favored by the general affection for social media, websites, forums, blogs, and microblogs, among other sources [3]. Sentiment Analysis has proven to be a remarkable instrument in different research areas and for diverse applications, including examples such as capturing public opinion and investor sentiment about finance issues, stock movements, products, or companies [4], [5]. These efforts are challenging both for researchers and investors [6], [7], as the forecast of stock prices has been a continuous goal for decades [8], [9], and the current always-on availability of social media references seems to be bringing into the scene an unprecedented amount of new information available for evaluation.
The associate editor coordinating the review of this manuscript and approving it for publication was Nadeem Iqbal.
But this is not the whole landscape. In addition, specific studies have focused on particular technical aspects of this phenomenon. As an example, the relationship between social media and financial markets has been analyzed through the specific case of a tweet sent from a hacked Associated Press Twitter account [10]. Another specific case is the scrutiny of the broadcasting power of Twitter in relation to Corporate Social Responsibility and market prices [11]. There is a general consensus that Twitter sentiment and stock price evolution are related; in particular, the volume of tweets appears to be related to the abnormal growth in the daily return of a stock over a certain period, rather than to feelings and price [12]. Also in this direction, a weighted sentiment measure was recently constructed by using tweet messages from a financial community in Twitter, which allowed the researchers to conclude that this community is a robust predictor of financial markets [13].
The present work was proposed in order to proceed in this direction and to scrutinize the statistical and temporal properties of sentiment when measured in social media, such as in the example of Twitter. We evaluated the nontrivial statistical and temporal properties of the relevant and generally available methods hereafter described for sentiment quantification through an aggregating and benchmarking approach from a number of individuals, industries, and their populational trends. In particular, we analyzed the existing variables using different statistical views, including histogram distributions, the autocorrelation of time-dependent statistical variables, the time-domain evolution of all variables, the entropy of significant variables to evaluate the amount of relevant existing information, and the mutual information among said variables. This work analyzed not only the complete set of tweets of the selected users, but also the differential behavior when including and excluding retweets and when using a number of different lexicons, as well as the inter-industry and intra-industry behavior.
Additionally, in order to consolidate and uniquely quantify the positivity attached to a certain user, a novel Compounded Aggregated Positivity Index is presented and evaluated.
Under this perspective, and even though a vast number of experiments, databases, and algorithms have already been described, no systematic approach exists, to our knowledge, allowing us to benchmark the different strategies, lexicons, and semantic analysis techniques in the field of sentiment analysis. Therefore, in this work we ask whether these available lexicons, databases, or techniques, especially all those labeled from an emotional perspective, could eventually provide a coherent and unbiased handle to qualify or quantify the consolidated underlying mood or sentiment behind all intertwined communications. We understand that a positive response could lead to a wide number of applications, as there are already different commercial and free tools offering companies a graphical and numerical perspective of their social media exposure. Results revealed interesting patterns, some of them general, some specific to certain industries, and some reflecting the singular conduct of certain users within industries.
This paper is structured as follows. Section II reviews the background and the literature in this field, including both the general literature as well as the relevant techniques, approaches, and databases used. Section III describes the methods and the experimental setup. Sections IV and V incorporate the experiments with their results, and the conclusions, respectively.

II. BACKGROUND
A wide diversity of published studies on sentiment analysis is present in the literature. This is especially true recently, revealing the potential and topicality of the subject in many fields. This kind of analysis has been devoted, for instance, to consumer goods [14], [15], to brand recognition [16], to characterizing political sentiment [17], [18], or even to real-time scheduling of television programming [19], among others. But the area of stock markets and company evaluation is probably the most intensively analyzed, as the huge economic environment surrounding this specific topic encourages any potential new perspective that may provide tools or support for investor business cases. In this area, exogenous information such as news, political decisions, or customer purchasing expectations has been historically proven to affect short- or medium-term valuations of companies. Therefore, it is not uncommon to find that studies in this area are mostly focused on stock price, and especially on attempts to forecast the price and volume evolution of ultimate share value. As a consequence of this extensive coverage, a corresponding bias can be found in the literature review, especially in terms of examples and references.
In an attempt to make this section independent from the final target pursued by authors in sentiment analysis, it is focused on the techniques, tools, and databases used in the sentiment analysis process. For that reason, it is structured according to the different approaches to sentiment analysis shown in the literature, where we review the most common applications using Machine Learning Techniques, Lexicon-Based Techniques, and Hybrid Techniques.
We can define Sentiment Analysis as the process required to identify the underlying attitudes, opinions, or moods expressed in a certain text or aggregation of organized characters. Sentiment can be either quantified by a number in a certain range, or categorized according to standard subjective statements, such as positive, negative, or neutral. In order to build up a comprehensive description of the different perspectives shown in the literature, we follow the review in [20], which organized the different possible strategies in three main collections, namely, Lexicon-Based, Machine Learning-Based, and Hybrid Approaches (see Fig. 1 for details). But not all three are equally represented in research efforts, and just a few studies have combined Lexicon-Based with machine learning methods and achieved relatively better performance [21].

A. MACHINE LEARNING TECHNIQUES
Entering into what is probably the most promising area of techniques for sentiment analysis, we should mention that machine learning techniques are commonly used nowadays to build models that can predict sentiment over pieces of text. A number of machine learning techniques have been adopted in this area.
First, Natural Language Processing (NLP) is an interdisciplinary research field focused on enabling computers to understand and process human language input. NLP is used to parse written texts to infer their syntax and semantics [22]. Second, Case-Based Reasoning (CBR) is designed to solve a new problem by remembering a previous similar situation and by reusing information and knowledge of that situation [23]. Third, Artificial Neural Networks (ANN) are computational models inspired by the structure and elements of natural or biological neural systems. The structure of an ANN is designed to learn from the data that is presented to it and the output it gives. Using pre-labeled data, the error between the network output and the desired label is minimized through an iterative process. Once the error falls below a certain low threshold, the network is ready to generalize over new data sets [24].
Finally, Support Vector Machines (SVM) correspond to supervised learning methods, and their principle is to improve the generalization capability of learning by seeking the minimum structural risk. As a result, they can obtain the model based on a limited training set and ensure lower error levels in the test, yielding acceptable statistical results even with small statistical samples or with high-dimensional input spaces [25], [26].

B. LEXICON-BASED TECHNIQUES
Lists of words and their associated sentiment score are commonly referred to as sentiment lexicons, and they are widely used in sentiment analysis [27]. A number of different approaches have been described in the literature, and they can be classified according to: (i) The process followed to create the lexicon, i.e., whether it is manual or automatic; (ii) The elements used for classification, which could be based on single words, n-grams, phrases, or even just certain kinds of words such as adjectives; (iii) The classification model, depending on whether it works with discrete levels of polarity or gradual scales. This strategy often suffers from sentiment dictionaries that do not contain enough sentiment words, that omit some domain-specific sentiment words, and that handle polysemic sentiment words and word polarity poorly. From this perspective, we find some current and meaningful contributions to the sentiment recognition of comment texts [28], [29]. The authors used different methods, some based on a naive Bayesian classifier (to determine the field of the text in which the polysemic sentiment word appears), an extended sentiment dictionary, and the design of sentiment score rules. Other authors built a Feature Ensemble Model (FEM) and a Convolutional Neural Network (CNN) model for tweets containing fuzzy sentiment [30].

C. HYBRID APPROACHES
Hybrid approaches consist of using both machine learning and lexicons combined, to provide a better understanding of Sentiment Analysis. They are based on the idea that the complementary interleaving of these two very well-known methodologies may eventually lead to a better categorization and benchmark. We present and describe here some representative examples.
A good example of this could be the design of ANN for learning task-specific word embeddings in other NLP tasks [31]. In the same direction, other authors proposed Sentiment-Specific Word Embedding (SSWE) models. In [32], the authors extracted sentiment embeddings from tweets with positive and negative emoticons as distant-supervised corpora without any manual annotation. They verified the effectiveness of sentiment embeddings by applying them to three sentiment analysis tasks.
Empirical results showed that sentiment embedding outperformed context-based embedding on several benchmark datasets of the tasks. This study differs from many others in the field as they used word embedding to capture word similarities in terms of sentiment semantics.
Another example is the use of linguistic features to detect the sentiment inside Twitter messages [33]. The authors used three corpora of Twitter messages. For development and training, they used the Hashtagged Dataset (HASH), which they compiled from the Edinburgh Twitter corpus, and the Emoticon Data Set (EMOT) from http://twittersentiment.appspot.com. For evaluation, a manually annotated dataset was used, which was produced by the ISieve Corporation (ISIEVE). A different approach in this area tried to automatically collect a corpus for sentiment analysis and opinion mining on Twitter. The corpus contained 300 000 text posts [34]. In this work, an SVM was used to build a sentiment classifier that was able to classify positive, negative, and neutral sentiments for a document, using a previously proposed procedure [35]. Results indicated that this technique was more efficient than previously proposed methods.
One of the newest models of hybrid sentiment analysis is called SLCABG, which is based on the sentiment lexicon and combines a Convolutional Neural Network (CNN) and an attention-based Bidirectional Gated Recurrent Unit (BiGRU) [36]. The authors showed that this model can effectively improve the performance of text sentiment analysis for online purchase satisfaction. They used the sentiment lexicon to enhance the sentiment features in the reviews. Then they extracted the main sentiment features and context features in the reviews on e-commerce platforms by means of the CNN and GRU network. Finally, they classified the weighted sentiment features.

III. METHODS AND STATISTICAL DYNAMICS

A. TIME SERIES NOTATION
The generation of message events and their repetitions can be viewed as temporal random processes, since they can be defined as point processes. We propose here the following simple signal model. Let $S(t, \phi)$ denote a stochastic process defining the $M$-dimensional time evolution of the sentiment of a population of $M$ individuals with respect to some topic $\phi$. Let $s(t, f)$ denote the observable fluctuations from this sentiment evolution, obtained by sampling at a set of times $t_j$, and possibly at different users, that is,
$$s(t, f) = \{ s_j \equiv s(t_j, f), \; j = 1, \ldots, N_m \},$$
where $f$ denotes the observable entity that is used as a proxy for the sentiment topic $\phi$, $t_j$ is the time instant where a sentiment measurement is made through a given measurement instrument, and $s_j$ is this measurement. If we assume that the measurement instrument is a mathematical transformation made on the text content of a short message, then we can denote globally this transformation as $\mathcal{T}_{\ell}$, so that we can express
$$\hat{s}_j = \mathcal{T}_{\ell}(m_j), \quad j = 1, \ldots, N_m,$$
where $m_j$ denotes the text message selected as measured entity, and $\ell$ denotes the lexicon the operator is linked to. In this case, $t_j$ is trivially given by the time instant the message is recorded from the environment, and $N_m$ is the number of messages.
B. TIME SIGNALS, M-MODE, AND AUTOCORRELATION
The simple notation and signal model presented above aim to make the statistical and temporal process dynamics easy to follow by using well-known Stochastic-Process Principles [37]. While convenient as notation, this makes two limitations evident at this point. First, the point process $s(t, f)$ is hard to analyze in this form, as most of Stochastic-Process Theory works on temporal evolution according to a sample period $T_s$ providing a regularly-sampled discrete-time series from the original continuous-time series [38]. The second limitation, directly due to the sampling process, is that populational information is aggregated and heavily condensed, hence mixing different basic properties of the underlying multidimensional sentiment process in a single measurement. Aiming to partially overcome these limitations, the following transformations can be done on the original set of messages $m_j$ collected in a time interval $(t_1, t_{N_m})$. A chosen time integration window $T_w$ plays the role of effective time sampling to transform the point process into a discrete-time process. The total number of hits per window can be straightforwardly obtained, given by
$$c[n] = \sum_{j=1}^{N_m} \mathbb{1}\{ t_j \in T_n \},$$
where $T_n \equiv ((n-1)\,T_w, \, n\,T_w)$ and $\mathbb{1}\{\cdot\}$ is the indicator function. Whereas this does not represent a sentiment measurement itself, it can be seen as a measure of the sampling intensity from the messages when used as an instrument. Also, an average sentiment measurement for each time window can be obtained as follows,
$$\bar{s}[n] = \frac{1}{c[n]} \sum_{j:\, t_j \in T_n} \hat{s}_j,$$
and the standard deviation of the sentiment measurement can be similarly obtained, and it is denoted by $\sigma_s[n]$.
Throughout the experiments, we need a tool for representing in parallel and in an intuitive way the statistical and temporal dynamics of several topics. For this purpose, we use the previously proposed M-mode representation [39].
As a notation example for a multidimensional signal $x(t)$, we can define an M-mode using the dimension number $f$, so that $X(t, f)$ denotes this representation in general terms. For instance, if the second dimension is the index corresponding to the $j$th topic, with $j = 1, \cdots, N_t$ and $N_t$ the number of scrutinized topics, we can define the M-mode of the sampled process as an index-temporal and simultaneous representation of the realizations of the stochastic process as follows,
$$X(t, j) = x_j(t), \quad j = 1, \cdots, N_t,$$
where the convention of capital notation is used for the M-mode of the represented signal. If $x_j(t)$ denotes in general the non-sampled sentiment measurement signals used in this work, then $x_j(t + \tau)$ represents its replica displaced by delay $\tau$, and its autocorrelation function [40] is calculated with
$$r_{x_j}(\tau) = E\{ x_j(t)\, x_j(t + \tau) \}.$$
The M-mode of the autocorrelation functions of the $j$th topic is then just given by
$$R_x(\tau, j) = r_{x_j}(\tau).$$
The described elements allow us to give a set of descriptions of well-known principles in Stochastic Process Theory for the sentiment analysis based on message measurements.
VOLUME 8, 2020
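As an illustration of the transformations described in this subsection, the following minimal Python sketch aggregates a point process of scored messages into windows of length T_w and estimates the autocorrelation of the resulting series. It is only a sketch under stated assumptions: the function names `window_stats` and `autocorr` are hypothetical, windows are counted from the first message time, empty windows are assigned a zero mean, and the autocorrelation is mean-removed and normalized to one at lag zero, none of which is specified in the text.

```python
import math
from typing import List, Sequence, Tuple


def window_stats(times: Sequence[float], scores: Sequence[float],
                 t_w: float) -> Tuple[List[int], List[float], List[float]]:
    """Aggregate scored messages into consecutive windows of length t_w.

    Windows are half-open intervals [t0 + (n-1)*t_w, t0 + n*t_w), with t0 the
    earliest message time (an assumption of this sketch). Returns per-window
    counts, mean scores, and population standard deviations. Empty windows
    get mean and deviation 0.0, which is an arbitrary convention.
    """
    if not times:
        return [], [], []
    t0 = min(times)
    n_win = int((max(times) - t0) // t_w) + 1
    buckets: List[List[float]] = [[] for _ in range(n_win)]
    for t, s in zip(times, scores):
        buckets[int((t - t0) // t_w)].append(s)
    counts = [len(b) for b in buckets]
    means = [sum(b) / len(b) if b else 0.0 for b in buckets]
    stds = [math.sqrt(sum((v - m) ** 2 for v in b) / len(b)) if b else 0.0
            for b, m in zip(buckets, means)]
    return counts, means, stds


def autocorr(x: Sequence[float], max_lag: int) -> List[float]:
    """Biased sample autocorrelation of the mean-removed series, r[0] = 1."""
    n = len(x)
    mu = sum(x) / n
    d = [v - mu for v in x]
    energy = sum(v * v for v in d)
    if energy == 0:
        # Constant series: define r[0] = 1 and zero elsewhere.
        return [1.0] + [0.0] * max_lag
    return [sum(d[i] * d[i + k] for i in range(n - k)) / energy
            for k in range(max_lag + 1)]


def autocorr_m_mode(series: Sequence[Sequence[float]],
                    max_lag: int) -> List[List[float]]:
    """Stack one autocorrelation row per topic or user (M-mode layout)."""
    return [autocorr(x, max_lag) for x in series]
```

With hourly windows, a daily rhythm in the message counts would appear in this layout as a local maximum of the corresponding row near lag 24.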

C. ENTROPY AND MUTUAL INFORMATION
The study of the probability density function (pdf) of a discrete-time random process $s[n]$ (mathematically mentioned here as $p_f$) is relevant because it quantifies the certainty and the uncertainty of the results obtained in each realization. In our case, this function is defined as:
$$p_f(\rho) = \Pr\{ \hat{s}_j(f) = \rho \},$$
where $\hat{s}_j(f)$ denotes the full set of scores for each message $m_j$ of entity $f$, and $\rho$ is the resulting independent variable of all possible score values. The value of this function is non-negative throughout the domain, and it represents the overall distribution of its statistical density, aggregated through time and therefore independently from it. Here again we use the convention of capital notation to jointly represent the estimated pdf of a related set of entities, as denoted by:
$$P(\rho, f) = p_f(\rho).$$
After the introduction of probability density notation, the analysis of the amount of information attached to an entity, based on Entropy ($H$), and the Mutual Information (MI) among pairs of them, is proposed in this section. A relevant number of $H$ measurements have been described in the literature since Shannon's publication in 1948 [42] and the work published by Pincus in 1991 [41]. In this paper, we evaluate the Entropy ($H(j)$), which is understood as the weighted average value of the amount of information of a certain variable. In our case we can rephrase it as the amount of information of all the sentiment scores of the messages related to an entity in a period of time. Mathematically, it can be expressed in the following terms according to the notation developed in this paper:
$$H(f) = -\sum_{\rho} P(\rho, f) \log_2 P(\rho, f),$$
where $\rho$ runs over all unique possible sentiment scores, $f$ is the entity, and $P$ is the discrete probability density function of the score $\rho$ for entity $f$. As a consequence, $H(f)$ is represented by a one-dimensional array of the $H$ evaluated in the time slot under study for every considered entity.
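To make the entropy definition concrete, the following short Python sketch estimates the discrete score distribution of one entity from its message scores and evaluates its entropy in bits. The function names `score_pmf` and `entropy_bits` are hypothetical, and the relative-frequency estimate of the distribution is an assumption of this sketch rather than a detail given in the text.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable


def score_pmf(scores: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each distinct score value (histogram estimate)."""
    counts = Counter(scores)
    total = sum(counts.values())
    return {rho: c / total for rho, c in counts.items()}


def entropy_bits(scores: Iterable[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical score distribution."""
    pmf = score_pmf(scores)
    # Only observed scores enter the sum, so log2(0) never occurs.
    return -sum(p * math.log2(p) for p in pmf.values())
```

For example, a user whose scores are evenly split between two values yields one bit, whereas a user who always produces the same score yields zero.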
The second information driver that we evaluate in this paper is the MI, which is a measurement of the cross-dependence of two paired random variables, and it specifies the amount of information that can be obtained about one random variable from knowing the other [43]. Whereas MI can be defined for continuous and discrete random variables, we work here with the discrete version. In this case, $S_u$ and $S_v$ denote the two random variables corresponding to two entities measured, $\hat{s}_u[j]$ and $\hat{s}_v[i]$, for a given lexicon and sentiment analysis method, which can be divided into $N$ and $M$ states, respectively. Then, the MI of these random variables is given by
$$MI_{u,v} = \sum_{j=1}^{N} \sum_{i=1}^{M} p(\hat{s}_u[j], \hat{s}_v[i]) \log_2 \frac{p(\hat{s}_u[j], \hat{s}_v[i])}{p(\hat{s}_u[j])\, p(\hat{s}_v[i])},$$
where $S_u = \hat{s}_u[j]$, $j = 1, 2, \ldots, N$, and the corresponding $S_v = \hat{s}_v[i]$, $i = 1, 2, \ldots, M$, are said states of the observed time signals. It can be seen that this expression is symmetric in $S_u$ and $S_v$ and always non-negative, and it is equal to zero if and only if $S_u$ and $S_v$ are independent [44]. Its units are bits, since we use the base-2 logarithm for $H(j)$ calculations. Note that if $S_u$ and $S_v$ are independent, then the knowledge about $S_u$ does not provide any information about $S_v$, and hence $MI_{u,v} = 0$. On the other hand, if $S_u$ is a deterministic function of $S_v$ and $S_v$ is a deterministic function of $S_u$, then all the information conveyed by $S_u$ is provided by $S_v$, and vice versa, and in this case, the MI will equal the $H$ of $S_u$, which is also equal to the $H$ of $S_v$.
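The discrete MI can be sketched in Python as follows. This is a minimal plug-in estimate under stated assumptions: the function name `mutual_information_bits` is hypothetical, and the two score sequences are assumed to be already paired sample by sample (for instance, by time window), since the text does not detail how observations of two entities are aligned.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information_bits(u: Sequence[Hashable],
                            v: Sequence[Hashable]) -> float:
    """Plug-in estimate of the mutual information (bits) of paired samples."""
    if len(u) != len(v):
        raise ValueError("u and v must contain the same number of samples")
    n = len(u)
    if n == 0:
        return 0.0
    joint = Counter(zip(u, v))   # joint state counts
    count_u = Counter(u)         # marginal counts for u
    count_v = Counter(v)         # marginal counts for v
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) / (p(a) * p(b)) written with counts: c * n / (n_a * n_b)
        mi += (c / n) * math.log2(c * n / (count_u[a] * count_v[b]))
    return mi
```

The two limiting cases discussed above can be checked directly: identical sequences return the entropy of either one, and sequences whose joint counts factorize return zero.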

D. SELECTED LEXICONS
Several well-known lexicons have been used to perform the experiments described in Section IV. As shown in the experiments, different lexicons can result in slightly different characterizations of the sentiment distributions under study. For example, some lexicons are capable of a richer dynamic range than others. At the same time, we analyze the basic statistical properties of the distributions to determine their dependence with respect to the specific lexicon used.
For the purpose of data acquisition and processing, we developed a tweet ingestion tool that is capable of downloading tweets, up to a specified number of messages or to a specific date limit, from various user accounts or reacting to them via mentions. After tweets have been downloaded, sentiment analysis scoring is performed using the desired lexicon. The sentiment analysis score distribution is then statistically characterized. The sentiment analysis performed by the tool is based on a simple bag-of-words model, which only takes into account individual words and their associated scores. There is no provision for n-grams, full phrases, or any other type of multi-word combinations. In addition, scoring is based on a discrete rating in a range associated to each word, and terms usually associated with a positive (negative) sentiment are associated to a positive (negative) score.
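The bag-of-words scoring just described can be illustrated with the following minimal Python sketch. It is not the tool used in this work: the tokenizer, the function name `score_message`, and the choice of summing word scores into a message score are assumptions, and the example lexicon entries used to exercise it should be taken as made-up values rather than entries of any published lexicon.

```python
import re
from typing import Dict

# Hypothetical tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def score_message(text: str, lexicon: Dict[str, int]) -> int:
    """Sum the lexicon scores of the individual words in a message.

    Words missing from the lexicon contribute zero. No n-grams, phrases,
    or negation handling are considered, as in a plain bag-of-words model.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)
```

Because word order is ignored, a message mixing one positive and one negative term of equal magnitude scores zero, which is consistent with a large neutral mode in the score histograms.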
Most of the lexicons used had to be preprocessed in order to comply with the sentiment scoring model we just described. The following is a description of the used lexicons, as well as the transformations we needed to apply to be able to use them with our model.
First, lexicon AFINN-165-EN (hereinafter referred to as AFINN) [45] consists of 3,382 English words to which a numeric sentiment score was manually assigned. Its score range is [-5,5]. Second, the EmoLex NRC emotion lexicon list (NRC) [46] is a word-based lexicon that was created using a crowdsourcing approach via Mechanical Turk. In it, 14,182 English words were annotated according to ten non-exclusive binary categories, including the eight emotions from Plutchik's wheel of emotions [47]. For our study, we selected the classes named ''positive'' and ''negative'', and ignored words that were classified as neither. There is no strength or intensity gradation in this lexicon, so we arbitrarily assigned score value -3 to all negative samples, and +3 to all positive ones. This resulted in 5,636 words that could be used in our model. Third, the Sentiment Composition Lexicon for Negators, Modals, and Degree Adverbs (SCL) was published in [48]. Sentiment associations were also obtained manually through crowdsourcing using the Best-Worst Scaling annotation technique [27]. It consists of single words and multi-word phrases, which were created by combining single words with modifiers such as negators, auxiliary verbs, degree adverbs, or a combination of those. Since our model requires single words, we discarded all multi-word phrases. We also converted the score scale (a real number between -1 and 1) to the discrete range we use. Fourth, the SemEval-2015 English Twitter Lexicon (SemEval) [49] is another crowd-sourced lexicon consisting of 1,515 terms (including neutral ones), with the particularity that terms are drawn from English Twitter and include general English words, misspellings, hashtags, and other categories frequently used in Twitter. It includes negated expressions, which were excluded from this study. Finally, the SentiStrength lexicon [50], [51] was included here.
Like SemEval, SentiStrength also uses terms and expressions drawn from social media, including emoticons. It assigns sentiment scores to word prefixes, like ''abhor*'', in the belief that any words that use the same prefix, such as ''abhorrence'' or ''abhorrent'' in this example, will share the same sentiment score. Since our corpora were extracted from Twitter we kept all emoticons, but transformed prefixes into single words using an English dictionary. The result was a list of 7,126 non-neutral terms, with the particularity that it includes more negative terms than positive ones. This bias is possibly a consequence of the dictionary expansion we performed on prefixes.
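The lexicon transformations described above can be sketched as follows. These helpers are illustrative only and rest on several assumptions that the text does not settle: the function names are hypothetical, words labeled both positive and negative are dropped, the real-valued scale in [-1, 1] is mapped to integers by multiplying by 5 and rounding, and when a word matches several prefixes the first one processed wins while exact entries always take precedence.

```python
from typing import Dict, Iterable, Set


def nrc_to_scores(positive: Set[str], negative: Set[str],
                  magnitude: int = 3) -> Dict[str, int]:
    """Turn binary positive/negative labels into +magnitude / -magnitude.

    Words carrying both labels are skipped (an assumption of this sketch).
    """
    scores = {w: magnitude for w in positive - negative}
    scores.update({w: -magnitude for w in negative - positive})
    return scores


def rescale_unit_score(x: float, max_abs: int = 5) -> int:
    """Map a real score in [-1, 1] to an integer in [-max_abs, max_abs]."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    # Python rounds exact halves to the nearest even integer.
    return int(round(x * max_abs))


def expand_prefixes(prefix_scores: Dict[str, int],
                    dictionary: Iterable[str]) -> Dict[str, int]:
    """Expand entries such as 'abhor*' into every dictionary word they match."""
    words = sorted(set(dictionary))
    expanded: Dict[str, int] = {}
    exact: Dict[str, int] = {}
    for term, score in prefix_scores.items():
        if term.endswith("*"):
            stem = term[:-1]
            for w in words:
                if w.startswith(stem):
                    expanded.setdefault(w, score)  # first matching prefix wins
        else:
            exact[term] = score
    expanded.update(exact)  # exact entries override expanded ones
    return expanded
```

A prefix expansion of this kind multiplies each stemmed entry by the number of dictionary words sharing the stem, which is one way the negative-term bias mentioned above could arise.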
As can be inferred from these brief descriptions, lexicons and their structure are closely intertwined with language. There are efforts to produce sentiment analysis lexicons in many languages, including Arabic [27] and Chinese [52]. Our study focuses solely on English-language lexicons.

IV. EXPERIMENTS AND RESULTS
In this section we present the results on the statistical and temporal dynamics obtained with a custom-designed sentiment analysis tool, which was created to support the tasks of retrieving the tweets for a set of users and handling the search results. In Fig. 2, an example of the summary results for a selected username in the front-end of the application can be seen. The application supports in the same framework the score calculation according to a set of selected lexicons for each tweet, as well as the analysis of the statistical and temporal dynamics of the sentiment scores per user, and the information-theory analysis within and among groups.
A total of four different sets of experiments were carried out. In Experiment I, we show an overview of statistical and dynamic results for a specific group of usernames related to universities. In Experiment II, we scrutinized the effect of not considering retweets on the same statistics evaluated in the previous one. In Experiment III, we evaluated the effect of using different dictionaries and their impact on the statistical representations. In Experiment IV, we analyzed the behaviour of user groups of different nature, namely, universities, singers, media, political leaders, technological companies, and financial companies (see Table 1 for details).

A. SENTIMENT INFORMATION AND DYNAMICS
This first experiment was performed for the selected universities shown in Table 1. Only users with an appropriate volume of messages over a single week were considered for the study. We used the application program interface provided for public use by Twitter™, which allows compiling the messages for the past 7 days, and only English messages were considered. The dictionary AFINN-165 was used for this first analysis, as it was specifically developed for microblogs. The messages were acquired during the same week in all cases to avoid week-dependent deviations. Figure 3 illustrates the results for the selected group of universities. Panel (a) depicts the time evolution of scores for all users. Note that the tweets were retrieved for each username during seven consecutive days, from Monday 14th to Sunday 20th of October. Accordingly, prior to time-slot consolidation, in this first panel we represented a non-uniformly sampled signal. A strong variability of the time series can be observed, as well as some visual symmetry around the zero score. A different view, now using time-slot consolidated and discretized score terms, is given by the score histograms represented in Panel (b) for each user. The convergence of the histograms was checked for the number of tweets retrieved in the seven-day period. Multimodal distributions can be generally observed for all users, with a commonly shared high-level mode at zero score, which represents a common baseline. On the other hand, additional positive and negative local maxima are easily found in almost all cases, exhibiting particular magnitudes for individual users. Positive and negative modes are quite close to the zero level. In cases where no positive or negative modes are found, a rather continuous decay of the number of messages with increasing score magnitude is observed.
Panels (c) and (d) offer a different perspective, as they show, respectively, H(j), understood as the complementary information of each user, and the MI among users. In terms of information, users in panel (c) share a common profile, with values generally ranging over a short span from 2.7 to 3, and a single exception found in @UCLA. In terms of MI, when retweets are considered, significance could only be found for isolated specific pairs, namely @Harvard vs @MIT and @UCLA vs @Columbia. The relation within the first pair could probably be attributed to geographic proximity, but that is not the case for @UCLA and @Columbia, and the existence of this second pair requires a deeper analysis of specifics. This effect is present both with and without retweets, but it is especially evident when excluding retweets, where @Yale showed a relevant relation with a number of entities. This might be related to the driving effect of Yale as one of the top universities in the USA. Further analysis would be required for a better understanding of these trends. Figure 4 shows a graphical representation of the key statistics analyzed in this paper. For reader convenience, panels in this figure are named consecutively after those of the previous figure, to emphasize that the figures relate to the same experiment. In particular, the first panel of Figure 4 is named (e), following the last panel (d) of the previous figure. This structure is repeated later on with some of the subsequent experiments, for an easier comparison among them. Panel (e) shows the aggregated number of tweets over hourly slots. This representation offers the intensity perspective of the communication. A daily circadian rhythm can generally be appreciated, with effective minima over the night and relatively lower intensity over the weekend.
An example of this behavior can be found in @Cambridge_Uni. Clear exceptions to this daily cycle are certain days in @MIT or @Harvard. In @MIT, there is a relevant increase in the number of tweets after day 1, which starts to decrease after one day. The periodicity of this intensity is presented in panel (f) through the autocorrelation function. As expected from the previous results, maxima are visible at delays of one and two days. In this panel, the seasonality is not as strong for @Harvard and @MIT. For this last user, the autocorrelation decays progressively, showing that it is statistically a non-persistent time process. For the rest of the users, the peaks are present at day one and, to a lesser extent, at the second day, showing somehow a lack of short-time memory regardless of the rationale behind these effects.
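The daily peaks described for panel (f) can be reproduced on synthetic data. The sketch below computes a biased sample autocorrelation, normalized to 1 at lag 0, of hourly message counts that follow a made-up 24-hour cycle. The estimator choice and the synthetic counts are assumptions for illustration, not the processing or the data of this study.

```python
import math

def autocorrelation(series, max_lag):
    """Biased sample autocorrelation, normalized so that lag 0 equals 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    if var == 0:
        # A constant series has no structure to correlate.
        return [1.0] + [0.0] * max_lag
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]

# Synthetic hourly counts over 7 days with a daily cycle (made-up data).
counts = [10 + 8 * math.sin(2 * math.pi * (h % 24) / 24) for h in range(7 * 24)]

acf = autocorrelation(counts, max_lag=48)
print(round(acf[12], 3), round(acf[24], 3), round(acf[48], 3))
```

With this estimator the peak at a lag of 24 hours is higher than the one at 48 hours simply because fewer overlapping samples remain at longer lags, so a weaker second-day peak does not by itself indicate loss of memory.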
Panels (g) and (i) of Fig. 4 show the mean and the standard deviation of the sentiment score for all users during the week under study. Panels (h) and (j) represent the autocorrelation of the mean and of the standard deviation, respectively. In these four panels, no significant visual pattern can be appreciated: there is neither a clearly visible seasonality in the autocorrelation nor definite shapes in the absolute values.
The exclusion of retweets, the use of different lexicons, and the different types of user groups will allow further comparative analysis in the next experiments.

B. DYNAMIC ANALYSIS EXCLUDING RETWEETS
This second experiment was conducted for the same group of university usernames, but without considering retweets in the message acquisition. Several conclusions can be drawn when the statistics and the dynamics are compared with and without retweets, which emerge from the time evolution of the scores. Figure 5 (b) of Experiment II shows positive monomodality for @Harvard (its negative peaks are reduced). In almost all the universities the positive peak was patent, and only exceptionally was no clear peak found on the positive side. There was a negligible number of peaks on the negative side, and when they did appear, they did not reach a level comparable to that of their positive counterparts. There was also a small number of bimodalities on the same side, and when found, they were very mild on the negative side. A comparison of the histograms of the sentiment score for each user with and without retweets, in Figures 3 (b) and 5 (b), shows that the number of occurrences is much lower when retweets are not considered (see Panel (b)), as trivially expected, but the convergence and stabilization of the histograms still occur.
In this second experiment, the positive and negative local modes are still present, but they are less visible relative to the growing zero modes. An exception to this behavior is @Harvard, where positive modes appear to be less affected by the exclusion of retweets.
As far as H is concerned, Figure 5 (c) of Experiment II shows a much more homogeneous H across users, with higher values for those users that exhibited lower values when retweets were considered. The drops in entropy that were previously visible, especially sharp for @UCLA in Figure 3 (c), are no longer as perceptible. A close comparison of these figures shows, for example, a relevant reduction of information for @UCLA and @UCBerkeley when retweets are included. On the contrary, @Oxford, @Harvard, and @Cambridge did not change much in terms of H when retweets were not considered. Therefore, from a statistical perspective, incorporating retweets can have an uneven impact in terms of H for different users, and in some cases the reduction of H in academic institutions is noticeable.
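To make the comparison of H values concrete, the following sketch estimates the Shannon entropy, in bits, of the empirical distribution of discretized scores. Using the plug-in Shannon estimate with base-2 logarithms is an assumption of this example, and the two score lists are made up to contrast a user dominated by the zero mode with a user whose scores are spread out.

```python
import math
from collections import Counter

def shannon_entropy(scores):
    """Shannon entropy (bits) of the empirical distribution of discrete scores."""
    counts = Counter(scores)
    n = len(scores)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up discretized scores for two hypothetical users.
concentrated = [0] * 90 + [1] * 5 + [-1] * 5   # dominated by the zero mode
spread = [-2, -1, 0, 1, 2] * 20                # uniform over five values

h_concentrated = shannon_entropy(concentrated)
h_spread = shannon_entropy(spread)
print(round(h_concentrated, 3), round(h_spread, 3))
```

Under this reading, a stream of retweets that repeats the same few scored messages concentrates the histogram and lowers H, which would be compatible with the drops discussed above.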
The MI in Figure 3 (d) for Experiment I and in Figure 5 (d) for Experiment II is very regular. Of special interest is Figure 3 (d), where, over generally reduced values of MI, two pairs of users, @Harvard vs @MIT and @UCLA vs @Columbia, manifest a singular MI. These appear to be isolated effects, not generalizable to the other user combinations. By the same token, in panel (d) of Figure 5 we can appreciate higher values in general and a much wider cross-relation as far as MI is concerned. Although the previous relation within the two pairs of users is still present, the case of @Yale requires specific attention, as it now shows a relevant MI relationship with a number of universities.
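A pairwise MI value such as those in panel (d) could be estimated from two users' discretized scores aligned by time slot. The sketch below uses the plug-in estimate from the joint histogram, in bits; the slot-by-slot alignment and the made-up sequences are assumptions for illustration only.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Plug-in mutual information (bits) between two aligned discrete sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned slot by slot")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

# Made-up per-slot scores for three hypothetical users.
user_a = [-1, -1, 1, 1] * 20
user_b = list(user_a)            # identical to user_a
user_c = [-1, 1, -1, 1] * 20     # statistically independent of user_a

mi_same = mutual_information(user_a, user_b)
mi_indep = mutual_information(user_a, user_c)
print(mi_same, mi_indep)
```

With short real series the plug-in estimate is biased upward, so small nonzero values between unrelated users should be expected and interpreted with care.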
In Experiment II, panel (f) of Figure 6 depicts the autocorrelation of the intensity, showing the same daily seasonal effect described when retweets were considered, although in this second case it appears to incorporate a higher amount of variability or noise. On the contrary, the autocorrelation of the standard deviation in panel (j), once the retweets are removed, shows a clear daily seasonal effect, which was previously hidden. Note that, in both autocorrelation analyses without retweets, a second peak appears on the second day. These results show a circadian (daily) pattern not only in the number of tweets (intensity), but also, and even more clearly, in the standard deviation of the score. It is remarkable that this effect is not visible at all in the mean of the sentiment, but it is visible in the standard deviation in this case. Figure 6 (i) of Experiment II shows replicable patterns throughout the days, including a lack of signal during the night, which reflects the inactivity also visible in panels (e) and (g) and is coherent with the hours when people sleep. For example, in @Cambridge we can find a flat interval caused by the night. This clearly seasonal profile alone does not explain the results of panel (j), since the same night-day pattern was also visible in Figure 4, yet the corresponding panel (j) there does not reflect it.
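The following toy example shows how a daily pattern can appear in the autocorrelation of the per-slot standard deviation while being absent from that of the per-slot mean. The data are synthetic: daytime slots contain balanced scores of -2 and +2, and night slots are empty. Mapping empty slots to 0 and using the population standard deviation are assumptions of this sketch, not choices documented in this work.

```python
import statistics

def slot_stats(slots):
    """Per-slot mean and population standard deviation; empty slots map to 0."""
    means = [statistics.fmean(s) if s else 0.0 for s in slots]
    stds = [statistics.pstdev(s) if s else 0.0 for s in slots]
    return means, stds

def autocorrelation(series, max_lag):
    """Biased sample autocorrelation, normalized so that lag 0 equals 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    if var == 0:
        return [1.0] + [0.0] * max_lag
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]

# Made-up data: 7 days of hourly slots, active from 08:00 to 20:00.
slots = [[-2, 2, -2, 2] if 8 <= h % 24 < 20 else [] for h in range(7 * 24)]

means, stds = slot_stats(slots)
acf_mean = autocorrelation(means, 48)
acf_std = autocorrelation(stds, 48)
print(acf_mean[24], round(acf_std[24], 3))
```

In this construction the mean is constant, so its autocorrelation carries no daily peak, whereas the standard deviation switches between 0 at night and 2 during the day and therefore peaks at a lag of 24 hours.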

C. IMPACT OF DICTIONARIES AND SCORES
We further aimed to scrutinize the impact of the different dictionaries and scores used as the basis for sentiment retrieval from short messages. For this purpose, we continued using the same set of universities, without taking into account re-sent messages (retweets).
It is relevant to emphasize not only the different number of terms in the selected dictionaries, but also their common intersections, whose counts are shown in Table 2. The largest and the smallest dictionaries are SentiStrength and SemEval, respectively; AFINN and NRC can also be considered large dictionaries, and SCL a reduced one. We can also see that SentiStrength is, to some extent, the dictionary that shares the most terms with the others in general, but not with SemEval, which turns out to be the dictionary with the lowest number of terms shared with all the others.
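Counts of the kind reported in Table 2 (dictionary sizes and pairwise shared terms) can be obtained with set intersections. The lexicons below are tiny made-up placeholders with hypothetical names, not the dictionaries compared in this work.

```python
from itertools import combinations

# Tiny made-up lexicons; real dictionaries contain thousands of terms.
lexicons = {
    "A": {"good": 3, "bad": -3, "happy": 2, "sad": -2},
    "B": {"good": 2, "bad": -2, "awful": -4},
    "C": {"happy": 1, "awful": -1},
}

def shared_term_counts(lexicons):
    """Return {(name1, name2): number of terms present in both lexicons}."""
    return {
        (a, b): len(lexicons[a].keys() & lexicons[b].keys())
        for a, b in combinations(sorted(lexicons), 2)
    }

sizes = {name: len(terms) for name, terms in lexicons.items()}
shared = shared_term_counts(lexicons)
print(sizes)
print(shared)
```

In practice the terms would need a common normalization (for example lowercasing, and a decision on how to treat stems or wildcards) before intersecting, which is left out of this sketch.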
We can also analyze the discrete distribution of the scores in each dictionary, as seen in Fig. 7. Both AFINN and SentiStrength show strong bimodality in the positive and negative selected terms, whereas SCL and SemEval tend to work with more equalized histograms for positive and negative terms. In all cases, the tails (corresponding to score values of ±5) are clearly under-represented, though in the equalized histograms they are notably more present. It is also interesting to note the preponderance of negative terms in the non-equalized dictionaries.
Additional information can be seen when representing the difference of the scores of the terms shared by pairs of dictionaries, as exhibited in Fig. 8, where each panel shows the non-ordered words shared by a pair of dictionaries on the horizontal axis, and the score difference for each term on the vertical axis. The order of the pairs of dictionaries follows the super-diagonal in Table 2, read by rows. It can be noted that many shared terms do not have a strongly similar score in the rank. Moreover, the visible spikes in these plots (both in the positive and negative directions) often indicate shared terms that are even scored with opposite sign in different dictionaries.
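The per-term score differences and the sign disagreements mentioned above could be computed as in the sketch below. It assumes that both lexicons are already expressed on a common scale; the entries are made up for illustration and do not come from any of the dictionaries studied here.

```python
def score_differences(lex_a, lex_b):
    """Score difference (lex_a minus lex_b) for every term present in both."""
    return {t: lex_a[t] - lex_b[t] for t in lex_a.keys() & lex_b.keys()}

def opposite_sign_terms(lex_a, lex_b):
    """Shared terms scored positive in one lexicon and negative in the other."""
    shared = lex_a.keys() & lex_b.keys()
    return sorted(t for t in shared if lex_a[t] * lex_b[t] < 0)

# Made-up entries on a common -5..+5 scale.
lex_a = {"sick": -2, "fine": 2, "cool": 1, "cold": -1}
lex_b = {"sick": 3, "fine": 1, "cool": 1, "warm": 1}

diffs = score_differences(lex_a, lex_b)
flipped = opposite_sign_terms(lex_a, lex_b)
print(diffs)
print(flipped)
```

A large absolute difference does not always imply a sign flip, so listing the opposite-sign terms separately helps distinguish disagreement in intensity from disagreement in polarity.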
With all these descriptive statistics of the dictionaries, their scores, and their shared and different information, we now analyze some of the statistical and temporal dynamics as captured by these dictionaries. Figure 9 shows several panels for H(j), the MI, and the mean autocorrelation of representative dictionaries. We checked that the H profiles were strongly similar for all dictionaries, with a somewhat more patent increase in the range for SentiStrength, so that this increase in the range could be explained by the inclusion of many more words than in the others, rather than by equalizing the positive and negative scores. With respect to MI, it can be observed that SemEval behaves differently from all the others, which turn out to share a similar profile; the difference consists of an increase in the information shared by different users, which is more clearly retrieved with SemEval. Also, SemEval is the only one with a trend to yield some more persistence in the mean-score autocorrelation functions for several of the users. In all cases, the autocorrelations of the process itself were strongly similar among all dictionaries, and mostly dominated by the seasonal components, as seen in the previous experiments. The criteria for selecting the terms in SemEval are probably the key to the smallest dictionary yielding some more information in terms of MI and of temporal dynamics. However, it seems clear that the tasks performed on the different dictionaries (such as increasing the number of terms or histogram equalization in scores) do not have much impact on the statistical and temporal dynamics, and resolution and sensitivity, respectively, could be further pursued in future dictionaries.

D. ANALYSIS ON DIFFERENT NATURE GROUPS
Experiment IV compares users from different environments, allowing us to compare the previously defined statistics and features across groups of different nature. For that purpose, we added to the previously evaluated set, the academic entities, five other industries, namely: (i) Financial Institutions, (ii) Communication Media Companies, (iii) Singers, (iv) Technological and Internet Firms, and (v) Politicians. Each group comprises ten of the most recognizable usernames, to guarantee the necessary minimum amount of messages for statistical significance. The AFINN dictionary was considered, and the analysis was performed without including retweets. The results are presented in Figures 10, 11, 12, and 13. As seen previously for the academic industry, the autocorrelation of the sentiment score intensity shows a clear repetitive daily profile (see Fig. 11), and this profile is again shared across industries. The only exception to this pattern is the singers industry, where this outline is not visible at all, either in a consolidated approach or at the individual user level. This pattern seems to be very singular to this particular industry, and it will require special attention in the discussion. A different reality is represented in the autocorrelation of the standard deviation of the sentiment score, where the periodic effect detected in the case of universities seems to be replicated only in the case of financial institutions, but not in the rest of the analyzed industries (see Fig. 12). The autocorrelation of the mean of the scores is not represented, as no pattern different from that described in Experiments I and II was found for any industry.
On the other hand, the comparative MI analysis is shown in Fig. 13, which did not offer relevant differences from the previous results for the case of universities. In general terms, MI appears to be industry specific, or more precisely user specific, as no replicable pattern can be observed in this comparison among industries. However, the case of @IBM, @Dell, and @Huawei in the Technology sector requires special attention, as they showed relevant MI among peers. A similar situation appears in Politics, with four relevant users leading the MI, namely, @mbachelet, @GeorgeHWBush, @David_Cameron, and @eucopresident. Additional discussion and further analysis are required for an adequate interpretation and feature extraction, as in certain industries high values among users could imply significance in terms of cross-relation and sentiment contagion among those users.
The score histograms shown in the first column of panels in Figure 14 require special attention. This consolidated view of the evoked sentiment shows the presence of the neutral mode at zero, which is extremely prominent in all industries and users. Industries behave differently when it comes to lateral modes. Although all industries apparently present at least a slightly higher weight of the positive mode, a deeper analysis showed that this is not so in certain cases. Visual inspection of the consolidated histograms offers two separate patterns. The first pattern, especially represented by Academic Institutions and Singers, shows a clear positive mode that appears higher than its negative counterpart, which is in some cases almost non-existent. In the second pattern, both sides are more symmetric, with some individual exceptions.
It is also relevant to mention that, in specific industries, wider lobes in the positive branches are clearly present. These situations are less visible in a first visual inspection, but they might be of much higher significance in a mathematical evaluation. In an attempt to make them visible in this experiment, we evaluated the relation between the weighted positive and negative legs of the histogram for each industry and user, excluding the baseline mode at zero. To do so, we created a new index by dividing the compounded positive side of the histogram by the compounded negative side, excluding zero. For better representation, ten times the decimal logarithm of these values is plotted, converting this non-dimensional ratio into logarithmic units (dB) and highlighting especially the values close to the origin. From a mathematical perspective, we define the histogram of the sentiment score as a probability density function, discretized according to the different possible values of the sentiment in our model, and denote it by p(ρ), where ρ ranges over all possible score values, for each entity or user f under study. We can now define the Compounded Aggregated Positivity Index, or CAPI, as ten times the decimal logarithm of the ratio between the compounded aggregated positive and negative sentiment, where CAPI_f stands for the index of user f (in dB units), ρ ranges over all possible score values, and p(ρ) is the value of the histogram of the sentiment score at ρ, in other words the number of messages of user f whose sentiment score was evaluated as ρ. Figure 15 shows the computed CAPI for all users and industries. According to the results, CAPI values are strongly industry dependent. Academic users tend to be similar across the sector, which is expressed by a lower standard deviation of the values obtained.
A positive average value of this index is generally found (close to 6 dB). The second set, although with quite relevant positive values, presented a much wider variability of this index. This characteristic of Financial Institutions is shared by Singers and Technological Entities. The industry with the highest CAPI is the Singers, with an average close to 7 dB. In absolute terms, the three users with the largest index also belong to this industry, almost doubling the average of all sectors. We can mention Lady Gaga, close to 12 dB, Justin Bieber, with about 10 dB, and Rihanna, with about 9 dB. It should be noted at this point that two sectors show relevant negative values of CAPI, namely the Media and the Politicians. The first of these two groups has mostly negative values over the sample, where only two users (The Sun and ITV News) have a slightly greater weight of positive sentiments than of negative ones. The second group, the Politicians, presents a more balanced model between positive and negative, where the negative weight exceeds the positive in only three out of ten cases. The largest negative values of the index in absolute terms are reached in this industry. We should highlight the very wide dynamic range of CAPI values in this industry, where one of the users exceeds by almost three times the next one, Emmanuel Macron. Special analysis and discussion are required to evaluate this situation.
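As a rough illustration of the index discussed above, the sketch below computes a CAPI value in dB from a histogram of counts per score, ignoring the zero bin. Whether each bin is weighted by its score magnitude is not settled by the description, so both readings are offered through a hypothetical `weighted` flag; the histogram is made up and this should not be taken as the exact formula used in this work.

```python
import math

def capi(hist, weighted=True):
    """CAPI in dB from a {score: count} histogram, ignoring the zero-score bin.

    weighted=True multiplies each count by |score| (an assumed reading of
    "compounded"); weighted=False compares plain message counts.
    """
    weight = abs if weighted else (lambda s: 1)
    pos = sum(weight(s) * c for s, c in hist.items() if s > 0)
    neg = sum(weight(s) * c for s, c in hist.items() if s < 0)
    if pos == 0 or neg == 0:
        raise ValueError("CAPI is undefined when one side of the histogram is empty")
    return 10 * math.log10(pos / neg)

# Made-up histogram: number of messages per sentiment score for one user.
hist = {-2: 5, -1: 10, 0: 200, 1: 40, 3: 20}

capi_weighted = capi(hist)                 # ratio (40*1 + 20*3) / (10*1 + 5*2) = 5
capi_counts = capi(hist, weighted=False)   # ratio (40 + 20) / (10 + 5) = 4
print(round(capi_weighted, 2), round(capi_counts, 2))
```

On this scale 0 dB means balanced positive and negative sides, about 3 dB means the positive side is twice the negative one, and negative values mean that the negative side dominates.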
The key findings of the experiments carried out in this section can be summarized as follows. First, the comparison between the analyses with and without retweets has shown that retweets in some cases slightly emphasize existing modes but, on the contrary, hinder the view of some dynamic patterns of interest. Second, the comparison of diverse dictionaries has not shown differences significant enough to justify the use of dictionaries with a greater or lesser number of words. Third, the comparative analysis of the different industries has revealed wide differences in their statistical characterization with the different tools. The next section elaborates on and discusses these results in detail, arguing on possible justifications for them.

V. DISCUSSION
The main objective of this work has been to evaluate the possibility of characterizing and modeling the sentiment aroused on social media, using the Twitter community as an example. The detailed analysis of this reality can offer institutions, companies, entities, or users themselves a valuable tool to know the effectiveness of their communication, not only through digital media but also by any other means, since the sentiment collected could be considered as a consolidation of all aggregated inputs. The aim is then to evaluate the subjective and subconscious consolidated emotional attitude awakened by a user, brand, entity, company, or institution, as expressed directly or indirectly by the interactions on the social media under evaluation through the words used.
We first analyzed ten selected Academic Entities, considering direct tweets both with and without retweets. In this benchmarking analysis, no relevant information according to our analysis was missing when retweets were not included. Moreover, this second scenario showed that new patterns arose that were not visible when the retweets were present. Hence, we argue that retweets do not add statistically relevant information (apart from the volume of total messages), and they can even statistically hide relevant patterns, which were visible once retweets were removed from the sample base.
A more detailed view showed, for the histogram of scores excluding the zero modes, that the presence of retweets strengthened the peak levels of the existing modes, whether they were positively or negatively scored. This could be interpreted as retweets reinforcing the sentiment bias of non-baseline modes without adding information as far as visual patterns are concerned. In this very same analysis, a reduction in entropy was found for some individual users when retweets were considered. Taking into account that this is not a general behavior, possible explanations for this reduction include, to our knowledge, the existence of automatic bots, or a relevant number of users just replicating the original tweet without adding any personal bias, among others. As a consequence, and prior to further analysis of this reality, we could argue that this effect cannot a priori be considered a value-building element for the model.
Additionally, since the signal extracted as the consolidation of a potentially infinite number of fully independent sentiments is expected to behave as a purely random stochastic process, large values of entropy should not necessarily be judged negative. In summary, and in our view, the joint effect of a negligible increment in the amplitude of the sentiment modes in the histograms, the hindering of certain periodic dynamics, and the possibility of a non-positive sentiment bias suggests not including retweets in the sentiment analysis when scrutinizing the dynamics.
A relevant number of different lexicons, in terms of size, methodology, or even sentiment scoring structure, are present in the literature. In this paper we evaluated and benchmarked only five of them over the Academic Institutions. The tasks performed on the different dictionaries (such as increasing the number of terms or histogram equalization in scores) do not seem to have much impact on the statistical and temporal dynamics, and resolution and sensitivity should be further pursued in future dictionaries. In this setting, we can argue that the use of small databases in terms of number of words, although convenient from a computational standpoint, might not be advisable, given that their generalization capabilities could also be strongly limited. So, for the analysis across industries we proposed AFINN, a dictionary keeping some balance between computational efficiency and generalization expectations.
Relevant findings were obtained in our cross-industry evaluation. Specifically, daily circadian behavior was found not only in terms of intensity (number of tweets), but also, and even more clearly, in the standard deviation of the score. This suggests that there is a daily pattern in when tweets are released, and that the sentiment evoked, although it does not follow this daily behavior in terms of the mean, clearly does so in terms of variability. In other words, although the time of day shows no relation with which sentiment people express, it does with how dispersed that sentiment is. The general circadian cycles of intensity can largely be explained in two ways: first, by the normal circadian cycle of people (not tweeting during the night and doing so essentially at certain moments of the day); second, by the professional activity of official community managers and communications departments. A different and more interesting result is the existence of the circadian cycle in the standard deviation of the score, which requires a much deeper analysis to determine whether any sociological perspective could underlie this effect, showing that people are more predisposed at certain times of the day to maximize their subjective assessments (feelings). Another possible argument is that this effect reflects the fact that the communication departments of the academic entities generate tweets, positive ones of course, and always at the same time of day. Although this would be consistent with the pattern in the standard deviation, it would not be consistent with the fact that this pattern is not reflected in the autocorrelation of the mean.
Considering the relevant multimodal effects present in the sentiment score histograms, and given the particularity of each one of them, both across industries and among users in the same group, in this paper we propose a new indicator, a compounded aggregated sentiment index (referred to in the paper as CAPI), which summarizes all effects expressed by the different observers about certain users. This index revealed interesting results that were not visible in the visual inspection of histograms and that turned out to be characteristic of some industries. In particular, the negative values and generalized results of this index stand out for Communication Media and Politicians, whereas Universities and Singers stand out on the positive side. Also, at the individual user level, some users showed index values that stand out from those of their peers.

VI. CONCLUSION
In conclusion, we consider that it is possible to create indices that allow evaluating the sentiment generated by a brand, entity, or individual person, making use of systems like Twitter and the information they provide, through sentiment statistics and dynamics as presented here. On the other hand, considering the difficulties found in justifying some findings, we understand that there is much room to continue improving the analytical capacity of this type of technique by applying new and more sophisticated processing and semantic analysis, as well as by expanding the sample base in terms of the number of users incorporated in the study, and of course by widening the temporal scope.
He is an experienced professional who has devoted his career to the development of software-based projects and services enjoyed by millions. He was a member of the teams that created the first massive online education portal in Spain, the first online bookstore and the first free Internet service provider in the same country. After spending several years at Spanish branches of the France Télécom Group, he co-founded LateNiteSoft, in 2008, to create mobile applications for iOS. Their Camera+ app brings advances in digital image processing and computational photography to the public and has been downloaded more than 12 million times. He has been an Assistant Professor with Universidad Rey Juan Carlos, since 2017.
SERGIO MUÑOZ-ROMERO received the B.Sc. degree in telecommunication engineering and the Ph.D. degree in machine learning from the Universidad Carlos III de Madrid. He has led pioneering projects, where the machine learning knowledge was successfully used to solve real Big Data problems. Since 2015, he has been the Head of Data Science and Big Data with Persei vivarium. He is currently a Researcher with the Universidad Rey Juan Carlos. His current research interests are centered in explainable machine learning algorithms, and statistical learning theory and their applications to big data.
He is a Professor with the Department of Signal Theory and Communications, Universidad Rey Juan Carlos, Spain. He has coauthored more than 130 international articles and has contributed to more than 180 conference proceedings. His research interests focus on statistical learning methods for signal and image processing, arrhythmia mechanisms, robust signal processing methods for cardiac repolarization, and Doppler image post-processing.
VOLUME 8, 2020