Detection of financial opportunities in micro-blogging data with a stacked classification system

Micro-blogging sources such as the Twitter social network provide valuable real-time data for market prediction models. Investors’ opinions in this network follow the fluctuations of the stock markets and often include educated speculations on market opportunities that may have impact on the actions of other investors. In view of this, we propose a novel system to detect positive predictions in tweets, a type of financial emotions which we term “opportunities” that are akin to “anticipation” in Plutchik’s theory. Specifically, we seek a high detection precision to present a financial operator a substantial amount of such tweets while differentiating them from the rest of financial emotions in our system. We achieve it with a three-layer stacked Machine Learning classification system with sophisticated features that result from applying Natural Language Processing techniques to extract valuable linguistic information. Experimental results on a dataset that has been manually annotated with financial emotion and ticker occurrence tags demonstrate that our system yields satisfactory and competitive performance in financial opportunity detection, with precision values up to 83 %. This promising outcome endorses the usability of our system to support investors’ decision making.


A. MOTIVATION
Among the many factors that compel people to take decisions in stock markets, opinions on micro-blogging sites deserve consideration [1]-[3]. It has been shown that fast-paced information in social media is valuable for sales prediction [4]-[6] and has a strong influence on micro-economic trends in stock markets [7]. Investors' decisions can be affected not only by media content [1] but also by public mood [8]. Besides, the advent of powerful social trading platforms, such as eToro and XTB online trading, has broadened the spectrum of input data. There already exist online tools to analyse these sources. Two examples are Thomson Reuters Eikon, which performs sentiment analysis (SA) of social media content, and the Bloomberg platform, which provides social media indicators.
Furthermore, [9] argued that, even though stock markets contain enough information to define their state, this information becomes obsolete before the general public can process it to take decisions. Therefore, instantaneous information in the Twitter micro-blogging network deserves attention [10]-[12], in particular in finance [7], [13].
To the best of our knowledge, this work is the first proposal based on emotion analysis (EA) techniques to detect what we call financial opportunities, that is, Twitter posts that speculate or reason that the value of particular assets will grow, rather than merely expressing positive opinions about those assets. Specifically, we present a three-layer stacked classifier with a first stage to discard neutral entries, a second stage to distinguish generic positive emotions from negative ones and a last stage to differentiate opportunity entries (our ultimate goal) from any other positive emotions.
For this purpose, the system is enhanced with sophisticated linguistic information features such as n-gram sequences, emotion and polarity dictionaries, and frequency counters of hashtags, numerical information and percentages.

B. RESEARCH GOAL AND CONTRIBUTIONS
We seek a Machine Learning system to detect financial opportunities in textual information extracted from tweets, where by opportunities we refer to positive user speculations and forecasts (e.g. "the average yield of Google is 15.5%") instead of mere positive statements (e.g. "I like Google"). They are akin to looking forward positively to upcoming events, as "anticipation" in Plutchik's theory [24]. This problem has not been previously considered in academic research and has remarkable practical applications for both stock market screening and decision making. Notwithstanding this applied perspective, our research also has a theoretical goal: an enhanced understanding of the linguistic dimension of micro-blogging comments to extract investment opportunities. Summing up, this work contributes to more effective Machine Learning classifiers for stock market data mining with the following:
• A three-layer stacked Machine Learning classification system for detecting opportunities in micro-blogging comments.
• An analysis of the most relevant features through NLP techniques.
In addition, we employ a new dataset of investors' comments for stock market forecasting with 6,000 entries (similar in size to those in other state-of-the-art studies [25]-[28]), which has been manually annotated with financial emotion and ticker occurrence tags (e.g. in "the average yield of Google is 15.5%" the ticker tag is GOOGL) by experts in the field.
The rest of this article is organised as follows. Section II reviews related work in knowledge extraction, discussing EA from general and financial perspectives. Section III describes the classification problem and our solution. Section IV presents the experimental text corpus and the numerical tests that validate our approach. Finally, Section V concludes the paper.

II. RELATED WORK
Novice investors in stock markets give high consideration to the experience of financial experts. Thus, automatic knowledge extraction from the comments by experts in financial news, blogs and social media is an interesting research goal and a valuable asset for practical applications [29], [30].
Previous researchers have analysed online sources to characterise stock markets. Most of them have focused on news, such as [31], who studied the correlation between financial market events and financial news content. In this regard, the work by [32] deserves a special mention. Using Machine Learning models of Latent Dirichlet Allocation, they concluded that the information extracted from news sources is better than price variations for predicting assets' volatility. Additionally, in [33] a Deep Learning (DL) model for news categorisation was presented, proving that financial sources of information have a significant effect on investments. There is also work on knowledge extraction from financial blogs. For example, the method by [34] combined stock indexes with sentiment time series from micro-blogs. Besides, [29] applied multivariate regression to financial blogs to study how the markets react to the posts. Finally, the study of the information posted on social media platforms like Twitter has been a relevant research topic in recent years. Consider for example the analysis of consumer profiles by [35] and the study of the evolution of stock prices and social media content by [36] based on a latent space model [37]. However, none of these works has considered user speculations about the evolution of the assets. Note that all of them have taken temporal information directly from the timestamps of financial forecasts or posts [29], [32], [34], [36]. In fact, we are not aware of any financial research supported by temporal analysis at discursive level such as ours.
In this context, NLP techniques [40] such as SA [41] and EA [42], [43] have successfully extracted knowledge from comments of stock market experts.Specifically, in [42] the authors presented a novel approach for the automatic extraction of a high-coverage and high-precision lexicon annotated with emotion scores, while in [43] they confirmed the impact of social media emotions on the stock market.
Often as an initial stage of EA, SA infers sentiment polarity from language features of user opinions [41], [44]. Typically, three polarity levels (negative, neutral and positive) are considered [45], although some authors add two deeper positive and negative sentiment levels [46], [47]. Among the wealth of research in Twitter mining, we can cite for example the work by [48] on the analysis of consumer opinions about commercial brands, or by [49] on financial SA. The latter classified Twitter information about eight stock markets with a Support Vector Classifier (SVC).
EA infers feelings from linguistic information in user comments [50], [51]. It has become a central topic in the field of Affective Computing. Some typical emotion classes in the literature are love, joy, anger, sadness, fear and surprise, as defined by [52]. For example, the approach by [53] detected affection in virtual communication environments. EA has numerous applications in education [54], healthcare [55], [56] and gaming [57], to cite some. It has also caught the attention of the industry as a profitable asset. In the particular case of finance, [58] proposed a Linked Data approach for modelling sentiment and emotions from tweets about securities of the Spanish stock market. The system for financial forecasting by [22] incorporated different sources of information into an integrated river model and combined GPOMS with a Neural Network classifier, and [23] applied a multi-layer perceptron to study the influence of investor emotions in investment decisions. In the end, even though EA is still an interesting research topic [59], [60], previous analyses have focused exclusively on general human emotions and have not taken into consideration finance-related emotions such as opportunity or anticipation.
Training methodologies, on the other hand, can be divided into supervised (manually annotated), semi-supervised and unsupervised approaches. For example, in [61] some authors of this paper proposed an unsupervised system for automatic generation of emotion lexica with polarity data. Hybrid solutions such as [62] (which applied classifiers as well as unsupervised approaches with polarity lexicons and syntactic structures) have produced highly competitive results. Finally, stacking strategies combining different Machine Learning techniques into the same predictive model can improve performance [63], [64]. It has been our intent to exploit the benefits of stacking in ensemble learning techniques to further increase the performance of our system.

III. SYSTEM ARCHITECTURE
In this section we present our novel three-layer stacked system for detecting financial tweets on market opportunities. Specifically, the first stage detects neutral entries, the second stage distinguishes between general positive and negative emotions and the last stage separates opportunity entries from any other positive statements. Figure 1 illustrates its framework, which consists of two modules that will be described in detail in the next subsections, where financial tweets are taken from a training database. Table 1 shows examples of input data and the corresponding emotion categories (including "opportunity", which we have already described). In the context of Plutchik's theory [24], "positive statement" corresponds to any positive emotion on present events such as "expectation"; and "negative awareness" to any negative opinion about an asset, regardless of the sentiment expressed in the tweet, which is similar to "disapproval".

A. DATA PROCESSING MODULE
This module discards useless data. Specifically, we apply a series of text transformations to ensure the quality of the data that enter the classification module. Table 2 presents some examples before and after applying these transformations.
• Filtering. It selects finance-related tweets while discarding spam and related tweets. Basically, it keeps tweets containing asset identifiers such as stock markets' tickers, URLs and quantitative information. We also remove advertising entries with spam-related words and expressions such as sorteo 'draw', gratis 'free' and ¿quieres ganar dinero? 'do you want to earn money?'. Moreover, we discard words in languages other than Spanish using the Enchant Python module. Finally, to discard highly similar tweets, we implemented a similarity detector based on the Jaccard distance with a threshold of 0.75%, inspired by state-of-the-art spam detection techniques [65]-[67].
• Hashtag and mention splitting. We decompose hashtags into words with a splitter that uses our own lexica [68], [69] and the Spanish frequency reference corpus (CREA) by Real Academia Española de la Lengua. As shown in the example in Table 2, the splitter divides the term acuerdocomercial successfully as acuerdo comercial 'commercial agreement'.
• Spelling correction. We correct words with spelling mistakes by replacing them by the most likely candidate through our own word algorithm that uses CREA as well as the Enchant Python module. In the example in Table 2, the word precausion is corrected and replaced by precaución 'caution'.
• Stock market asset, mention and hashtag detection and removal. We use regular expressions to detect capitalised letters and identify representative symbols such as $, @ and #. Once identified, all these elements are removed from the text of the entries.
• Stop words removal. Meaningless words such as determiners and prepositions are removed from the text. We also remove URLs and retweet (RT) tags. We consider days of the week and months of the year as stop words. However, we keep elements such as no 'not', sí 'yes', muy 'very' and poco 'few', since they help to interpret the evolution of the assets.
• Quantitative data and laughter replacement. We substitute quantitative data (numbers and percentages) by tags representing their sign, that is, '+' and '-' for positive and negative figures, respectively. We decided to use these symbolic representations rather than the amounts themselves because they are more clearly related to stock markets' upward and downward movements. Laughing onomatopoeic expressions such as haha and hehe, which are typically composed of the characters j or h interspersed with the vowels a, e and i in any quantity, are replaced by the LAUGH tag.
• Text lemmatisation. The content of the tweets is split into tokens (words), which are independently checked in the Hunspell dictionary. Finally, the words are lemmatised using the Freeling NLP tool.

TABLE 1 (continued): [...] Negative awareness. Ya están los resultados de $NFLX ??? 'Are $NFLX results available ???' Neutral.
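Some of the transformations in this module (near-duplicate filtering, sign tags for figures and the LAUGH tag) can be sketched in a few lines of Python. This is only an illustrative sketch: the regular expressions, the function names and the 0.75 similarity cut-off are assumptions made for the example, not the expressions or settings actually used in our system.

```python
import re

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two tweets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def drop_near_duplicates(tweets, threshold=0.75):
    """Keep a tweet only if it is not too similar to an already kept one.

    The threshold value is an assumption for this sketch.
    """
    kept = []
    for tweet in tweets:
        if all(jaccard_similarity(tweet, k) < threshold for k in kept):
            kept.append(tweet)
    return kept

# Hypothetical patterns. The laughter pattern is deliberately loose and may
# also match ordinary words (e.g. "hija"); a real system would need a
# dictionary check on top of it.
NUMBER_RE = re.compile(r"(?<!\w)([+-]?)\d+(?:[.,]\d+)*%?")
LAUGH_RE = re.compile(r"\b(?:[jh]+[aei]+){2,}[jh]*\b", re.IGNORECASE)

def replace_quantities(text: str) -> str:
    """Replace every number or percentage by '+' or '-' according to its sign."""
    return NUMBER_RE.sub(lambda m: "-" if m.group(1) == "-" else "+", text)

def replace_laughter(text: str) -> str:
    """Replace laughing onomatopoeias such as 'jajaja' or 'hehe' by LAUGH."""
    return LAUGH_RE.sub("LAUGH", text)
```

For instance, `replace_quantities` maps both unsigned and explicitly positive figures to '+', in line with the idea that only the direction of the movement is kept.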

1) Machine Learning model: feature processing
The features we selected to extract valuable information from our dataset and build our model are:
• Char-grams, word tokens and word-grams. Char-grams are sequences of adjacent characters in a text (note that spaces must also be considered in this case). Word tokens represent character grams only from the text inside word boundaries, whereas word n-grams are sequences of adjacent words in a text, among which we also include the textual representation of the emojis in the tweets. To generate char n-grams, word tokens and word n-grams we combined CountVectorizer with GridSearchCV, both from the Scikit-Learn Python library, using the parameters in Listing 1. GridSearchCV performs an exhaustive search over specified parameter ranges for an estimator. As a result, we obtained the optimal parameters max_df = 0.5, min_df = 0.001 and ngram_range = (1,7) in our scenario.
• Frequency counters. These features count the number of hashtags, negative and positive amounts and percentages, exclamation marks, interrogation marks and adverbs in the content of the tweets.
• Dictionary features. We use our polarity and emoji lexica [...]
TABLE 2: Examples of tweets before and after processing.
LISTING 1: Configuration for the generation of n-grams.
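As a rough illustration of how CountVectorizer and GridSearchCV can be combined to tune n-gram generation, consider the following sketch. The parameter grid, the classifier, the scoring function and the toy texts are assumptions chosen for the example; only the parameter names max_df, min_df and ngram_range and the reported optimum come from the text above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data invented for this sketch (1 = opportunity, 0 = rest).
texts = [
    "expect shares to rise 10% next week",
    "target price raised, strong upside ahead",
    "earnings will beat forecasts this quarter",
    "stock should climb after the merger",
    "are the results out yet",
    "market closed today for holiday",
    "company publishes annual report",
    "trading volume was average",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    # "char_wb" builds character n-grams only inside word boundaries.
    ("ngrams", CountVectorizer(analyzer="char_wb")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Hypothetical search ranges; they include the optimum reported in the text.
param_grid = {
    "ngrams__max_df": [0.5, 1.0],
    "ngrams__min_df": [0.001, 1],
    "ngrams__ngram_range": [(1, 3), (1, 7)],
}

# Exhaustive search over the grid, scored by macro-averaged precision.
search = GridSearchCV(pipeline, param_grid, scoring="precision_macro", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Wrapping the vectoriser and the classifier in a single pipeline lets the search refit the n-gram vocabulary inside each cross-validation fold, which avoids leaking vocabulary statistics from the validation split.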
2) Emotion analysis: three-layer stacked system
Figure 2 shows the flow diagram of the stacked system. It is composed of three stages: a first stage that distinguishes between neutral and non-neutral entries, a second stage to distinguish between positive and negative emotions and a last stage to extract opportunities from positive statements. In each of these stages, we applied a decision depth threshold to maximise precision in opportunity detection, because in financial analysis a high precision in key categories is preferable to obtaining a large amount of positives [70].
14 Available at https://www.gti.uvigo.es/index.php/en/resources/8-lexicon-of-polarity-and-list-of-emojis-by-polarity-and-emotion-for-application-in-the-financial-field, August 2020.
15 Available at http://www.cic.ipn.mx/~sidorov/SEL.txt, August 2020.
The system includes implementations of Gradient Descent (GD), Decision Tree (DT), Random Forest (RF) and Support Vector Classification (SVC) algorithms from the Scikit-Learn Python library. Note that the size of our training datasets did not justify the application of DL techniques [71], [72]. In this regard, works such as [73], [74] are representative state-of-the-art examples of DL techniques for EA where the datasets are much larger than ours.
Accordingly, we must characterise the level of confidence (depth level) of emotion prediction precision. This can be formalised with a non-parametric approach based on isotonic regression [75]. This approach fits data with a non-decreasing function defining a partial or total ordering, by minimising quadratic expression (1) subject to (2).

$$\min_{\hat{y}} \sum_{i} \alpha_i \,(y_i - \hat{y}_i)^2 \tag{1}$$

$$\hat{y}_i \le \hat{y}_j \quad \text{whenever entry } i \text{ precedes entry } j \text{ in the ordering} \tag{2}$$

In expression (1), each variable $\alpha_i$ is a strictly positive weight (1 by default), $y_i \in \mathbb{N}$ is the true category of the $i$th tweet and $\hat{y}_i \in \mathbb{N}$ its predicted category (so that $y_i$ and $\hat{y}_i$ are taken to a numerical space that allows defining a partial or total ordering). The level of confidence in the prediction depends on the probabilities of correct class assignment. Logically, it is possible to select subsets of vectors in the training set for which such probabilities are higher, and thus to trade accuracy for precision.
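To make the isotonic fit more concrete, the sketch below solves the weighted least-squares problem under a total ordering with the classical pool-adjacent-violators procedure. It is a minimal pure-Python illustration, assuming that the inputs are already sorted by the ordering (for instance, by classifier score); the function name is hypothetical, and in practice a library implementation such as Scikit-Learn's IsotonicRegression could be used instead.

```python
def isotonic_fit(y, weights=None):
    """Weighted least-squares non-decreasing fit (pool adjacent violators).

    `y` must already be sorted according to the ordering that the fitted
    values have to respect; `weights` are strictly positive (1 by default).
    """
    if weights is None:
        weights = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for value, weight in zip(y, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the non-decreasing constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, w2, n2 = blocks.pop()
            mean1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(mean1 * w1 + mean2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

When `y` holds 0/1 labels sorted by increasing classifier score, the fitted values can be read as calibrated probabilities of the positive class, which is what a confidence threshold is then applied to.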

IV. RESULTS AND DISCUSSION
All the experiments were executed on a computer with the following specifications: [...]
TxStockData S.L., a fintech company, provided us with a dataset composed of 6,000 tweets (similar in size to those in other state-of-the-art studies [25]-[28]). The dataset was gathered from 14th May 2019 to 3rd February 2020. Its entries were manually annotated with financial emotion tags by five experts in finance and NLP. In the end, the emotion tag of each entry in the dataset was computed by majority voting (in case of a tie, it was selected randomly). To ensure the quality of the data, we discarded repeated and spam entries, yielding approximately 5,000 valid tweets. Each entry has the following structure:
• ID: a unique numerical identifier.
• Tweet: original text of the tweet.
• Ticker: stock market assets mentioned in the text.
• Emotion: emotion label.
Table 4 shows the distribution of the entries in the experimental dataset by emotion category.
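The aggregation of the annotators' tags by majority voting with random tie-breaking can be illustrated as follows. This is a hypothetical helper written for this explanation, not the annotation tooling that was actually used; the label strings are placeholders.

```python
import random
from collections import Counter

def majority_label(annotations, rng=None):
    """Return the most voted label; ties are broken at random."""
    rng = rng or random.Random()
    counts = Counter(annotations)
    top = max(counts.values())
    # Sort the tied labels so that a seeded generator gives repeatable results.
    tied = sorted(label for label, count in counts.items() if count == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

Passing a seeded `random.Random` instance makes the tie-breaking reproducible across runs.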

B. DEFINITION OF PERFORMANCE METRICS
The practical goal of our system is to present to the user the content of the tweets that are marked as opportunities. Therefore, as previously said, we are interested in maximising precision for the target financial emotion opportunity. We defined two auxiliary tolerance metrics for those output decisions that do not correspond to true opportunities but will not necessarily discourage an operator, inspired by works like [70].
We explain these tolerance metrics with the confusion matrix in Table 5. Notation $X_Y$ represents the number of tweets of emotion category $X$ that are classified into category $Y$.
TABLE 5: Confusion matrix for tolerance calculations.
From this confusion matrix we define tolerances τ1 and τ2 as follows:

$$\tau_1 = \frac{P^+_{P^+} + S^+_{P^+}}{P^+_{P^+} + S^+_{P^+} + A^-_{P^+} + N_{P^+}}$$

$$\tau_2 = 1 - \frac{A^-_{P^+}}{P^+_{P^+} + S^+_{P^+} + A^-_{P^+} + N_{P^+}}$$

In both cases the denominator is the sum of all decisions marked as opportunities. Tolerance τ1 grows with $S^+_{P^+}$, that is, when the user is presented positive statements as opportunities. Strictly these decisions are errors, but they have minimal impact on user trust if $P^+_{P^+} \gg S^+_{P^+}$. Tolerance τ2 is the complement of the probability of presenting negative awareness results as opportunities, $A^-_{P^+}$, which would strongly discourage an operator ($N_{P^+}$ corresponds to neutral tweets from a financial perspective). Therefore, our declared goal is first and foremost a reasonably high precision in the detection of opportunities, for high values of τ1 and, secondarily, of τ2.
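The following sketch computes precision and both tolerances from the counts of the column of decisions marked as opportunities. It assumes that τ1 is the share of those decisions that are true opportunities or positive statements and that τ2 is one minus the share that are negative awareness entries; the function name, argument names and example counts are invented for illustration.

```python
def opportunity_metrics(opp, pos, neg, neu):
    """Precision, tau1 and tau2 for the decisions marked as opportunities.

    opp: true opportunities marked as opportunities
    pos: positive statements marked as opportunities
    neg: negative awareness entries marked as opportunities
    neu: neutral entries marked as opportunities
    """
    total = opp + pos + neg + neu
    if total == 0:
        raise ValueError("no entry was marked as an opportunity")
    return {
        "precision": opp / total,
        "tau1": (opp + pos) / total,
        "tau2": 1 - neg / total,
    }

# Made-up counts, only to show the relation between the three values.
metrics = opportunity_metrics(opp=60, pos=10, neg=5, neu=25)
print(metrics)
```

Under these definitions the gap between τ2 and τ1 equals the share of neutral entries among the decisions marked as opportunities, so a large gap points to neutral entries leaking into the output.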

C. RESULTS
In this section, we evaluate our system to detect financial opportunities. First we analyse the contributions of the features of the Machine Learning model (see Table 3) using a basic single-layer classifier of opportunities vs. rest of emotions (Section IV-C1). Then we present the precision and tolerance results of a two-layer (neutral vs. non-neutral / opportunities vs. rest of emotions) stacked classifier, without (Section IV-C2) and with (Section IV-C3) decision depth thresholds. Finally, we present our final three-layer stacked classifier (which adds a final opportunities vs. positive statements layer) with decision depth thresholds, culminating our incremental design (Section IV-C4). All results presented in this section were computed by applying 10-fold cross-validation.
Our model, with all characteristics, has ∼400,000 n-gram features. This is computationally intractable. Consequently, we applied an attribute selector to extract the most relevant features with a maximum decrease of 0.5% in precision. Specifically, we chose the SelectPercentile 18 method from the Scikit-Learn Python library, as it outperformed other alternatives (SelectFromModel, SelectKBest, and RFECV). This method selects features according to a highest score percentile. We set a χ² score function and an 80th percentile threshold. In the end, we kept ∼50,000 n-gram features.
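A feature selection step of this kind can be sketched with SelectPercentile and the chi2 score function from Scikit-Learn. In that API the `percentile` argument is the percentage of features to keep; the value used below, the n-gram range and the toy texts are assumptions for the example and do not reproduce the exact setting reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Toy data invented for this sketch (1 = opportunity, 0 = rest).
texts = [
    "expect shares to rise 10% next week",
    "target price raised, strong upside ahead",
    "earnings will beat forecasts this quarter",
    "stock should climb after the merger",
    "are the results out yet",
    "market closed today for holiday",
    "company publishes annual report",
    "trading volume was average",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Word uni-grams and bi-grams as non-negative count features (chi2 needs that).
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Keep the 20% of features with the highest chi-squared score (assumed value).
selector = SelectPercentile(score_func=chi2, percentile=20)
X_reduced = selector.fit_transform(X, labels)

print(X.shape[1], "->", X_reduced.shape[1])
```

The selector is fitted on the training labels, so in a cross-validated experiment it should be placed inside the pipeline rather than applied once to the whole dataset.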

1) Numerical test 1: relevance of the features
Firstly we evaluated the performance of a single-layer classifier that was trained to distinguish between opportunities and all the other financial emotions in Table 4. Table 6 shows average results for the target emotion opportunity with 10-fold cross-validation. We started with a basic feature set only containing char-grams, word tokens and word-grams. Adding all features in Section III-B1 was slightly advantageous for all classifiers (up to ∼4% of improvement) except for the SVC. Moreover, in light of the results, RF seemed the best classifier to detect opportunities, followed by the GD classifier.
We considered these initial results (precision, τ1 and τ2 up to 64.33%, 70.65% and 87.31%, respectively) promising and they motivated us to design more advanced approaches. The difference between τ1 and τ2, which ranged between 15% and 21%, indicated that a significant amount of neutral entries were considered as opportunities. This is why we decided to explore the benefits of a stacked classifier with an initial stage to filter neutral entries.
18 Available at https://scikit-learn.org/stable/modules/feature_selection.html, August 2020.
2) Numerical test 2: two-layer stacked classifier
In this test we evaluated a two-layer stacked classifier. The first stage was designed to distinguish between neutral and non-neutral entries and the second stage distinguished opportunities from the rest of financial emotions in Table 4. In both stages we used the same type of classifier but we applied independent hyperparameter optimisation in each stage. Table 7 shows the resulting precisions and tolerances.
If we compare this test with numerical test 1, there was no significant improvement with the GD and DT classifiers but the stacked approach produced a 5% performance boost for the other two. More specifically, the SVC and RF classifiers produced the best results for all metrics and both scenarios (basic set and all features, except for the precision of the SVC if compared with the GD classifier with all features). Furthermore, even though the GD and SVC classifiers behaved similarly with all features, the GD classifier is far more computationally intensive. DT precision results were considerably poor, around 40%. Thus we decided to discard GD and DT for opportunity detection and keep only SVC and RF in further analysis.
Ultimately, even though this test revealed further improvement in precision and tolerance with the two-layer stacked classifier, the highest precision (with the RF algorithm) was still under 75%. However, the tolerances against operator discouragement were moderately satisfactory (τ1 > 75% and τ2 > 85%).

3) Numerical test 3: two-layer stacked classifier with decision depth
As previously stated, in business and investment, it is preferable to obtain fewer positives for key categories with high precision. Take personalised investment advertising as an example. Banks must identify customer profiles with higher success probability, since personalised commercialisation actions are rather expensive. From a practical perspective, we set the goal of detecting at least 10% of all opportunity entries with high precision.
Accordingly, in this test we relied on the predict_proba method of the classifiers in the Scikit-Learn Python library to set class thresholds. It allows computing the probabilities of the possible outcomes (classes in our model), thus providing levels of confidence (see Section III-B2). When the probability of the most likely class for a vector exceeded the threshold of that class (which we term decision depth) the vector was assigned that class. Otherwise, the entry was considered neutral. This procedure was followed at both stages of numerical test 2, with the same depth value, after empirical tests.
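The effect of the decision depth on the trade-off between precision and the share of detected opportunities can be illustrated with a small helper. It assumes that the opportunity probabilities have already been obtained (for example, from a classifier's predict_proba output); the function name and the example values are invented for this sketch.

```python
def precision_recall_at_depth(proba_opportunity, is_opportunity, depth):
    """Precision and recall of opportunity detection for a given depth.

    An entry is marked as an opportunity only when its predicted
    probability exceeds the decision depth.
    """
    selected = [p > depth for p in proba_opportunity]
    true_pos = sum(1 for s, t in zip(selected, is_opportunity) if s and t)
    n_selected = sum(selected)
    n_true = sum(1 for t in is_opportunity if t)
    precision = true_pos / n_selected if n_selected else None
    recall = true_pos / n_true if n_true else None
    return precision, recall

# Made-up probabilities and labels, only to show the trade-off.
proba = [0.95, 0.90, 0.60, 0.55, 0.30]
truth = [True, True, False, True, False]
print(precision_recall_at_depth(proba, truth, depth=0.5))
print(precision_recall_at_depth(proba, truth, depth=0.8))
```

Raising the depth discards the less confident decisions, so fewer opportunities are detected but a larger fraction of the detected ones is correct.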
Table 8 shows the results of the two-layer stacked classifier with decision depth. Both classifiers achieved significant precision improvements. Besides, the RF classifier achieved a precision above 70% in all cases, with τ1 ∼ 80% and τ2 ∼ 90%. However, precision and τ1 were still not satisfactory (they were less than 75% and 90%, respectively). Moreover, the values of τ2 for the two classifiers with all features indicated that many negative awareness entries were still classified as opportunities, which was unacceptable.

4) Numerical test 4: three-layer stacked classifier with decision depth
All points considered, to meet our goals we finally applied the three-layer stacked classifier framework (see Section III-B2) with decision depth thresholds where the third stage distinguishes opportunities from positive statement entries.
We relied once again on the predict_proba method of the classifiers in the Scikit-Learn Python library to set the depth threshold. As in the previous numerical test we used the same classifier and depth threshold in all layers. Table 9 shows the results of this test. Note the significant improvements by the SVC and RF classifiers for all metrics. With the RF classifier we obtained precision values above 80%, τ1 above 90% and τ2 above 95%, meeting our initial requirements. Less than 5% of the negative awareness entries were misclassified as opportunities, probably due to high ambiguity and, thus, annotation errors.
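The three-layer cascade with a shared decision depth can be sketched as follows. The sketch assumes that each stage is any fitted object exposing `predict_proba` and `classes_`, as Scikit-Learn classifiers do; the class names ("non-neutral", "positive", "negative", "opportunity") and the mapping of low-confidence decisions to neutral are assumptions made for the illustration.

```python
NEUTRAL = "neutral"

def top_class(stage, x, depth):
    """Most likely class of a stage, or None if it does not exceed the depth."""
    proba = stage.predict_proba([x])[0]
    best = max(range(len(proba)), key=lambda i: proba[i])
    return stage.classes_[best] if proba[best] > depth else None

def classify(x, stage1, stage2, stage3, depth):
    """Run one entry through the three stacked stages.

    stage1: neutral vs. non-neutral
    stage2: positive vs. negative
    stage3: opportunity vs. other positive statements
    """
    if top_class(stage1, x, depth) != "non-neutral":
        return NEUTRAL
    polarity = top_class(stage2, x, depth)
    if polarity is None:
        return NEUTRAL
    if polarity == "negative":
        return "negative awareness"
    final = top_class(stage3, x, depth)
    return final if final is not None else NEUTRAL
```

Because an entry must exceed the depth at every stage to be labelled as an opportunity, each additional layer can only remove entries from the opportunity output, which is consistent with the aim of favouring precision over the amount of positives.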

D. EXPLOITING THE RESULTS
For the sake of clarity and a handy comparison of the different numerical tests, Table 10 shows the differences in precision, τ1 and τ2 between numerical tests 1 and 4. The improvement achieved with SVC and RF classifiers is noteworthy, supporting the benefits of a stacking approach for financial opportunity detection. The performance boost for RF, our final choice, was almost 20% for precision and τ1. In addition to meeting our performance goals, we succeeded in identifying as many opportunity entries with high precision as our application demanded, 9.27% and 10.43% on average with RF and SVC respectively, with 10-fold cross-validation. Moreover, Figure 3 represents the histogram of the mentions of assets in opportunity tweets. By examining the specific assets that were mentioned between 14 May 2019 and 3 February 2020 in detected opportunities in the testing sets in numerical experiment 4, we observed that those assets corresponded to ∼80% of the mentions of the most interesting stock assets (upper quartile of all mentions in opportunities in the testing subsets, marked in red in Figure 3; for clarity, only some representative well-known tickers are shown on the x axis).
Therefore, a possible way to exploit these results would be listing in a scroll section of the user dashboard the tweets we pick as opportunities with the three-layer stacked classifier. Then we could highlight the tweets mentioning assets that are detected following our approach with decision depth. Figure 4 shows an example with real tweets as classified by our system. A symbol is displayed next to the tweets that contain opportunities that are detected with high precision.

V. CONCLUSIONS
Motivated by the strong influence of social media in users' decisions in global markets, we propose a novel three-layer stacked system to detect financial opportunities in tweets. Our solution has been designed to extract such tweets with high precision to present their content on personal finance management dashboards. It exploits sophisticated linguistic features, such as polarity and emotion dictionaries and temporal identification by discursive analysis, in its ML model. It yields satisfactory and competitive market-level performance in financial opportunity detection.
Experimental results, including two ad-hoc tolerance metrics focusing on data quality from an operator perspective, demonstrate that our three-layer stacked system with decision probability threshold (decision depth) succeeds in achieving our goal. In particular, using an RF algorithm with all features in the model, the system attains ∼83% financial opportunity precision with detection tolerances τ1 ∼ 90% and τ2 ∼ 96%. Given the valuable up-to-date information in micro-blogging platforms, these promising results endorse the usability of our system to support investors' decision making.
As future work and to further expand the contributions of our research, the emotion classifiers will be improved with domain-specific tweet filters and quantitative (objective) numerical information from stock exchange prices. In addition to opportunities, this will allow analysing positive statement and negative awareness emotions in stock market reports. Moreover, extensions for a multilingual version of our framework to cover French and English will be provided.

FIGURE 4: Possible integration of our system in a mobile app.

FIGURE 1: Our proposed framework.

TABLE 1: Examples of dataset entries with emotion tags.
Tweet: Parece que el #IBEX35 va a hacer un buen cierre diario, aunque me gustaría que fuese por encima de los 8.600C. 'It seems that #IBEX35 is going to make a good daily close, although I would like it to break 8,600C.'
Emotion: Opportunity
TABLE 3 (excerpt):
NEG_AWARENESS_EMOJI: Amount of emojis that express negative awareness about assets
POS_STATEMENT_EMOJI: Amount of emojis that express positive statements about assets
OPPORTUNITY_EMOJI: Amount of emojis that express opportunities about assets
Temporal features:
PAST: Amount of verbs in past tense
PRESENT: Amount of verbs in present tense
FUTURE: Amount of verbs in future tense
CONDITIONAL: Amount of verbs in conditional tense

TABLE 4: Distribution of samples in the dataset by category.
TABLE 6: Precisions and tolerances for numerical test 1.
TABLE 7: Precisions and tolerances of the two-layer stacked classifier, numerical test 2.
TABLE 8: Precisions and tolerances of the two-layer stacked classifier with decision depth, numerical test 3.
TABLE 9: Precisions and tolerances of the three-layer stacked classifier with decision depth, numerical test 4.