Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers

Financial and economic news is continuously monitored by financial market participants. According to the efficient market hypothesis, all past information is reflected in stock prices and new information is instantaneously absorbed in determining future stock prices. Hence, prompt extraction of positive or negative sentiments from news is very important for investment decision-making by traders, portfolio managers and investors. Sentiment analysis models can provide an efficient method for extracting actionable signals from the news. However, financial sentiment analysis is challenging due to domain-specific language and unavailability of large labeled datasets. General sentiment analysis models are ineffective when applied to specific domains such as finance. To overcome these challenges, we design an evaluation platform which we use to assess the effectiveness and performance of various sentiment analysis approaches, based on combinations of text representation methods and machine-learning classifiers. We perform more than one hundred experiments using publicly available datasets, labeled by financial experts. We start the evaluation with specific lexicons for sentiment analysis in finance and gradually build the study to include word and sentence encoders, up to the latest available NLP transformers. The results show improved efficiency of contextual embeddings in sentiment analysis compared to lexicons and fixed word and sentence encoders, even when large datasets are not available. Furthermore, distilled versions of NLP transformers produce comparable results to their larger teacher models, which makes them suitable for use in production environments.


I. INTRODUCTION
The latest advances in Natural Language Processing (NLP) have received significant attention due to their efficiency in language modeling. These language models are finding applications in various industries as they provide powerful mechanisms for real-time, reliable, and semantic-oriented text analysis. Sentiment analysis is one of the NLP tasks that leverages language modeling advancements and is achieving improved results. According to the Oxford University Press dictionary, 1 sentiment analysis is defined as the process of computationally identifying and categorizing opinions expressed in a text, primarily to determine whether the writer's attitude towards a particular topic or product is positive, negative, or neutral. Sentiment analysis is becoming an essential tool for transforming emotions and attitudes into actionable information.
1 https://lexico.com
The associate editor coordinating the review of this manuscript and approving it for publication was K. C. Santosh.
Designing and building deep-learning-based sentiment analysis models require substantial datasets for training and testing. While there are several large, publicly available sentiment-annotated datasets, they are mostly related to products and movies. Many sentiment analysis models [1]- [4] use these datasets and achieve good performance in related domains. However, the application of these models in different domains is challenging because each domain has a unique set of words for emotion expression.
The financial domain is characterized by a unique vocabulary, which calls for domain-specific sentiment analysis. Prices observed in financial markets reflect all available information related to traded assets [5], hence new information allows stakeholders to make well-informed and timely decisions. The sentiments expressed in news and tweets influence stock prices and brand reputation, hence, constant measurement and tracking of these sentiments is becoming one of the most important activities for investors. Studies have used sentiment analysis based on financial news to forecast stock prices [6]- [8], foreign exchange and global financial market trends [9], [10] as well as to predict corporate earnings [11].
Given that the financial sector uses its own jargon, it is not suitable to apply generic sentiment analysis in finance because many of the words differ from their general meaning. For example, ''liability'' is generally a negative word, but in the financial domain it has a neutral meaning. The term ''share'' usually has a positive meaning, but in the financial domain, share represents a financial asset or a stock, which is a neutral word. Furthermore, ''bull'' is neutral in general, but in finance, it is strictly positive, while ''bear'' is neutral in general, but negative in finance. These examples emphasize the need for development of dedicated models, which will extract sentiments from financial texts.
Sentiment analysis in finance has become an important research topic, connecting quantitative and qualitative measures of financial performance. A seminal study by Loughran and McDonald [12] shows that word lists developed for other disciplines misclassify common words in financial texts. Hence, Loughran and McDonald created an expert-annotated lexicon of positive, negative, and neutral words in finance, which better reflects sentiments in financial texts. In [13], the authors introduce a Twitter-specific lexicon, which, in combination with the DAN2 machine learning approach, produces more accurate sentiment classification results than the support vector machine (SVM) approach using the same Twitter-specific lexicon.
Machine learning methods for sentiment extraction have been applied on datasets of tweets or news [14]- [18]. In [15], the authors use various machine-learning binary classifiers to obtain StockTwits tweets sentiments. They show that the SVM classifier is more accurate compared to Decision Trees and Naïve Bayes classifier. In [16], Atzeni et al. test the performance of various regression models in combination with statistical and semantic methods for feature extraction to predict a real-valued sentiment score in micro-blogs and news headlines, and show that semantic methods improve classification accuracy.
Researchers have used lexicon-based approaches in combination with machine-learning models. The authors in [18] show that such combinations are more efficient for sentiment extraction than using single models. However, regular machine-learning methods are unable to extract complex features and to keep the order of words in a sentence. These tasks require the use of deep-learning approaches, which allow for complex feature extraction, location identification, and order information [19].
Deep-learning methods [20] use a cascade of multiple layers of non-linear processing units for complex feature extraction and transformation. Each successive layer uses the output from the previous layer as input, thus extracting complex features which in many cases can be useful for generating learning patterns and relationships beyond immediate neighbors in the sequence. Many studies confirm the efficiency of deep-learning models, including recurrent neural network (RNN) [21], [22], convolutional neural networks [23]- [25] and attention mechanism [19], [26] in sentiment extraction in finance. The great success of deep-learning approaches in NLP is mainly due to the introduction and improvement of text representation methods, such as word [27]- [29] and sentence encoders [30]- [33]. These convert words/sentences into vector representation, making them suitable as input for neural networks. These representations keep the semantic information coded into words and sentences, which is crucial for sentiment extraction.
Recent developments in NLP, deep-learning, and transfer-learning have significantly improved the sentiment extraction from financial news and texts [17], [34]- [37]. In [35], Yang et al. incorporate inductive transfer-learning methods such as ULMFiT [38] for sentiment analysis in finance, and the results show improvements in sentiment classification compared to traditional transfer-learning approaches. The superior performance of recent NLP transformers, BERT and RoBERTa, in sentiment analysis is evaluated in [37], where the effectiveness of using the RoBERTa model is compared to dictionary-based models.
This paper aims to survey approaches to sentiment analysis, including combinations of machine-learning and deep-learning models with lexicon-based feature extraction methods and word and sentence encoders, up to the most recent NLP transformers. The goal is to apply these approaches to finance. We evaluate and compare model effectiveness when the models are trained under the same conditions and on the same dataset. The main contribution of this paper is the development of an evaluation platform, which we use to assess the performance of NLP methodologies for text feature extraction in finance.
We show that recent advances in deep-learning and transfer-learning methods in NLP increase the accuracy of sentiment analysis based on financial headlines. Moreover, our results indicate that lexicon-based approaches can be efficiently replaced by modern NLP transformers.
The rest of the paper is organized as follows. Section II provides an overview of NLP methods for text representation: lexicon-based and statistical, as well as word and sentence encoders. Section III presents NLP transformers, their architectures and objectives, as a separate group of deep-learning models for text classification, which we evaluate in extraction of finance text sentiments. Section IV describes the dataset that we created to evaluate text representation methods. Section V presents the evaluation platform that we build for measuring model performances. Section VI reports the results, and section VII concludes the paper and considers future applications.

A. LEXICON-BASED KNOWLEDGE EXTRACTION
Lexicon-based sentiment analysis methods rely on domain-specific knowledge represented as a lexicon or dictionary. The process of sentiment calculation is based on identifying and keeping words that hold useful information while removing words that are not related to sentiments in finance.
To infer the sentiment, we evaluate the Loughran-McDonald lexicon (a financial lexical rule-based tool) and the general-purpose Harvard IV-4 dictionary (general sentiment dictionary). We calculate the sentiment polarity using the Lydia system [42]. Each of the words in the sentences is categorized into either a positive or a negative group based on its sentiment in the lexicon (Eq. 1). If polarity>0, then the sentence is classified as positive, and if polarity<0, then the sentence is classified as negative.
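The polarity rule described above can be sketched as follows. Eq. 1 is not reproduced in this text, so the normalized difference (pos − neg) / (pos + neg) used here is an assumption, as are the toy word lists, the function names and the "undecided" label for a zero score; none of them is taken from the Loughran-McDonald or Harvard IV-4 resources.

```python
# Toy lexicon: hypothetical entries for illustration only, not taken from
# the Loughran-McDonald or Harvard IV-4 word lists.
POSITIVE = {"gain", "profit", "growth", "improve"}
NEGATIVE = {"loss", "decline", "lawsuit", "default"}


def polarity(sentence):
    """Assumed form of the polarity score: (pos - neg) / (pos + neg).

    Returns 0.0 when the sentence contains no lexicon word.
    """
    tokens = sentence.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


def classify(sentence):
    """Map the polarity score to a label using the sign rule from the text."""
    score = polarity(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "undecided"  # the text does not specify the zero-polarity case


headline_up = "Quarterly profit shows strong growth"
headline_down = "Loss widens after lawsuit despite sales growth"
label_up = classify(headline_up)
label_down = classify(headline_down)
```

Words outside the lexicon are simply ignored, which mirrors the idea of keeping only sentiment-bearing terms.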
When using machine-learning (ML) and deep-learning (DL) classifiers, we extract the headline features by replacing the words in the sentence with the sentiment value, specified in the dictionary. Next, we input the newly generated sequence into the neural network to classify the text. The DL's output soft-max layer calculates the probability that the sequence belongs to either the positive or negative sentiment labels.

1) COUNT VECTORS
Count Vectorizer (CV) is a simple statistical approach to text representation which converts a collection of text documents into a matrix of token counts, thus reducing the entire sentence into a single vector. The positions in the vector represent the number of appearances of each word in the sentence. The CV algorithm performs feature extraction by using a vocabulary of words (tokens) which can be built from the same text corpus, or input manually (a-priori) from an external resource. The vocabulary limits the number of features which can be extracted from the text.
The CV approach for text representation has some drawbacks. First, the ordering information gets lost due to the methodology for term ''squeezing.'' Second, the contextual information of the sentence is hidden, although it is crucial for sentiment extraction. These issues can be partially solved by using n-gram vectorizers where two, three or more consecutive words are put together in order to form tokens. Another issue with CV is that it shadows the important words that hold decision-making features for classifiers, because it pays more attention to general, frequent words such as ''like,'' ''but,'' ''and,'' and ''or,'' which do not add meaningful information. As a result, important text features may vanish, which calls for more sophisticated algorithms.
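To illustrate the loss of ordering information and its partial recovery with n-grams, the following minimal sketch builds count vectors by hand. The helper names and the two-sentence corpus are hypothetical and serve only as an example; a library vectorizer would normally be used in practice.

```python
from collections import Counter


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(corpus, n=1):
    """Map each distinct n-gram in the corpus to a vector position."""
    terms = sorted({g for doc in corpus for g in ngrams(doc.lower().split(), n)})
    return {term: idx for idx, term in enumerate(terms)}


def count_vector(doc, vocabulary, n=1):
    """Count how often each vocabulary n-gram occurs in the document."""
    counts = Counter(ngrams(doc.lower().split(), n))
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:  # terms outside the vocabulary are dropped
            vector[vocabulary[term]] = count
    return vector


corpus = ["profit beats loss", "loss beats profit"]

uni_vocab = build_vocabulary(corpus, n=1)
uni_vectors = [count_vector(doc, uni_vocab, n=1) for doc in corpus]

bi_vocab = build_vocabulary(corpus, n=2)
bi_vectors = [count_vector(doc, bi_vocab, n=2) for doc in corpus]
```

With uni-grams the two sentences receive identical vectors although their meanings differ, while the 2-gram vectors are distinct.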

2) TF-IDF TERM WEIGHTING
TF-IDF (term frequency-inverse document frequency) is an algorithm for statistical measurement, which evaluates the relevance of a word in a document within a corpus of documents. It addresses the feature-vanishing issue of CV algorithms by re-weighting the count frequencies of the words (tokens) in the sentence according to the number of appearances of each token. The algorithm works by multiplying two metrics: term frequency (TF), which calculates the number of occurrences of a term in the sequence (Eq. 3), and inverse document frequency (IDF), which penalizes the feature count of the term if it appears in many sentences within the corpus (Eq. 4), where t denotes the term, d denotes the document, D denotes the corpus of documents and N is the total number of documents.
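A minimal sketch of this weighting is given below. Eqs. 3 and 4 are not reproduced in this text, so the concrete forms are assumptions: TF is taken as the raw count of t in d, and IDF as log(N / df), where df is the number of documents in D that contain t. The function names and the three-document corpus are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Assumed TF: raw number of occurrences of the term in the document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus_tokens):
    """Assumed IDF: log(N / df); 0.0 for terms that appear in no document."""
    n_docs = len(corpus_tokens)
    df = sum(term in doc for doc in corpus_tokens)
    return math.log(n_docs / df) if df else 0.0


def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of the two metrics for one term in one document."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


corpus = [
    ["shares", "rise", "on", "profit"],
    ["shares", "fall", "on", "loss"],
    ["shares", "outlook", "raised"],
]

weight_common = tf_idf("shares", corpus[0], corpus)  # appears in every document
weight_rare = tf_idf("rise", corpus[0], corpus)      # appears in one document
```

Under these assumptions a term that occurs in every document receives zero weight, whereas a term unique to one document receives the largest weight, which is the re-weighting effect described above.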
In this study, we assess the feature extraction performance of the uni-gram and 2-gram count vectorizers as well as the TF-IDF term weighting in combination with machine-learning classifiers and deep-neural networks.

C. WORD ENCODERS
Statistical features do not provide semantics of the contextually close words, which means that words with similar meaning will not have similar codes. Many NLP tasks such as sentiment analysis, question-answering and text generation require detailed semantic knowledge that is not provided by CV and TF-IDF. To overcome these challenges, researchers have introduced word encoders [43] to convert discrete words into high-dimensional vectors composed of real numbers, using a procedure called word embedding. Word encoders help with understanding the context of the sentences, which improves the extracted features. These models are based on the principle of distributional hypothesis [44], in which the meaning of words is evidenced by the context. This approach establishes a new area of research in NLP called distributional semantics, which is the core of many contemporary NLP techniques, including word encoders. These methods are called distributional semantic models (DSM), also known in the literature as vector space or semantic space models of meaning [45]- [47].
The word encoders classify the words that appear in the same context as semantically similar to one another, hence assigning similar vectors to them. This retained semantic information is very useful for classifiers or neural networks. In this section, we provide an overview of the most popular word encoders: Word2Vec [43], GloVe [28] and FastText [29], [48], which exemplify different approaches in modeling word embeddings.

1) Word2Vec
In 2013, a team of researchers at Google, led by Tomas Mikolov, introduced the breakthrough model for word representation called Word2Vec [27], [43], which marked the beginning of a spectacular evolution in NLP. Mikolov and his collaborators proposed two model architectures for computing continuous vector representations of words by using the unsupervised approach: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram Model (Fig. 1). The CBOW architecture predicts the current word based on the context, while in the Skip-Gram architecture, the distributed representation of the input word is used to predict the context [43]. The authors show the effectiveness of the proposed methodology experimentally, using several NLP applications, including sentiment analysis. Additionally, they demonstrate that the Skip-gram architecture gives more accurate results for large datasets because it generates more general contexts.
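The difference between the two architectures is easiest to see in the training examples they consume. The sketch below only generates these examples and does not train any embeddings; the function names, the window size and the sample headline are illustrative assumptions.

```python
def context_window(tokens, index, window):
    """Return up to `window` tokens on each side of position `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the current word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each context word a target."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


headline = ["central", "bank", "raises", "rates"]
cbow = cbow_examples(headline, window=1)
skipgram = skipgram_examples(headline, window=1)
```

For the same window, Skip-gram yields one example per (word, context word) pair, so it produces more training examples than CBOW from the same text.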
The main drawback of Word2Vec is its inability to handle unknown or out-of-vocabulary (OOV) words. If the model has not encountered a word before, it will be unable to interpret it or build a vector for it. Additionally, Word2Vec does not support shared representations at sub-word level, which means that it will create two completely different vector representations for words which are morphologically similar, like agree/agreement or worth/worthwhile [29].
In our analysis, we use a pre-trained version of Word2Vec on the Google News corpus, which contains almost 3 million English words represented by 300-dimensional vectors.

2) GloVe
In 2014, a team of researchers at Stanford University proposed GloVe, an improved methodology for word encoding, based on a solid mathematical approach [28]. GloVe overcomes the drawbacks of Word2Vec in the training phase, improving the generated embeddings. It emphasizes the importance of considering the co-occurrence probabilities between the words rather than single word occurrence probabilities themselves. The model combines two classes of methods for distributed word representations, global matrix factorization and Skip-gram, to extract better features by examining the relationships between words. The global matrix factorization method can capture the overall statistics and relationships between words. On the other hand, Word2Vec's Skip-gram method is efficient in extracting the local context and capturing word analogies. By incorporating both methods, the GloVe encoder outperforms Word2Vec in many NLP tasks. GloVe is widely used as a word encoder for NLP-based sentiment analysis [49]- [51].

3) FastText
In 2016, the Facebook research laboratory introduced a novel method for word encoding called FastText, which tackles the generalization problem of unknown words [29], [48]. FastText differs from previous models in its ability to build word embeddings at a deeper level by harnessing sub-words and characters. In this method, words become a context and word embedding is calculated based on combinations of lower-level embeddings. Each word is represented as a bag of character n-grams. For example, the word ''finance,'' given n = 3, will be represented by the following character n-grams: <fi, fin, ina, nan, anc, nce, ce>, where < and > mark the beginning and the end of the word. The main algorithm behind FastText is Word2Vec. Learning the sub-word information enables training of embeddings on smaller datasets and generalization to unknown words. FastText shows improved results in text classification [52], even in structurally rich languages such as Turkish [53] and Arabic [54], which require morphological analysis instead of assigning a distinct vector to each word.
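The character n-gram decomposition from the example above can be reproduced with a few lines of code. This is a sketch of the decomposition step only, with a hypothetical helper name; it does not learn or combine any embeddings.

```python
def char_ngrams(word, n=3):
    """Split a word into character n-grams after adding boundary markers."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


finance_grams = char_ngrams("finance", n=3)

# Morphologically related words share sub-word units, which is what allows
# a sub-word model to relate them and to cover unseen words.
shared = set(char_ngrams("agree")) & set(char_ngrams("agreement"))
```

Here `finance_grams` matches the list given in the text, and the words agree/agreement share the units `<ag`, `agr`, `gre` and `ree`.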
We evaluate pre-trained FastText vectors in order to assess their performance on financial texts. We use the wikinews-300d-1M pre-trained model, which contains 1 million word vectors trained on Wikipedia's 2017 corpus and the statmt.org news dataset, where each embedding consists of 300 dimensions.

4) ELMo
In 2018, a team of researchers at the Allen Institute for Artificial Intelligence developed an advanced word encoder called ELMo (Embeddings from Language Models) [55], whose word embeddings are learned from a deep bidirectional language model (biLM), pre-trained on large corpora of textual data. The essential feature, which makes ELMo different from previous word encoders, is that it produces contextual word embeddings considering the whole context in which the word is used. Hence, we can obtain different embeddings for the same word in different contexts, a major improvement over previous encoders, which always produce a static embedding. To tackle out-of-vocabulary (OOV) tokens, ELMo uses character-derived embeddings, leveraging the morphological clues of words, thus improving the quality of word representations.

D. SENTENCE ENCODERS
In 2014, the idea of encoding entire sentences surpassed word encoding. The primary purpose of sentence encoders is to learn fixed-length feature vectors that encode the syntax and semantic properties of variable-length sentences. While a simple sentence embedding model can be built by averaging the individual word embeddings for every word of the sentence, this approach loses the inherent context and sequence of words as valuable information that should be retained in many tasks.
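The weakness of the averaging baseline can be shown with a minimal sketch. The three-dimensional vectors below are invented for illustration and are not taken from any pre-trained encoder; the function name is likewise hypothetical.

```python
# Hypothetical 3-dimensional word vectors, invented for illustration.
TOY_VECTORS = {
    "profit": [1.0, 0.5, 0.0],
    "exceeds": [0.25, 0.75, 0.5],
    "loss": [-1.0, 0.5, 0.25],
}


def mean_embedding(tokens, vectors):
    """Average the vectors of all known tokens into one fixed-length vector."""
    rows = [vectors[token] for token in tokens if token in vectors]
    if not rows:
        raise ValueError("no token of the sentence has a known vector")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


sentence_a = ["profit", "exceeds", "loss"]
sentence_b = ["loss", "exceeds", "profit"]

embedding_a = mean_embedding(sentence_a, TOY_VECTORS)
embedding_b = mean_embedding(sentence_b, TOY_VECTORS)
```

Both sentences map to the same vector even though they state opposite outcomes, which is the loss of sequence information noted above.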
The main weakness of using sentence encoders to handle variable-length text input is related to the fixed size of the produced vectors. Long and short sentences are treated equally, producing the same number of extracted features, thus diluting the embeddings.
In this section, we outline recent and most prevalent sentence encoders [2], [30]- [33], to assess their ability to extract important features in sentence representation of financial headlines.

1) Doc2Vec
In 2014, the first successful sentence encoder, Doc2Vec [30], introduced an approach for representing variable-length fragments of texts (sentences, paragraphs, and documents) as fixed-size dense vectors, a.k.a. paragraph vectors. These vectors are trained to predict words in documents. Their primary goal is to make an appropriate distributed representation of large texts, overcoming the weaknesses of bag-of-words methods. Paragraph vectors combine word vectors to build phrase-level or sentence-level representations. They epitomize a distributed memory model, holding the context of the paragraph and contributing to the prediction task of the next word in combination with word vectors. Additionally, paragraph vectors can be used as features for the paragraph, which can be fed as input to a classifier or to a neural network, making them appropriate for evaluation of sentiment analysis in financial headlines. To obtain sentence embeddings, we use a Doc2Vec approach, which is pre-trained on English Wikipedia texts.

2) SKIP-THOUGHT VECTORS
Skip-Thought Vectors [31] are models that use an encoder-decoder architecture for sequence modeling based on unsupervised learning. These models use continuity of texts, extracted from books, to train an encoder-decoder method. The model tries to reconstruct the surrounding sentences of an encoded passage in order to remap their syntactic and semantic meaning into similar vector representations. The encoder generates a sentence vector, and the decoder is used to generate the surrounding sentences. The model uses a recurrent neural network (RNN) encoder with Gated Recurrent Unit (GRU) [56] activations, and an RNN decoder with a conditional GRU. The use of the attention layer provides for a dynamic change of the source sentence representation. Depending on the encoder type, two separate models are trained: uni-skip and bi-skip. The uni-skip model passes sentences in the correct order and extracts 2400 features. The bi-skip model uses two encoders, one of which passes the sentence in the correct order while the other passes it in reverse order; each extracts 1200 features, for a total of 2400 features. Due to their generative nature, Skip-Thought vectors are appropriate and effective for neural machine translation and classification tasks. The main shortcoming of this approach is the arduous task assigned to the decoder [57], as the next sentence prediction requires modeling aspects that are, in most cases, irrelevant to the meaning of the sentence.

3) InferSent
InferSent [32] is a supervised approach to learning sentence embeddings using natural language inference (NLI) data. NLI captures universally useful features, thus learning universal sentence embeddings in a supervised manner. The training dataset used by this model is the Stanford Natural Language Inference (SNLI) dataset that contains 570k human-generated English sentence pairs, manually annotated with one of the three labels: entailment, contradiction, or neutral. Fig. 2 shows a shared encoder used for encoding the premise u and the hypothesis v. In order to extract relations between u and v, three matching methods are applied: concatenation (u, v), element-wise product u * v and absolute element-wise difference |u − v|. Next, the resulting feature vector is applied as input to the 3-class classifier to evaluate the relationship between u and v based on the extracted features. Experimentally, the best architecture for the encoder is shown to be the BiLSTM network with max pooling. This approach outperforms Skip-Thought vectors in many NLP tasks.
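The three matching methods can be sketched as a single feature-building step. The function name and the ordering of the parts in the output vector are assumptions made for illustration; the input vectors stand in for encoder outputs and are not produced by a BiLSTM here.

```python
def pair_features(u, v):
    """Combine premise vector u and hypothesis vector v into one feature vector.

    The result concatenates u, v, the element-wise product u * v and the
    absolute element-wise difference |u - v|, so its length is 4 * len(u).
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimensionality")
    product = [a * b for a, b in zip(u, v)]
    abs_difference = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + product + abs_difference


# Hypothetical 2-dimensional sentence vectors standing in for encoder outputs.
premise = [1.0, -2.0]
hypothesis = [0.5, 4.0]
features = pair_features(premise, hypothesis)
```

The resulting vector is what a 3-class classifier would receive as input in the setup described above.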
In our study, we assess the performances of two publicly available versions of InferSent. The first version is trained with Stanford's GloVe as word encoder and the second is trained with Facebook's FastText.

4) UNIVERSAL SENTENCE ENCODER
In March 2018, Google researchers published their first version of a model which converts variable-length sentences into 512-dimensional vectors, called Universal Sentence Encoder (USE) [2]. The model is able to embed not only sentences, but also words and entire paragraphs. USE uses the concept of transfer-learning to leverage the knowledge extracted from large datasets to improve the results when limited training data is available.
We evaluate the USE encoder, which is based on the Deep Averaging Network (DAN) architecture as shown in Fig. 3. Input embeddings for words and bi-grams are first averaged and then passed through a feed-forward deep neural network (DNN) to produce sentence embeddings. The computational time is linear in the length of the input sentence.
USE models are trained on a variety of data sources: Wikipedia, news, question-answer pages, and discussion forums. These models are based on transfer-learning experiments with several datasets to evaluate the efficiency of the encoder. The results show that sentence encoders outperform transfer-learning methodologies that use word-level embeddings alone.
The main issues with USE (DAN model) are related to the use of averaging techniques that cannot recognize negation phrases like ''not good.'' This limitation motivates the use of contextualized embeddings, which consider the influence of the other words when producing the sentence embedding.
In our analysis, we assess the two latest versions of USE (4 and 5) that can be found at the TensorFlow Hub repository.

5) LANGUAGE-AGNOSTIC SENTENCE REPRESENTATIONS (LASER)
In 2019, Facebook researchers [33] introduced an architecture for universal language-agnostic multilingual sentence representations (LASER) for 93 languages by using a single BiLSTM encoder with a shared Byte Pair Encoding (BPE) vocabulary for different languages. The main contribution of the LASER methodology is that it provides a framework for zero-shot transfer-learning. LASER leverages one model, trained on one language, to be used in another language without the need for pre-training. This is accomplished by LASER's ability to bring semantically similar sentences, written in different languages, close to each other in the embedding space.
Sentence embeddings are obtained by applying a max-pooling operation to the output of the BiLSTM encoder. The same encoder is used for all 93 languages. The byte-pair encoding (BPE) vocabulary is learned based on the concatenation of all training corpora, hence, it does not require specific information about the input language. LASER's encoder architecture, illustrated in Fig. 4, is shown to be efficient even for low-resource languages.
In this study, we evaluate LASER on English texts, though the same model that we build here can be used for sentiment analysis in texts written in the other 92 languages supported by LASER.

III. NLP TRANSFORMERS
The pre-trained word and sentence embeddings show good performance for NLP tasks due to their ability to retain the semantics and the syntax of the words in the sentence. The transfer-learning task, in this case, allows for the information that has been learned from unlabeled data to be used in tasks with relatively small labeled data to achieve higher accuracy. Although such embeddings have proven to be powerful, they lack context-based mutability. Word2Vec, GloVe, and FastText use fixed embeddings for each of the words, thus producing one-to-one mapping, which in many cases is not appropriate and requires additional attention. Recent research studies have proposed methods that produce different embeddings for the same word, taking into consideration specific contexts [3], [55], [58]. As an illustration of context importance, we analyze the following two sentences that contain the word ''Apple'': ''Apple Inc performed well this year.'' and ''Apple fruits are exported to various countries.'' In the first sentence, Apple refers to the technology company Apple, headquartered in the US, while in the second sentence, apple refers to the fruit, with a completely different meaning. Fixed word encoders, however, will produce the same encoding for both occurrences regardless of the context. This problem highlights the need for contextualized embeddings for the word ''Apple.''

A. NLP TRANSFORMER ARCHITECTURE
A transformer represents an architecture that transforms one sequence into another by using two models: encoder and decoder. Unlike previously described standard sequence-to-sequence models, which are based on LSTM/GRU units, the paper ''Attention is All You Need'' [59] introduces a novel, breakthrough transformer architecture based solely on multi-headed self-attention mechanisms. There are three reasons for choosing self-attention instead of a recurrent layer: computational complexity, parallelization, and learning long-range dependencies between words in the sequence, all of which are crucial for building contextualized embeddings. By using this approach, transformers have shown improved results in machine translation and other related tasks.
VOLUME 8, 2020
This method uses positional embedding to remember the order of words in the sequence. The main building blocks in the encoder/decoder modules are Multi-Head Attention and Feed Forward layers, as shown in the Attention-based transformer architecture (Fig. 5).
The scaled dot-product attention mechanism is described by Eqs. 5 and 6:
$$a = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{n}}\right) \tag{5}$$
$$\mathrm{Attention}(Q, K, V) = aV \tag{6}$$
In Eq. 5, the attention weights a represent how strongly each word in the sequence (Q) is influenced by all the other words (K) in the same sequence. Q is a matrix that contains the query (vector representation of one word in the sequence), K contains all keys (vector representations of all the words in the sequence) and n is the dimensionality of the query/key vectors. The softmax function ensures that the weights a lie between 0 and 1 and sum to 1. Given a, self-attention is calculated by using Eq. 6, which represents a weighted sum of the values (V), where V is the vector obtained from the encoder.
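As a minimal sketch of the mechanism just described, the code below computes the weights a as the softmax of the scaled query-key dot products and then forms the weighted sum of the values. It follows the standard formulation from [59]; the function names and the toy matrices are assumptions, and plain Python lists are used instead of a tensor library.

```python
import math


def softmax(scores):
    """Turn raw scores into weights between 0 and 1 that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key and value vectors."""
    n = len(keys[0])  # dimensionality of the query/key vectors
    outputs, weights = [], []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(n)
                  for key in keys]
        a = softmax(scores)
        weights.append(a)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * value[d] for w, value in zip(a, values))
                        for d in range(len(values[0]))])
    return outputs, weights


# Toy example: two identical keys, so the single query attends to both equally.
queries = [[1.0, 0.0]]
keys = [[1.0, 1.0], [1.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

Because both keys are identical, the weights are uniform and the output is the plain average of the two value vectors.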
A multi-head attention mechanism calculates the scaled dot-product attention multiple times in parallel. The independent outputs are concatenated and linearly transformed into expected dimensions. Multi-head attention is obtained by using Eq. 7:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \tag{7}$$
Each head $\mathrm{head}_i$ can be calculated by Eq. 8:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{8}$$
where h is the number of heads and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are learned projection matrices [59].
A pre-training phase is an unsupervised learning approach where an unlabeled text corpus is introduced into the transformer architecture to produce text representations based on an objective function used by the transformer. This is a relatively expensive task, but the learned token or generic sentence representations can be used in many other tasks using transfer-learning. Later, the representation can be fine-tuned in order to recognize the specifics of the task and to achieve better results. Fine-tuning is performed by adding an additional dense layer after the last hidden state, which is the recommended way of using transformers in classification and regression tasks [3]. The transformer performs supervised learning (fine-tuning) on the labeled sentiment dataset, which is relatively inexpensive compared to pre-training.
NLP transformers are applicable to many different text classification problems, such as binary sentiment classification, which we use in our analysis.

1) BERT
In 2018, Devlin et al. [3] leveraged the transformer architecture to introduce a revolutionary language representation model, called BERT (Bidirectional Encoder Representations from Transformers). This model started the new era in NLP, with state-of-the-art performance achieved on most NLP tasks. BERT leverages the unsupervised learning approach to pre-train deep bidirectional representations from large unlabeled text corpora by using two new pre-training objectives: masked language model (MLM) and next sentence prediction (NSP). BERT overcomes the limitation of previous language models, which incorporate only unidirectional representations of words in sentences. It builds a bidirectional masked language model, which predicts randomly masked words in the sentence, enriching the contextual information of the words.
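The input corruption behind the masked language model objective can be sketched as follows. This is a simplified illustration and not BERT's exact procedure: every selected token is replaced by a mask symbol, the masking probability of 0.15 and the function name are assumptions, and no model is trained.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK and record the original tokens.

    Returns the corrupted sequence and a dict {position: original token},
    i.e. the targets a masked language model would be trained to predict.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[position] = token
        else:
            corrupted.append(token)
    return corrupted, targets


headline = "central bank raises interest rates to curb inflation".split()
corrupted, targets = mask_tokens(headline, mask_prob=0.15, seed=0)
```

The model sees `corrupted` as input and is asked to recover the entries of `targets`, using the words on both sides of each masked position.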
BERT is based on conventional, auto-regressive (AR) language modeling. The process of pre-training is performed by maximizing the likelihood between the tokens x in a text sequence x = [x 1 , . . . , x T ]. Letx denote the same text sentence with masked tokens and x be an array of masked tokens. The training objective for BERT is to reconstruct x fromx by Eq.9: where, • e(x ) denotes the embedding of the token x; • m t = 1, if x t token of the text sequence x is masked; • H θ is a Transformer which transforms each token of text sequence into a hidden vector. BERT assumes that all masked tokens x are mutually independent, which is the main rationale behind the approximation of the joint conditional probability p(x,x) in Eq.9. Another advantage that differentiates BERT from previous AR methods is the ability to increase the context information H θ (x) t by accessing the tokens placed on the left and the right side of token t.
BERT has two versions: BERT-base, with 12 encoder layers, hidden size of 768, 12 multi-head attention heads and 110M parameters in total; and BERT-large, with 24 encoder layers, hidden size of 1024, 16 multi-head attention heads and 340M parameters. Both of these models have been trained on English Wikipedia and BookCorpus [60].

2) FinBERT
FinBERT [61] is a version of BERT intended for the finance domain. It is pre-trained on a financial text corpus which consists of 1.8M news articles from the Reuters TRC2 dataset, published between 2008 and 2010. Compared to other pre-trained versions of BERT, the FinBERT model has achieved a 15% improvement in accuracy in text classification tasks specifically applied to financial texts.

3) XLNet
The XLNet model, developed by Google Brain and Carnegie Mellon University, addresses the disadvantages of BERT, improves its architectural design for pre-training, and produces results that outperform BERT in 20 different tasks. It utilizes a generalized AR model where the next token is dependent on all previous tokens, thus avoiding corrupted input caused by masking of the words, performed by BERT. The limitations of BERT include neglecting the dependency between masked tokens as it assumes that they are mutually independent variables. On the other hand, XLNet considers these tokens in the process of context building and assumes that masked words are mutually dependent.
Additionally, XLNet uses Permutation Language Modeling (PLM) to capture bidirectional context by maximizing the expected log-likelihood of a sequence given all possible permutations of words in a sentence. This means that XLNet enriches the contextual information of each position by leveraging the tokens from all the other positions found on the left and on the right sides of the token. Specifically, for a sequence x of length T , there are T ! different orders on which the algorithm performs auto-regressive factorizations.
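Let $\mathcal{Z}_T$ be the set of all permutations of the words in a sentence of length $T$, and let $z_t$ and $z_{<t}$ denote the $t$-th element and the first $t - 1$ elements of a permutation $z \in \mathcal{Z}_T$. The PLM objective is given in Eq. 10:

\[ \max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \log p_\theta\left(x_{z_{>c}} \mid x_{z_{\le c}}\right) \right] = \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=c+1}^{|z|} \log p_\theta\left(x_{z_t} \mid x_{z_{<t}}\right) \right] \tag{10} \]

The hyperparameter $c$ can be derived from the hyperparameter $K$, where $c = |z|(K - 1)/K$, and it represents the cutting point that divides the vector $z$ into a non-target subsequence $z_{\le c}$ and a target subsequence $z_{>c}$.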
As shown in Eq. 9 and Eq. 10, both BERT and XLNet perform partial prediction, that is, they predict only a subset of the tokens in a sequence in order to ease optimization. The main difference lies in the choice of tokens used for context modeling. BERT predicts the masked tokens, assuming that targets are mutually independent, while XLNet predicts the last tokens in a factorization order, $z_{>c}$.
The following example, [Wells, Fargo, is, a, bank, in, USA], explains the difference. Assume that our goal is to predict ''Wells Fargo.'' In order to use [Wells, Fargo] as prediction targets, BERT masks them, and XLNet samples the factorization order [is, a, bank, in, USA, Wells, Fargo]. Using Eq. 9, BERT will compute:

\[ J_{BERT} = \log p(\text{Wells} \mid \text{is, a, bank, in, USA}) + \log p(\text{Fargo} \mid \text{is, a, bank, in, USA}) \tag{11} \]

Using Eq. 10, XLNet will compute:

\[ J_{XLNet} = \log p(\text{Wells} \mid \text{is, a, bank, in, USA}) + \log p(\text{Fargo} \mid \text{Wells, is, a, bank, in, USA}) \tag{12} \]

These examples show that BERT and XLNet compute the objective differently. XLNet captures important dependencies between prediction targets, such as (Wells, Fargo), which BERT omits. Hence, XLNet combines the advantages of AR and auto-encoding methods by using a generalized AR pre-training approach with a permutation language modeling objective, in order to improve the results in NLP.
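The difference between the two objectives in the example above can also be traced programmatically. The following sketch only lists which context each target token is conditioned on; it does not compute any probabilities. The function names `bert_terms` and `xlnet_terms` are hypothetical, and the value K = 3.5 is an assumption chosen so that the cutting point leaves exactly two target tokens.

```python
def bert_terms(tokens, targets):
    """BERT: every masked target is conditioned on the unmasked tokens only."""
    context = [tok for tok in tokens if tok not in targets]
    return [(target, context) for target in targets]

def xlnet_terms(order, c):
    """XLNet: every target in order[c:] sees all tokens that precede it in the order."""
    return [(order[t], order[:t]) for t in range(c, len(order))]

sentence = ["Wells", "Fargo", "is", "a", "bank", "in", "USA"]
order = ["is", "a", "bank", "in", "USA", "Wells", "Fargo"]  # sampled factorization order

K = 3.5                               # assumed value, gives two targets for |z| = 7
c = int(len(order) * (K - 1) / K)     # cutting point c = |z|(K - 1)/K = 5

bert = bert_terms(sentence, ["Wells", "Fargo"])
xlnet = xlnet_terms(order, c)

for name, terms in (("BERT", bert), ("XLNet", xlnet)):
    for target, context in terms:
        print(f"{name}: log p({target} | {', '.join(context)})")
```

The only difference is in the second term: XLNet conditions ''Fargo'' on ''Wells'', while BERT does not.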

4) XLM
The Cross-lingual Language Model (XLM) [62] has a transformer architecture that is mainly used for modeling cross-lingual features. XLM is pre-trained using several objectives:
• Causal Language Modeling (CLM): next-token prediction.
• Masked Language Modeling (MLM): an approach similar to BERT's objective of masking random tokens in the sentence.
• Translation Language Modeling (TLM): a supervised approach, which harnesses parallel streams of textual data written in different languages in order to improve cross-lingual pre-training.
In our analysis, we use XLM for text classification tasks to perform sentiment analysis of texts in English. We rely on the Masked Language Modeling (MLM) objective, which exploits the bi-directional context of the tokens in sentences and is the best approach for our evaluation task.

5) ALBERT
To overcome the shortcomings of large pre-trained natural language representations, such as GPU/TPU memory limitations and longer training times, in 2019 Google Research and Toyota Technological Institute jointly released BERT's smaller and more scalable successor, called ALBERT [63]. ALBERT applies two parameter-reduction methods, factorized embedding parameterization and cross-layer parameter sharing, together with a sentence-order prediction objective, in order to lower memory consumption and increase the training speed of BERT. ALBERT outperforms BERT in several tasks, including text classification [64]. ALBERT uses a significantly reduced number of parameters in sentiment analysis, compared to BERT and XLNet.

6) RoBERTa
The RoBERTa model, introduced by the Facebook research team in 2019 [4], offers an alternative, optimized version of BERT. It is retrained on a dataset ten times larger, with an improved training methodology and different hyperparameters. RoBERTa also removes the Next Sentence Prediction (NSP) objective and adds dynamic masking of words during the training epochs. These changes yield better performance than BERT in many NLP tasks, including text classification.

7) DistilBERT
DistilBERT, introduced in October 2019 [65], is based on a methodology that reduces the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. The technique that compresses the original model is known as knowledge distillation. The compact (student) model is trained to reproduce the full output distribution of the larger (teacher) model or ensemble of models. Rather than training with a cross-entropy over the hard targets (one-hot encoding of the classes), the student obtains the knowledge based on a distillation loss over the soft-target probabilities of the teacher. The distillation loss $L_{ce}$ is calculated by using Eq. 13:

\[ L_{ce} = -\sum_i t_i \log(s_i) \tag{13} \]

where $t_i$ and $s_i$ are the estimated probabilities of the teacher and student, respectively. This objective results in a richer training signal, since soft-target probabilities enforce stricter constraints compared to a single hard target. We assess the performance of three distilled versions (students) of the following transformers (teachers): BERT-base-cased, BERT-base-uncased, and RoBERTa-base.
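The distillation loss over soft targets can be illustrated with a short sketch. It assumes that the teacher and student probabilities are obtained from logits by a softmax with temperature, as is common in knowledge distillation; the temperature value of 2.0 and the logits below are made-up illustrative numbers, not settings from our experiments.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between teacher soft targets t_i and student probabilities s_i."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))

teacher = [2.0, -1.0]          # teacher logits for (positive, negative)
close_student = [1.8, -0.9]    # student that nearly agrees with the teacher
far_student = [-1.0, 2.0]      # student that disagrees with the teacher

loss_self = distillation_loss(teacher, teacher)
loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

The loss is smallest when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, which is the signal the student is trained on.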

8) XLM-RoBERTa
The XLM-RoBERTa (XLM-R) [66] model is a multilingual model trained on one hundred different languages by using 2.5TB of filtered CommonCrawl data and it is based on Facebook's RoBERTa model. XLM-R achieves solid performance gains for a wide range of cross-lingual transfer tasks, including text classification. Additionally, XLM-RoBERTa offers a possibility of multilingual modeling without decreasing per-language performance, which makes it more attractive for evaluation compared to other transformers.
XLM-R follows the XLM approach [62], trained with a Masked Language Modeling (MLM) objective with minor changes to the hyper-parameters of the original XLM model.
In our analysis, we evaluate the performance of two different pre-trained XLM-R models, XLM-R-base and XLM-R-large, which differ in the number of parameters.

9) BART
In October 2019, the Facebook research team published a novel transformer called BART [67] with an architecture similar to both BERT [3] and GPT-2 (Generative Pre-Training 2) [68]. BART outperforms other transformers in generation tasks such as text summarization and question answering. BART leverages the advantages of the bidirectional encoder from BERT and the GPT AR decoder. The auto-regressive approach means that GPT considers the left-to-right dependence of the words in a sentence, which makes it more appropriate for text generation compared to BERT. BART's encoder and decoder are connected by cross-attention. Each decoder layer performs attention over the final hidden state of the encoder output. This mechanism enables the model to generate output that is closely connected to the original input.
The fine-tuned model concatenates the input sentence with the end of sequence (EOS) token and passes these components as input to the BART encoder and decoder. The representation of the EOS token is used to classify the sentiment expressed in the sentence. In this study, we fine-tune BART and adapt it to sentiment analysis in finance.

IV. DATASETS
We use publicly available datasets that have been labeled by financial experts to perform a reliable evaluation of the ML models in predicting sentiments of financial headlines. We perform binary classifications to designate each of the sentences as bullish (positive) or bearish (negative), as described in the following subsections.

A. FINANCIAL PHRASE BANK
The Financial Phrase-Bank dataset [69] consists of 4845 English sentences selected randomly from financial news found on the LexisNexis database. These sentences have been annotated by 16 experts with a background in finance and business. The annotators were asked to give labels according to how they think the information in the sentence might influence the mentioned company's stock price. The dataset also includes information regarding the agreement levels on sentences among annotators. All sentences are annotated with three labels: Positive, Negative, and Neutral. The distribution of sentiment labels is presented in Table 1.

B. SemEval 2017 TASK 5
The second dataset used in this paper is provided by the SemEval-2017 task ''Fine-Grained Sentiment Analysis on Financial Microblogs and News'' [70]. The Financial News Statements and Headlines dataset consists of 2510 news headlines, gathered from different publicly available sources such as Yahoo Finance. Each headline (instance) is annotated by three independent financial experts, and a sentiment score, in the range between -1 and 1, is assigned to each statement. A score of -1 means that the statement (message) is bearish or very negative, and a score of 1 means that the statement is bullish or very positive. We convert these sentiment scores into sentiment labels (bullish/bearish). The conversion process is performed by using Eq. 14.
After the conversion, the number of sentences per label is presented in Table 1.
The dataset used for evaluation is a combination of both datasets. To address the imbalance between positive and negative sentences, we balance the data by extracting 1093 positive and 1093 negative sentences, which we merge into one dataset. We then shuffle the dataset and perform a stratified split, setting aside 80% of the sentences as a training set and the remaining 20% as a validation set. As a result, the balanced training set includes 1748 samples and the balanced validation set includes 438 samples.
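The balancing and splitting procedure can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code of our platform: the function name `balanced_stratified_split`, the random seed, and the placeholder corpus (including its 1500 positive sentences) are hypothetical, while the 1093 samples per class and the 80/20 ratio follow the description above.

```python
import random

def balanced_stratified_split(samples, per_class=1093, train_share=0.8, seed=42):
    """Undersample each class to per_class items, then split each class 80/20.

    samples is a list of (text, label) pairs with label "positive" or "negative".
    """
    rng = random.Random(seed)
    train, valid = [], []
    for label in ("positive", "negative"):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        group = group[:per_class]              # balance the classes
        cut = int(len(group) * train_share)    # same share per class (stratified)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    rng.shuffle(train)
    rng.shuffle(valid)
    return train, valid

# Placeholder corpus standing in for the merged datasets (class sizes are made up).
corpus = ([(f"positive sentence {i}", "positive") for i in range(1500)]
          + [(f"negative sentence {i}", "negative") for i in range(1093)])

train, valid = balanced_stratified_split(corpus)
print(len(train), len(valid))
```

Splitting each class separately keeps both sets balanced, which yields 874 training and 219 validation sentences per class.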

C. DATA PRE-PROCESSING
Financial headlines, similar to other real world text data, are likely to be inconsistent, incomplete and contain errors. Hence, to prepare the data, we perform initial pre-processing that includes tokenization, stop-word removal, and stemming. Additionally, we extract the named entities (organizations and people) from the headlines and replace them with their general nouns. For example, Microsoft is replaced with <CMPY>, or London with <CITY>.
We restrict the sentence length to between 3 and 64 words. After this initial filtering, we obtain the distributions of the number of words per sentence for the training set (Fig. 6) and for the validation set (Fig. 7).
When evaluating lexicon-based and word encoders, we perform left padding to sentences in order to fix their size, due to their variable length. Considering the maximum size of the sentences given in Figs. 6 and 7, we pad them to 64 word length. When using sentence encoders, we do not pad the sequences due to the ability of the sentence encoders to encode sentences to fixed-size vectors.
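The length filter and the left padding can be sketched as follows. The padding token `<PAD>` and the example headlines are assumptions made for this illustration; tokenization is reduced to whitespace splitting, and stop-word removal, stemming and entity replacement are assumed to have been applied already.

```python
PAD = "<PAD>"  # hypothetical padding token

def keep_sentence(tokens, min_len=3, max_len=64):
    """Keep only sentences whose length is within the 3-64 word range."""
    return min_len <= len(tokens) <= max_len

def left_pad(tokens, size=64, pad=PAD):
    """Prepend padding tokens so that every sentence has exactly `size` tokens."""
    return [pad] * (size - len(tokens)) + tokens

headlines = [
    "<CMPY> profit rises on strong cloud sales".split(),
    "Shares fall".split(),  # only two tokens, removed by the filter
]

padded = [left_pad(tokens) for tokens in headlines if keep_sentence(tokens)]
```

With left padding, the actual words always occupy the last positions of the sequence, so a recurrent network reads them immediately before producing its final state.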

V. SENTIMENT ANALYSIS PLATFORM
We evaluate the sentiment analysis methods by using a general platform consisting of the five phases shown in Fig. 8:
• In the first phase, we create our working dataset based on the Financial Phrase Bank and the SemEval 2017 dataset.
• In the second phase, we apply the data pre-processing functions described in subsection IV-C.
• The third phase performs text encoding by using various text representation methods in order to extract features from the pre-processed texts. We evaluate the following text representation methods: domain lexicons, statistical models for feature extraction, word encoders, sentence encoders and NLP transformers.
• In the fourth phase, these embeddings are fed as input to various machine-learning or deep-learning classifiers, thus enabling us to evaluate many encoding-classifier combinations.
• In the fifth phase, we compare the real and predicted labels using several binary classification performance metrics.
The sentiment analysis platform is implemented in Python 3.6. The shallow models are developed using TensorFlow Keras [71], while the pre-trained versions of NLP transformers are retrieved from the Hugging Face repository [72]. The sentiment analysis modules are published in a GitHub repository (https://github.com/f-data/finSENT).
In the following subsections, we present the details of the machine-learning and deep-learning classifiers, the fine-tuning of NLP transformers and the evaluation metrics.

A. MACHINE-LEARNING CLASSIFIERS
In our evaluation analysis, we use two machine-learning classifiers: Support Vector Classifier (SVC), as a representative of Support Vector Machines (SVM), and Extreme Gradient Boosting (XGB) [73], [74], as a representative of gradient-boosted decision trees. We chose the XGB model because it has achieved impressive results in many Kaggle competitions in the structured data category. When using the ML classifiers, we perform a grid search to select the best hyper-parameters.

B. DEEP-LEARNING CLASSIFIERS
The text representations and the features extracted from the evaluation methods are fed as input into Convolutional Neural Networks (CNN) [23] and Recurrent Neural Networks (RNN) [82] in order to proceed with the classification. While RNN networks work well in sequence modeling and capturing long-term dependencies, CNN networks are more efficient in capturing spatial or temporal correlations and in reducing data dimensionality.
In order to improve the architecture of previous DNN networks, novel mechanisms have been introduced. One of them is the Attention mechanism [83], which helps RNN networks focus on specific parts of the input sequence, facilitating the learning and improving the prediction. The Attention mechanism is widely used in encoder-decoder architectures due to its ability to highlight important parts of the contextual information.
Bidirectional RNN networks are often used to collect features from both directions. A forward RNN $\overrightarrow{h}$ gathers token features from the start ($x_1$) to the end ($x_n$), while the backward RNN $\overleftarrow{h}$ processes the tokens in the reverse direction, from $x_n$ to $x_1$. The resulting hidden state $h$ uses both sets of features by concatenating $\overrightarrow{h}$ and $\overleftarrow{h}$, as shown in Eq. 15:

\[ h = \overrightarrow{h} \oplus \overleftarrow{h} \tag{15} \]

where $\oplus$ denotes the concatenation function.
In our analysis, we used shallow RNN and CNN networks in order to evaluate the features from text representations. These shallow neural networks consist of three main layers: the input (embedding) layer, the hidden layer, and the output layer. The input layer uses text representation methods (lexicons, word encoders or sentence encoders) to extract the feature vectors from the headlines. It then passes the vectors as input to the recurrent or convolutional hidden layer, which extracts complex features from the text representations. The output layer uses a softmax activation function to make the final classification. We then add an attention layer after the hidden layer to evaluate its effectiveness. Furthermore, we build an additional group of GRU and LSTM networks, which support bidirectional feature extraction, to assess their performance in finance-based sentiment analysis as described in [84]. We use the binary cross-entropy loss function when training the models. The Adam (adaptive moment estimation) optimization algorithm [85] is used to find optimal weights in the networks. We use a maximum of one hundred training epochs for all DL models. To prevent over-fitting, we impose early stopping when the validation loss does not diminish after ten epochs. Finally, we use a dropout layer as regularization in the CNN network [86].
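The early-stopping rule described above (at most one hundred epochs, stopping once the validation loss has not improved for ten epochs) can be sketched independently of any deep-learning framework. The function below is a hypothetical illustration that replays a list of simulated validation losses; in practice the same behavior is typically provided by the training framework.

```python
def run_with_early_stopping(val_losses, max_epochs=100, patience=10):
    """Replay per-epoch validation losses and return (epochs_run, best_epoch)."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # improvement: remember this epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return min(len(val_losses), max_epochs), best_epoch

# Simulated validation losses: improvement for five epochs, then a plateau.
simulated = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.5] * 95

epochs_run, best_epoch = run_with_early_stopping(simulated)
```

In this simulated run the best loss is reached at epoch 5, so training stops at epoch 15 instead of running all one hundred epochs.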

C. MODEL FINE-TUNING
To evaluate NLP transformers, we use pre-trained models from the Hugging Face repository [72]. For FinBERT, we use the language model trained on the TRC2 dataset, published in its GitHub repository. We fine-tune the transformers with the training dataset by adding only one dense layer after the last hidden state. The dense layer outputs the probabilities of sentence classification. The transformer's hyper-parameter settings during the fine-tuning phase are not model agnostic, and they are directly related to the quality of the model.
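The role of the added dense layer can be illustrated with a small sketch that maps a hidden-state vector to class probabilities. This is not the fine-tuning code used in our experiments, which relies on the pre-trained Hugging Face models; the three-dimensional hidden state, the weights and the class order are made-up values for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_head(hidden_state, weights, bias):
    """One dense layer, logits = W h + b, followed by a softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden_state)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical 3-dimensional hidden state; real models use e.g. 768 or 1024 dimensions.
hidden_state = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 1.0],     # row for the bullish (positive) class
           [0.0, 1.0, -1.0]]    # row for the bearish (negative) class
bias = [0.0, 0.0]

probs = dense_head(hidden_state, weights, bias)  # [p(bullish), p(bearish)]
```

During fine-tuning, the weights of this layer are learned together with the transformer weights on the labeled sentiment data.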

D. EVALUATION METRICS
We evaluate the models for sentiment analysis of financial headlines, and present the results chronologically, based on the models' publication date. We first evaluate lexicon-based methods, using the Harvard IV-4 and Loughran-McDonald dictionaries. Next, we evaluate word encoders.
The Matthews correlation coefficient (MCC) is widely used in assessing binary classification performance, with a range between -1 (completely wrong binary classifier) and 1 (completely accurate binary classifier). It takes into consideration true and false positives and negatives, thus providing a balanced measure, which can be used even if the classes have different sample sizes.
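For reference, MCC is computed from the confusion matrix as MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The sketch below implements this standard formula in plain Python; the label encoding (1 for bullish, 0 for bearish), the convention of returning 0 when the denominator is zero, and the toy labels are assumptions of this illustration.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]   # one false negative and one false positive

score = mcc(y_true, y_pred)   # (2*2 - 1*1) / sqrt(3*3*3*3) = 1/3
```

A perfect classifier obtains 1, a classifier that inverts every label obtains -1, and a classifier that predicts a single class for all samples obtains 0 under the convention above.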

VI. RESULTS AND DISCUSSION
In this section, we present the model evaluation results.
In Table 3, we report on the performance of the lexicon-based models by using hand-crafted feature engineering, based on the Loughran-McDonald (LM) financial and general Harvard IV-4 dictionaries. We perform the evaluations by using the Lydia system polarity detection, machine-learning classifiers, and deep-learning models, as described in previous sections. As expected, the Loughran-McDonald features outperform the Harvard IV-4 general-purpose sentiment analysis dictionary. Hence, feature extraction with a domain-specific dictionary is a better approach for sentiment analysis tasks. The best performing model is the XGB classifier using LM features, achieving MCC=0.327. Additionally, we find that RNN networks outperform CNN and fully-connected dense networks. The improved results are due to the RNN networks' ability to remember sequential data, which is crucial for classification of sentences. Furthermore, the bidirectional context and attention layer improve the results when used in combination with RNN networks.
In Table 4, we present the results of the experiments performed on features extracted from statistical methods. We use ML classifiers and a deep neural network classifier based on fully connected dense layers. These methods show good results, achieving an MCC score of 0.667, about twice the best score of the lexicon-based methods (MCC=0.327).
In Table 5, we present the evaluation results of the word encoders. Generally, the best score is achieved when using Stanford's GloVe with a bidirectional GRU and an attention layer (MCC=0.704). Here, the attention layer increases the MCC score by approximately 0.04 compared to the BiGRU method without the attention layer (MCC=0.666). Additionally, the evaluated word encoders achieve better results when used with RNN networks, which further learn the context from the attention layer. In all tests, the GRU units outperform the LSTM units.
The features extracted from word encoders are significantly better compared to the features extracted by using lexicons and dictionaries. Furthermore, the word encoders perform better than statistical methods for feature extraction, which implies that incorporating semantic meaning into the word representation is useful for classification.
The results obtained from the evaluation of sentence encoders are presented in Table 6. InferSent, developed by Facebook, is the best performing sentence-based encoder. Its version 2 uses a simple architecture composed of fully connected dense layers which averages FastText word embeddings, thus outperforming Doc2Vec, Universal Sentence Encoder (USE), Skip-Thought-Vectors, and LASER. Additionally, InferSent outperforms the word encoder FastText, which implies that InferSent's algorithm for averaging the word embeddings has superior efficiency for sentence context representation. Furthermore, we find that the FastText version of InferSent outperforms the GloVe version of InferSent. When using sentence vector representation, ML classifiers are more effective than CNN and a fully connected dense network.
In Table 7, we present the results of the first contextual word encoder, ELMo, which we evaluate in combination with ML classifiers (SVC, XGB) and DL classifier models (Dense, CNN and RNN). ELMo embeddings outperform the evaluated word encoders with fixed embeddings. This confirms the hypothesis that contextual word vectors extract better features than the fixed ones. Additionally, concatenated vectors of word embeddings in combination with a BiGRU network and an attention layer outperform the other ML and DL networks.
We also evaluate the popular NLP transformers which support text classification. We fine-tune them with training data in order to bias the embeddings towards financial sentiment analysis. All transformer architectures outperform word and sentence encoders, as shown in Table 8. Hence, contextualized embeddings perform semantic tasks better than their non-contextualized counterparts. Among the family of BERT transformers, BERT-Large-uncased achieves the best score in classification, with MCC=0.859. Although FinBERT was pre-trained on Reuters financial texts, it does not perform as well as the other pre-trained versions of BERT, which use Wikipedia and BookCorpus as text corpora for pre-training.
RoBERTa's dynamic masking improves on the BERT result by 0.023. DistilBERT retains more than 95% of the accuracy while having 40% fewer parameters. A distilled version of RoBERTa achieves results as good as BERT-large while using half the parameters of the teacher RoBERTa-base model. Among the ALBERT family of transformers, the ALBERT-xxlarge pre-trained model outperforms the other ALBERT versions, obtaining MCC=0.881. Additionally, ALBERT outperforms the BERT model. The cross-lingual language model (XLM) also outperforms BERT and XLNet. XLM-MLM-en-2048 achieves the best result, with MCC=0.863, among all XLM versions. Finally, the latest NLP transformer, Facebook's BART, outperforms all the other NLP transformers when applied to finance data, achieving the best MCC score of 0.895.
We show the performances of text representation approaches in Table 2, while the performance of each method chronologically is shown in Fig. 9.

VII. CONCLUSION
This paper presents a comprehensive chronological study of NLP-based methods for sentiment analysis in finance. The study begins with the lexicon-based approach, includes word and sentence encoders and concludes with recent NLP transformers. The NLP transformers show superior performance compared to the other evaluated approaches. The main progress in sentiment analysis accuracy is driven by the text representation methods, which feed the semantic meaning of the words and sentences into the models. The results achieved by the best models are comparable to experts' opinions. The evaluations were performed on a relatively small dataset of approximately 2000 sentences. Even though the dataset is not large, we obtained good results, suggesting that this approach is appropriate for domains where large annotated datasets are not available.
Distilled versions (Distilled-BERT and Distilled-RoBERTa) of NLP transformers achieve text classification performances comparable to their large, uncompressed teacher models. Hence, they can be effectively used in text classification production environments, where the need for light-weight, responsive, energy-efficient and cost-saving models is essential.
The results of this study can be applied in areas such as finance, where decision-making is based on sentiment extraction from massive textual datasets. The findings imply that selected models can be successfully used for forecasting stock market trends and corporate earnings, decision-making in securities trading and portfolio management, brand reputation management as well as fraud detection and regulation [87]- [89].
Although this approach was constructed for sentiment analysis in the finance domain, it can be extended to other areas such as healthcare, legal and business analytics.

APPENDIX A RESULTS
See Tables 3-8.