Time Series Impact Through Topic Modeling

A time series of numerical data and a sequence of time-ordered documents are often correlated. This paper aims at modeling the impact that the underlying themes discussed in the text data have on the time series. To do so, we introduce an original topic model, Time Series Impact Through Topic Modeling (TSITM), that includes contextual data by coupling Latent Dirichlet Allocation (LDA) with linear regression, using an elastic net prior that sets the impact of uncorrelated topics to zero. The resulting topics act as explanatory variables for the regression of the numerical time series, which allows us to understand the time series movements based on the events described in the text data. We have tested our model on two datasets: first, we used political news to explain the US president's disapproval ratings; then, we considered a corpus of economic news to explain the financial returns of four different multinational corporations. Our experiments show that an appropriate selection of hyperparameters (via repeated random subsampling validation and Bayesian optimization) leads to significant correlations: both an intrinsic baseline and state-of-the-art methods were significantly outperformed by TSITM in MSE, MAE and out-of-sample $R^{2}$, according to our hypothesis tests. We believe that this framework can be useful in the context of reputational risk management.

The appearance of investor-based social networks such as StockTwits and the Dongfang Wealth Network has also spurred interest in making investment recommendations from investors' comments about stock trends [5], [6]. To us, it is clear that the joint analysis of text and numerical time series data is a growing field of research of considerable practical importance.

Typically, this analysis has been approached from the domain of text regression [7]. In its simplest form, text regression may take as input a TF or TF-IDF representation of the documents to predict a time series [8]. In this framework, a weight is obtained for each word of the vocabulary, thus determining the contribution of each document to the forecasting of the dependent variable. However, we identify two main problems in this approach: first, there is a high risk of overfitting, given that there are as many regressors as words in the vocabulary; second, the interpretability of the results is limited.

The results of our experiments are presented in section VII; the implications and limitations of our work are discussed in section VIII. We conclude with some final remarks in section IX.

One of the first attempts to correlate topics with time series that appeared in the scientific literature was the Iterative Topic Modeling with Time Series Feedback (ITMTF) model [13], an iterative framework for discovering causal topics. The strategy is to progressively increase the correlation of the topics with the time series data through the introduction of prior distributions in a feedback mechanism. The appropriate prior distributions are obtained as follows: first, a collection of topics is extracted from a corpus of texts using any standard topic model; then a causality measure is computed to identify correlations between the topics and the time series, and a selection of candidate causal topics is obtained; for each of these candidate topics, the most significant causal words are obtained by applying the causality measure at the word level; finally, the prior distribution is defined by using the previously identified significant words and their impact values, separating positive-impact terms from negative-impact terms, and assigning prior proportions according to the significance levels. By applying the topic model to the collection of documents again, but using the new priors, the new topics will be more correlated with the time series. This process is repeated until a stopping criterion is reached. Although this framework is successful at introducing external information into the topics, it must be noted that it does not make use of a unified model for text and numerical data. This implies that there will be two different probabilities that are not simultaneously optimized, so the efforts to create semantically coherent topics might not be the right direction for increasing the correlation with the time series. We also note that this framework does not provide a mechanism to forecast the time series values given the corresponding texts either.

The standard example of a unified model that jointly generates topics and a discrete or continuous response variable is supervised Latent Dirichlet Allocation (sLDA) [14].
Its generative process mirrors that of LDA, but it also pairs each document of the corpus with a response variable modeled by a normal distribution whose mean is a function of the empirical frequencies of the topics in that document. However, sLDA is mainly used in document classification tasks and was not originally intended to be applied to time series prediction. Recent attempts to employ sLDA for this purpose showed a severe overfitting of the time series by distorting the topic parameters [15]. Additionally, notice that in sLDA each document is used to predict a value of the time series, instead of aggregating all the documents published in a given time period to make a single prediction. This problem could be addressed by repeating each numerical value as many times as the total number of documents published in that time period, but then the model would not be sensitive to the number of documents per period.

A related model infers the presence of each topic k for each document d (as in LDA) and, at the aggregate level, takes the scalar product of coefficients b_k with α_tk (which has been previously normalized using a softmax function). The b_k coefficients, which can be positive or negative, quantify the impact of each topic on the time series. This model can forecast future values of the series if the matching texts are provided. There is no regularization on the b_k parameters, hence they are prone to overfitting when the number of expected topics is sufficiently large (or this could lead to model complexity issues). The number of topics is set by human choice, and that determines how many correlations will be discovered by the model. Also, the coupling of b_k with the softmax of the α_tk introduces multi-collinearity.

FinLDA [19] couples a topic model with a financial time series to produce features that are subsequently fed to standard predictors (such as Support Vector Machines or a Multi-Layer Perceptron). As compared to the aim of our work, the main shortcoming of this model is that only one document is allowed to be published at each time lag. We have in mind situations where several documents, often dealing with very different topics, are published at the same time lag. Another issue is the lack of a lasso-type regularization in this model, which implies that topics that show no correlation with the time series will still be used to construct the FinLDA features. It is also important to notice that standardization of the inputs is not performed in this model, which poses a problem under different scalings of the data.

In [20], the output of a topic model (LDA) trained on a Norwegian financial news dataset is used as predictor variables in a regression (a Latent Threshold Model, to enforce sparsity), in order to predict several macroeconomic variables, including asset prices as summarized by the Oslo Exchange index OSEBX. There is no coupling between the time series and the topics, in the sense that the topics are trained alone. Surprisingly, the daily topic distribution is normalized (to one) every day, so that the presence of a topic on a given day depends on the overall distribution of the other topics that same day.

In this article, we propose a joint model that takes as inputs both the collection of documents and the numerical time series, a requirement that is not satisfied by most of the models previously cited. RG2 is more commonly satisfied, although some models lack a topic-dependent response (this is the case for [13]).

In this section, we describe our proposed model, TSITM. In subsection IV-A, we will briefly review the LDA model and the elastic net regularization, which are the two building blocks that will later be used to construct our model, in order to fix notation and terminology. In subsection IV-B, we will introduce the complete log-likelihood for our model and describe its core features. Finally, in subsection IV-C we will propose an optimization strategy for the log-likelihood using the ECM algorithm, paying special attention to the use of proximal operators to find numerical solutions when no analytical methods are available.

For most of the paper, we will follow the standard notation used in topic modeling, as it appears in [9] and [10]: a word w is an item from a vocabulary of size V, represented by a one-hot encoding vector, and a document is a sequence of N words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the nth word in the sequence. We define a corpus as a collection of D documents. Let $n_{dw}$ denote the number of appearances of the word w in the document d. The first building block that we will employ in the construction of our model is Latent Dirichlet Allocation [9], which aims at optimizing the following log-likelihood:

$$\mathcal{L}_{\text{LDA}} = \sum_{d=1}^{D}\sum_{w=1}^{V} n_{dw} \log \sum_{k=1}^{K} \theta_{dk}\,\beta_{kw} + \sum_{d=1}^{D}\sum_{k=1}^{K} (\alpha_k - 1)\log\theta_{dk} + \sum_{k=1}^{K}\sum_{w=1}^{V} (\eta_w - 1)\log\beta_{kw}, \tag{1}$$

where k indexes each of the K topics; β_kw are the parameters of a multinomial distribution over the vocabulary for each topic (i.e. the probability that topic k contains word w); θ_dk are the parameters of a multinomial distribution over the topics for each document (i.e. the probability that document d covers topic k); and α_k and η_w are the hyperparameters of the Dirichlet distributions used as priors for the topic distributions (i.e. to ''smooth'' the multinomial parameters). When optimizing this log-likelihood via the EM method, a latent variable z_dwk is introduced to indicate which topic generated each word of a document.
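For illustration, the objective in (1) can be evaluated with a few lines of Python (a sketch of ours, not the authors' implementation; the function name and the symmetric prior values are placeholder assumptions):

import numpy as np

def lda_log_likelihood(n_dw, theta, beta, alpha=1.1, eta=1.01):
    # n_dw: D x V word counts; theta: D x K; beta: K x V
    eps = 1e-12
    doc_word_prob = theta @ beta                     # sum_k theta_dk * beta_kw
    ll = np.sum(n_dw * np.log(doc_word_prob + eps))  # first term of (1)
    ll += (alpha - 1.0) * np.sum(np.log(theta + eps))  # Dirichlet prior on theta
    ll += (eta - 1.0) * np.sum(np.log(beta + eps))     # Dirichlet prior on beta
    return ll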

The second building block is regularized linear regression. We will consider a time series of target values y_t, with t = 1, ..., T, and J series of regressors X_tj. We will make use of linear regression with elastic net regularization [21], which linearly combines the L1 and L2 penalties of the Lasso and Ridge methods, respectively, to fit the data. The log-likelihood for this problem is:

$$\mathcal{L}_{\text{reg}} = -\sum_{t=1}^{T}\Big(y_t - \mu_0 - \sum_{j=1}^{J} X_{tj}\,\mu_j\Big)^{2} - \lambda_{\mu}\sum_{j=1}^{J}\Big(\rho\,\lvert\mu_j\rvert + \frac{1-\rho}{2}\,\mu_j^{2}\Big). \tag{2}$$

Up to normalization, the exponential of (2) can be seen as the product of a normal distribution with mean $\mu_0 + \sum_{j=1}^{J} X_{tj}\,\mu_j$ and an exponential prior distribution for the $\mu_j$ coefficients of the type $p(\mu_j) \propto \exp\big(-\lambda_\mu\big(\rho\,\lvert\mu_j\rvert + \frac{1-\rho}{2}\,\mu_j^{2}\big)\big)$.

In our model, the regressors will be the topic presences aggregated over the documents published at each time t,

$$\theta_{tk} = \sum_{d \in D_t} \theta_{dk}, \tag{4}$$

where D_t is the set of documents published at time t. Using the same symbol θ for both concepts is an abuse of notation; however, no conflict will occur because time indexes will be labeled with t, t′, ... and document indexes with d, d′, .... Each value y_t of the time series is then generated by a normal distribution that depends on the topic presence parameters θ_tk at each timestamp t and the regression coefficients µ_k, which in turn are derived from an elastic net prior. The result is depicted in Fig. (1b) and can be explicitly described with the following generative process:

1) For each topic k ∈ {1, ..., K}, draw word proportions β_k ∼ Dirichlet(η).
2) For each document d ∈ {1, ..., D}:
   i) Draw topic proportions θ_d ∼ Dirichlet(α).
   ii) For each word token n ∈ {1, ..., N_d}:
      a) Draw a topic assignment z_dn ∼ Multinomial(θ_d).
      b) Draw a word w_dn ∼ Multinomial(β_{z_dn}).
3) Draw each regression coefficient µ_k from the elastic net prior.
4) Draw each value of the time series as y_t ∼ N(µ_0 + Σ_k µ_k θ_tk, σ²), for t = 1, ..., T.
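For concreteness, this generative process can be simulated as follows (an illustrative sketch of ours; the sizes, hyperparameters and the Gaussian placeholder draw for µ_k are assumptions):

import numpy as np

rng = np.random.default_rng(0)
K, V, D, T = 5, 100, 40, 10
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # topic-word distributions
theta = rng.dirichlet(np.full(K, 0.1), size=D)   # document-topic proportions
doc_time = rng.integers(0, T, size=D)            # publication time of each doc

docs = []
for d in range(D):
    z = rng.choice(K, size=50, p=theta[d])               # topic assignments
    docs.append([rng.choice(V, p=beta[zk]) for zk in z]) # word tokens

mu0 = 0.0
mu = rng.normal(scale=0.5, size=K)   # placeholder for the elastic net prior
# theta_tk: topic presence aggregated over the documents of each day (4)
theta_t = np.array([theta[doc_time == t].sum(axis=0) for t in range(T)])
y = rng.normal(mu0 + theta_t @ mu, 1.0)          # time series values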

This way, the same model that generates the topics will also generate the time series in a joint manner. Notice that we are not normalizing the mean of the Gaussian by the number of documents at each time t, N_t (as done in [22]), because we assume that the absolute number of news items should have an impact on the time series. This way, we allow a null impact on a day with little or no news.

The generation of the time series y_t from the topic presence over time θ_tk has an additional interpretation in light of the central limit theorem: if we had assumed that each document d published at time t had a certain impact on the value of y_t (which we could model by a normal distribution with mean $\mu_0^{(d)} + \sum_k \mu_k\,\theta_{dk}$), then the combined impact of all the documents published at that time would result in a Gaussian mixture. However, by applying the central limit theorem, this Gaussian mixture can be reduced to a single Gaussian distribution with mean $\mu_0 + \sum_k \mu_k\,\theta_{tk}$, finally arriving at the same result. Explicitly, assuming independent additive impacts, $y_t = \sum_{d \in D_t} y_t^{(d)}$ with $y_t^{(d)} \sim \mathcal{N}\big(\mu_0^{(d)} + \sum_k \mu_k\,\theta_{dk},\, \sigma_d^2\big)$ yields $y_t \sim \mathcal{N}\big(\mu_0 + \sum_k \mu_k\,\theta_{tk},\, \sigma^2\big)$, with $\mu_0 = \sum_{d\in D_t}\mu_0^{(d)}$ and $\theta_{tk} = \sum_{d\in D_t}\theta_{dk}$.

2) MODEL LOG-LIKELIHOOD

From the generative process described above, we arrive at the following log-likelihood:

$$\mathcal{L}_0 = \mathcal{L}_{\text{LDA}} - \tau \sum_{t=1}^{T}\Big(y_t - \mu_0 - \sum_{k=1}^{K}\mu_k\,\theta_{tk}\Big)^{2} - \lambda_\mu \sum_{k=1}^{K}\Big(\rho\,\lvert\mu_k\rvert + \frac{1-\rho}{2}\,\mu_k^{2}\Big),$$

where $\mathcal{L}_{\text{LDA}}$ is given by (1). This log-likelihood can be seen as the joint probability of both (1) and (2). Notice the minus sign that precedes the second term: the optimal likelihood is obtained when the LDA log-likelihood is maximized and the squared error of the regression is minimized. However, the relative weight of the two terms depends on the size of the corpus, and the regressors appear on scales that differ across topics, which makes it beneficial to standardize. In order to solve these two problems, we propose the following log-likelihood:

$$\mathcal{L} = \mathcal{L}_{\text{LDA}} - \frac{\tau N_C}{1000} \sum_{t=1}^{T}\Big(\hat{y}_t - \sum_{k=1}^{K}\mu_k\,\hat{\theta}_{tk}\Big)^{2} - \lambda_\mu \sum_{k=1}^{K}\Big(\rho\,\lvert\mu_k\rvert + \frac{1-\rho}{2}\,\mu_k^{2}\Big). \tag{6}$$

This hyperparameter has been multiplied by the total number of tokens, $N_C = \sum_{d,w} n_{dw}$, with the goal of making it invariant across corpora of different sizes, and divided by 1000 for readability purposes. In order to understand the role of τ, consider the limit case τ → 0: in this case, we recover the LDA log-likelihood, thus neglecting the curve-fitting part. As we increase the value of τ, the regression part becomes more relevant and the optimization surface changes towards that of a linear regression. In the opposite limit case, τ → ∞, the optimization would be dominated by the linear regression and no effort would be put into the creation of topics.

In order to put all predictors on a common scale, (6) also introduces scaling for the regressors and their parameters, which we denote by a hat:

$$\hat{\theta}_{tk} = \frac{\theta_{tk} - \bar{\theta}_k}{\sigma_k},$$

where σ_k refers to the standard deviation and $\bar{\theta}_k$ denotes the mean over time of the topic presence θ_tk,

$$\bar{\theta}_k = \frac{1}{T}\sum_{t=1}^{T}\theta_{tk}, \qquad \sigma_k = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\big(\theta_{tk} - \bar{\theta}_k\big)^{2}}. \tag{9}$$

Rescaling of topics is a novelty of our method which is not present in any other model cited in section II. This rescaling makes all topic time series equivalent for regularized linear regression, thus avoiding excessive penalization of the topics that appear less frequently.

We are also assuming that the input data, ŷ_t, has been previously scaled and its mean removed. We refer the reader to Table 1 for a complete summary of all the variables described above, as well as their domains.

In order to optimize (6), we will make use of the expectation–conditional maximization (ECM) algorithm. Let us consider the nth iteration of this process (we assume that in the first iteration of the algorithm, n = 0, all the parameters are randomly initialized). In the E-step, we introduce a latent variable z_dwk that assigns word w in document d to topic k. Applying the Bayes rule with the parameters fixed at the nth iteration, we can estimate the mean value of z_dwk:

$$z_{dwk}^{(n)} = \frac{\theta_{dk}^{(n)}\,\beta_{kw}^{(n)}}{\sum_{k'=1}^{K}\theta_{dk'}^{(n)}\,\beta_{k'w}^{(n)}}. \tag{11}$$

This result, which is the same that one would obtain in standard LDA, allows us to overcome the difficulty arising from the presence of the summation over k that appears inside the logarithm in (6). Indeed, the expected value of the complete-data likelihood can then be written as

$$\mathbb{E}[\mathcal{L}] = \sum_{d,w,k} n_{dw}\, z_{dwk}\big(\log\theta_{dk} + \log\beta_{kw}\big) + \sum_{d,k}(\alpha_k - 1)\log\theta_{dk} + \sum_{k,w}(\eta_w - 1)\log\beta_{kw} - \frac{\tau N_C}{1000}\sum_{t=1}^{T}\Big(\hat{y}_t - \sum_{k=1}^{K}\mu_k\,\hat{\theta}_{tk}\Big)^{2} - \lambda_\mu\sum_{k=1}^{K}\Big(\rho\,\lvert\mu_k\rvert + \frac{1-\rho}{2}\,\mu_k^{2}\Big), \tag{12}$$

with z_dwk given by (11).
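For illustration, the E-step (11) can be vectorized as follows (our sketch; the array shapes are assumptions):

import numpy as np

def e_step(theta, beta):
    # theta: D x K document-topic proportions; beta: K x V topic-word
    # probabilities. Returns z: D x V x K responsibilities as in (11).
    z = theta[:, None, :] * beta.T[None, :, :]   # theta_dk * beta_kw
    z /= z.sum(axis=2, keepdims=True)            # normalize over topics k
    return z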
In the first CM-step, we maximize (12) with respect to β_kw, subject to the normalization constraint $\sum_w \beta_{kw} = 1$, which yields

$$\beta_{kw}^{(n+1)} \propto \eta_w - 1 + \sum_{d=1}^{D} n_{dw}\, z_{dwk}^{(n)}. \tag{13}$$

This is the well-known solution for $\beta_{kw}^{(n+1)}$ that is obtained when LDA is optimized by using the EM algorithm, and we have recovered it here in its exact same form. On the other hand, due to the appearance of θ_tk in the regression term of (6), which is responsible for the coupling that characterizes TSITM, the computation of $\theta_{dk}^{(n+1)}$ is not simple and will be different from the standard result that one would expect in LDA. The optimization problem in this case is

$$\theta_d^{(n+1)} = \arg\max_{\theta_d}\;\Big\{\sum_{w,k} n_{dw}\, z_{dwk}^{(n)} \log\theta_{dk} + \sum_{k}(\alpha_k - 1)\log\theta_{dk} - \frac{\tau N_C}{1000}\sum_{t}\Big(\hat{y}_t - \sum_{k}\mu_k^{(n)}\,\hat{\theta}_{tk}\Big)^{2}\Big\} \quad \text{s.t. } \theta_{dk} \geq 0,\; \sum_{k=1}^{K}\theta_{dk} = 1, \tag{14}$$

where θ_dk appears both explicitly and implicitly (through the θ_tk and σ_k terms in the regression term, defined by (4) and (9)). We solve this problem numerically with a proximal gradient algorithm, as detailed in the Appendix.

Finally, in the last CM-step we maximize with respect to the regression coefficients. Since µ_k only appears in the second term of (6), the task can be reduced to solving an elastic net regression problem like the one described in (2). There is no closed-form solution for this optimization problem, but standard numerical solvers are available, so we do not reproduce the details here.
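The β-step (13) and the µ-step can be sketched as follows (illustrative code of ours; sklearn's ElasticNet uses slightly different penalty scaling conventions than (2), and the θ-step, which requires the proximal gradient algorithm of the Appendix, is omitted):

import numpy as np
from sklearn.linear_model import ElasticNet

def m_step_beta(n_dw, z, eta=1.01):
    # (13): beta_kw ~ eta_w - 1 + sum_d n_dw * z_dwk, normalized over w
    beta = (eta - 1.0) + np.einsum('dw,dwk->kw', n_dw, z)
    return beta / beta.sum(axis=1, keepdims=True)

def m_step_mu(theta_hat_t, y_hat, lam=0.05, rho=0.7):
    # The mu-step reduces to an elastic net regression of the standardized
    # series y_hat on the standardized topic series theta_hat_t (T x K).
    reg = ElasticNet(alpha=lam, l1_ratio=rho, fit_intercept=False)
    reg.fit(theta_hat_t, y_hat)
    return reg.coef_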

The E- and CM-steps described above must be repeated until a given convergence criterion for L is satisfied. We summarize the process in Algorithm 1.

V. DESCRIPTION OF THE DATASETS

Presidential ratings are good candidates for finding correlations with text, since we expect that the support for governments may fluctuate based on political news. In order to avoid as much bias as possible, we will be using the FiveThirtyEight president's disapproval rating, a daily time series which is obtained by averaging a comprehensive set of polls coming from different sources. Polls are weighted based on their methodological standards and historical accuracy, and also adjusted for house effects if they consistently show different results from the polling consensus. According to this averaged rating, the highest peak of the US president's disapproval rating among likely or registered voters during 2019 occurred on January 27th, with a 55.58% disapproval rating. Our goal in this experiment is to explain what kind of topics drove the disapproval ratings up to these levels according to the TSITM model. In order to do so, we will focus our analysis on the first quarter of 2019 (see Fig. 2).

For the stock market data, we differentiate the numerical series to obtain the returns, r_t, defined as

$$r_t = \frac{p_t - p_{t-1}}{p_{t-1}},$$

where p_t denotes the closing price at time t.

We then standardize it by removing the mean and scaling it to unit variance. The resulting time series is shown as the gray line in Fig. 4.

For the preprocessing, we have followed the exact same procedure described in section V-A1. As in the previous case, we have truncated publication timestamps to the unit of days; however, since trading hours for the Nasdaq stock market are from 9:30 a.m. to 4 p.m. (Eastern Time) on weekdays, we have assigned the next day as the date for all the news published after 4 p.m., and we have removed weekends and holidays.

After this process, we end up with the following datasets: for the first quarter, there are 1502 documents and a vocabulary of 3045 terms; for the second quarter, 2061 documents and a vocabulary of 3677 terms; for the third quarter, 1986 documents and a vocabulary of 3712 terms; and for the last quarter, 2054 documents and a vocabulary of 3763 terms. Notice that the amount of news about economic issues is smaller than the amount of news about politics and international relations but, thanks to the scale introduced in (6), we do not expect large changes in the optimal value of τ.

There is no universally valid and objective metric to assess the performance of topic models [31]. However, our framework provides a simple way to measure the goodness of fit of TSITM thanks to the supervised nature of linear regression. We will train TSITM on a training set composed of documents and the associated time series, and then we will use a test set of documents to predict the values of the time series for them (see Fig. 3 for a system overview). We will then compare the predicted time series with the actually observed values, and compute metrics to measure the prediction error. There are many possible metrics to quantify the quality of a linear regression, but we will restrict ourselves to the three most common:

• The mean squared error (MSE), defined as

$$\text{MSE} = \frac{1}{T_{\text{test}}}\sum_{t=1}^{T_{\text{test}}}\big(\hat{y}^{\text{obs}}_t - \hat{y}^{\text{pred}}_t\big)^{2},$$

where T_test is the total number of observations in the test set, $\hat{y}^{\text{obs}}_t$ is the observed value of the time series at time t, and $\hat{y}^{\text{pred}}_t$ is the value predicted by the model for the time series at time t. Since TSITM optimizes a quadratic type of error (see (6)), the MSE is the main metric that we will be monitoring. The lower the MSE, the better the ability to fit the data.
• The mean absolute error (MAE), defined as

$$\text{MAE} = \frac{1}{T_{\text{test}}}\sum_{t=1}^{T_{\text{test}}}\big\lvert\hat{y}^{\text{obs}}_t - \hat{y}^{\text{pred}}_t\big\rvert,$$

which is less sensitive to large individual errors than the MSE.
• The out-of-sample R², defined as

$$R^{2}_{OS} = 1 - \frac{\sum_t \big(\hat{y}^{\text{obs}}_t - \hat{y}^{\text{pred}}_t\big)^{2}}{\sum_t \big(\hat{y}^{\text{obs}}_t - \bar{y}^{\text{train}}\big)^{2}},$$

which compares the squared error of the model against that of a constant benchmark (the training mean, $\bar{y}^{\text{train}}$); positive values indicate an improvement over the benchmark.
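For reference, these three metrics can be computed as in the following sketch (our own helper; the constant training-mean benchmark in R²_OS is our assumption of the standard out-of-sample definition):

import numpy as np

def metrics(y_obs, y_pred, y_train_mean):
    # MSE and MAE over the test set
    mse = np.mean((y_obs - y_pred) ** 2)
    mae = np.mean(np.abs(y_obs - y_pred))
    # Out-of-sample R^2 against a constant (training-mean) benchmark
    r2_os = 1.0 - np.sum((y_obs - y_pred) ** 2) \
                / np.sum((y_obs - y_train_mean) ** 2)
    return mse, mae, r2_os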

As an external baseline that also takes into account text data, we will consider the sLDA model [14] discussed in section II-B1, since it is the most widespread model for supervised topic modeling tasks and it has been implemented in many programming languages. In particular, we will make use of the tomotopy package for Python, which implements sLDA by making use of Gibbs sampling.
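A minimal usage sketch (our own; the training settings are illustrative, and train_docs and test_words are assumed to be given as token lists with their response values):

import tomotopy as tp

mdl = tp.SLDAModel(k=20, vars=['l'])   # one linear (Gaussian) response
for words, y_value in train_docs:      # words: list of tokens
    mdl.add_doc(words, y=[y_value])
mdl.train(iter=1000)                   # Gibbs sampling iterations

doc = mdl.make_doc(test_words)
mdl.infer(doc)                         # infer topic proportions
y_hat = mdl.estimate(doc)              # predicted response value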

However, comparisons between sLDA and TSITM are not straightforward because in sLDA a single prediction is made for each document, while in TSITM we aggregate all the documents published in a given period of time and make a single prediction from them. We have considered two possible approaches to make these models comparable.

As a first approach, we follow the strategy described in [15] and combine all news articles published in a given period into a single document. We then label that unified document with the time series value for that period. This way, sLDA will make a single prediction for a group of documents, as in TSITM.

The second approach to make sLDA comparable with TSITM consists of making a different prediction for each document and then averaging the error over each period of time. For example, if n documents are published on the same day, we make n predictions with sLDA, compute the regression errors with respect to the same time series value for each of them, and then average them to get a single error. This resulting error can then be compared with the error obtained with TSITM for that day.
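The second approach amounts to the following simple computation (sketch of ours):

import numpy as np

def mean_daily_error(slda_preds, y_true):
    # slda_preds: sLDA predictions for the n documents of one day;
    # y_true: the single observed time series value for that day.
    return float(np.mean((np.asarray(slda_preds) - y_true) ** 2))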

In order to determine the hyperparameters of TSITM (that is, τ, λ_µ, ρ, η and α), we will perform a cross-validation experiment consisting of repeated random subsampling with 5 samples, each made of 10% of the training dates. We will choose the combination of hyperparameters that yields the best mean validation error. It is well known that different initializations of the optimization problem of a topic model will yield different results due to the presence of several local minima, so we will first train 10 LDA models with different initializations each time and select the random seed that yields the highest log-likelihood.
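A sketch of this validation loop (our illustration; fit_and_score is an assumed callback that trains TSITM on the retained dates and returns the validation error on the held-out ones):

import numpy as np

def subsampling_error(dates, fit_and_score, n_repeats=5, holdout=0.10,
                      seed=0):
    dates = np.asarray(dates)
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_repeats):
        held = rng.choice(len(dates), int(holdout * len(dates)),
                          replace=False)
        mask = np.zeros(len(dates), dtype=bool)
        mask[held] = True
        errors.append(fit_and_score(dates[~mask], dates[mask]))
    return float(np.mean(errors))   # mean validation error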

For an efficient selection of hyperparameters, we have used Bayesian Optimization [32] with a Gaussian process regression as the method for statistical inference [33], and a probabilistic mixture of negative expected improvement, negative probability of improvement and lower confidence bound as the acquisition function. We used the implementation provided by the scikit-optimize package [34]. As the range of possible values for τ (which is the most relevant hyperparameter for this discussion), we have set 0 < τ ≤ 5. For sLDA, we will use the same α and η that were obtained for TSITM, to make the comparisons as close as possible.
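The search can be reproduced along the following lines (a sketch of ours; cv_error is an assumed function implementing the subsampling validation above, and the ranges for λ_µ and ρ are illustrative). Note that scikit-optimize's 'gp_hedge' acquisition function is precisely a probabilistic mixture of EI, PI and LCB:

from skopt import gp_minimize
from skopt.space import Real

space = [
    Real(1e-3, 5.0, name='tau'),   # coupling strength, 0 < tau <= 5
    Real(1e-4, 1.0, name='lambda_mu', prior='log-uniform'),
    Real(0.0, 1.0, name='rho'),
]

result = gp_minimize(
    lambda x: cv_error(tau=x[0], lambda_mu=x[1], rho=x[2]),
    space,
    acq_func='gp_hedge',   # probabilistic mixture of EI, PI and LCB
    n_calls=50,
    random_state=0,
)
print('best hyperparameters:', result.x)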

In this section, we present the results for the dataset described in section V-A. As we said before, we have repeated the same experiment 5 times, performing a different train-test split each time. This way, we expect to detect statistically significant signals and exclude accidental correlations.

In Table 2 we report the resulting metrics for each split. The time series predicted by TSITM (based solely on the presence of the three topics discussed above over time, θ_tk, and their impact coefficients $\hat{\mu}_k$) is depicted as a black line in Fig. 4, where we compare it with the observed numerical data, $\hat{y}^{\text{obs}}_t$ (after the preprocessing discussed in section V-A2), depicted in gray.

With respect to the economics datasets, the error of TSITM is consistently smaller than that of the intrinsic baseline (or equivalently, its R²_OS is consistently higher). This can be seen by noting that its R²_OS is positive in 18 out of 20 splits. The MAE is also generally smaller in TSITM than in the intrinsic baseline, although this metric shows less impressive scores (improvement occurs in 14 out of 20 sets). This result can be understood by noting that TSITM optimizes a quadratic type of error, which makes the model more sensitive to large responses in the time series, while the MAE does not exhibit this property. Note that the signals that can be discovered in financial data are typically very small, so one should not be discouraged by the apparently modest values of R²_OS that have been obtained with TSITM.

With respect to the external baselines (sLDA approaches 1 and 2), we note that the lack of regularization in these models leads to volatile results. While they are the best-performing models in some splits (particularly in the QT2 dataset), they do not exhibit consistent reliability across datasets, as shown by the fact that in several splits they are not even able to improve on the intrinsic baseline.

In Table 9 we report the value of τ used for each model, as well as the mean across the different train-test partitions.

These optimal values show that the model benefits from a certain degree of coupling between topic modeling and linear regression. In fact, in some cases the model might have benefited from an even stronger coupling (this seems to be the case for the second and third quarters).

In Tables 5-8 we show the topics discovered for each quarter. We have only reported the topics with an impact coefficient $\hat{\mu}_k$ different from zero, i.e. the topics that present a significant correlation with the time series according to the TSITM model. Although we trained these models with a significantly higher number of topics, the L1 constraint of the elastic net regularization drove the coefficients of most of them to zero, as we discussed in section IV-A. This is a significant asset in the context of topic modeling and unsupervised learning, since the determination of the optimal number of topics is usually a human-made choice. For TSITM, on the other hand, one simply has to choose a sufficiently large number of topics (typically, from 10 to 30) and the system drives most of the coefficients to zero.

Lastly, observe that the resulting topics were expected to influence big corporations and multinationals like the ones that we have analyzed: the top words reveal that the stock values of these companies are sensitive to international trade agreements, tariffs, investments, taxes and other economic topics that appear consistently in all the experiments.

In the previous two sections, we have observed that TSITM tends to perform better than the baselines. However, when comparing two groups of samples, the difference in their values may be a result of random variations. In order to claim that the results are significant, we need statistical evidence that our model shows an advantage over the baselines. Hypothesis testing will be used to discard the possibility that the differences between models are accidental.

A standard statistical methodology to perform hypothesis testing for model comparison is the following [35]. Let M and M′ be two models, and let X_i and X′_i be the respective metric values for each model on the ith data split (i = 1, ..., N, where N is the total number of data splits). We can define the difference in metrics for each split as $\delta x_i = X_i - X'_i$. The average difference $\bar{\delta x}$ and the standard deviation $\sigma_{\delta x}$ are estimated by

$$\bar{\delta x} = \frac{1}{N}\sum_{i=1}^{N}\delta x_i, \qquad \sigma_{\delta x} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\big(\delta x_i - \bar{\delta x}\big)^{2}},$$

and the error on the estimated average is $\sigma_{\delta x}/\sqrt{N}$. To perform hypothesis testing, we assume that the differences $\delta x_1, \ldots, \delta x_N$ are sampled from a Student's t-distribution with (N − 1) degrees of freedom, so that

$$t = \frac{\bar{\delta x}}{\sigma_{\delta x}/\sqrt{N}}.$$

The null hypothesis is that the mean µ is 0 (i.e. there is no significant difference between the models). We compute t for each dataset and check the corresponding p-value for a one-tailed test (we want to test whether TSITM performs better than each baseline). A p-value of 0.05 or less is typically considered strong evidence that the null hypothesis can be discarded, although a p-value as high as 0.10 could also be accepted.
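In code, the test reduces to a few lines (our sketch; x_tsitm and x_baseline hold a metric, e.g. the MSE, for each data split):

import numpy as np
from scipy import stats

def one_tailed_p(x_tsitm, x_baseline):
    # Paired differences; positive when TSITM has the smaller error
    dx = np.asarray(x_baseline) - np.asarray(x_tsitm)
    t = dx.mean() / (dx.std(ddof=1) / np.sqrt(len(dx)))
    return stats.t.sf(t, df=len(dx) - 1)   # one-tailed p-value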
Results of the hypothesis testing are shown in Table 10, where we report the p-value for the hypothesis that TSITM performs better than each baseline, for each metric and each of the 5 datasets considered (the politics dataset and the four economics datasets), as well as the total results when all the splits from all datasets are combined. We observe that our results exhibit statistical significance in most cases, and it is especially interesting to look at the results for R²_OS.

TSITM's errors were consistently smaller than those of the baseline models (see Tables 2 and 4, as well as the hypothesis testing in Table 10).

Notice that we have included standardization of the inputs as a core feature of TSITM. Standardization is a crucial step, since the solutions to a regularized linear regression problem are not equivariant under scaling of the inputs. This feature was not present in many of the previous works discussed in section II, such as [19].

Compared to other text regression approaches that do not make use of topics (such as those based on neural networks and news sentiments [36], [37], [38]), the main advantage of TSITM is its improved interpretability. Topic modeling allows us not only to make predictions of unobserved values of the time series, but also to understand the reasons behind the predictions and explain why certain news impacted the time series.

We believe that this model may contribute to the quantification of the concept of reputational risk by providing a strategy to objectively determine how much the value of a certain entity (reflected by its stock values or popularity ratings) has been affected by adverse events or damages to the entity's reputation. TSITM could also be employed for reputation polarity analysis [39], one of the core tasks of Online Reputation Management, which consists of determining whether the publication of a text about an entity will positively or negatively impact the entity's reputation.

A potential limitation of our experimental framework is that the parameters have been optimized with the goal of discovering signals across the entire period, regardless of their position. This could be undesirable if the main goal were to forecast the future (for example, for stock trading purposes) instead of understanding the past, since the resulting signals may be far from the last point of the time series. However, as we explicitly stated in section III, this was not our objective.

Notice that in this work we do not claim that TSITM is the best way to explain the proposed series (either financial returns or disapproval ratings) from the information available at a given time. For example, presidential disapproval ratings show clear autoregressive features, and the previous day's rating could be used together with the news to improve the results. A similar case could occur with stock market data when there is a varying trend superimposed on the returns. We leave these questions open for future research.

Lastly, it must be noted that, as we increase τ, the number of iterations required to achieve a certain degree of convergence for L increases. This is illustrated in Fig. 5, where we plot, for the politics dataset, the number of ECM steps needed to reach the same rate of convergence in L for different values of τ, while keeping all the remaining hyperparameters fixed. The reason behind this behavior is the fact that the optimization of θ_dk involves a numerical algorithm rather than a closed-form update.

We have presented companies for which relevant signals were discovered using a corpus made of economic news; however, we also found other companies for which no clear correlations with this type of news were found, yielding an MSE compatible with noise. This is natural, since not all companies are expected to be impacted by the same type of news, but it highlights the fact that more elaborate filters may have to be used if one wants to perform a fine-grained analysis of a particular entity.

Choosing the appropriate time series also requires domain knowledge. For example, in [40] the output of an LDA topic decomposition applied to a financial news dataset is used as the features of a naïve Bayes classifier in order to predict the ups and downs of asset volatilities and close prices. A significant signal was found in the case of volatilities, while the prediction of close price changes did not do better than random choice. That work is close in spirit to our research, but it only attempts to classify ups and downs, and there is no coupling between the time series and the topics, in the sense that the topics are trained alone.

When dealing with a new dataset, the selection of hyperparameters is a crucial step. As described in section VI-C, we used Bayesian Optimization for this task.

With respect to the number of topics, K, for a new dataset we suggest fixing it before applying the Bayesian Optimization to the rest of the hyperparameters. We do so by training several LDA models with different K and then calculating the reconstruction error over a test set for each of them. We define this error as the Kullback-Leibler divergence between the observed documents and the ''reconstructed'' documents (i.e. $\sum_{k=1}^{K}\theta_{dk}\,\beta_{kw}$). The rate of improvement of this error as more topics are added is then compared with the rate that would be obtained on a noisy synthetic dataset, and an optimal K is determined. This optimal K can also be employed for any other method used as a baseline (such as sLDA).
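A sketch of this reconstruction-error criterion (our illustration; n_dw holds held-out word counts, and theta, beta come from an LDA model trained with K topics):

import numpy as np

def reconstruction_error(n_dw, theta, beta):
    eps = 1e-12
    # Empirical word distribution of each held-out document
    p_obs = n_dw / (n_dw.sum(axis=1, keepdims=True) + eps)
    # "Reconstructed" documents: sum_k theta_dk * beta_kw
    p_rec = theta @ beta
    kl = np.sum(p_obs * np.log((p_obs + eps) / (p_rec + eps)), axis=1)
    return kl.mean()   # average KL divergence across documents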

The coupling of LDA with regularized linear regression allows us to obtain topics specifically customized to fit a particular time series, with the coupling hyperparameter, τ, weighting the relative importance of the regression term with respect to the topic model.

A strategy to empirically determine the appropriate values for the hyperparameters was discussed, and successful applications of the algorithm were demonstrated in two different domains: with the US president's disapproval ratings and political news, we focused on one specific period to illustrate how the obtained results were easily interpretable and consistent with our expectations; with stock market data and economic news, we showed that it is possible to consistently find reliable signals across different corpora and time series. Both the intrinsic baseline and the two sLDA approaches used as state-of-the-art baselines were significantly outperformed by TSITM in MSE, MAE and R²_OS, according to our hypothesis tests.

The results of these experiments proved that the three research goals proposed in section III were fulfilled by the TSITM model. The results also illustrated that TSITM could be used to quantitatively determine which topics damage the reputation of a given entity.

APPENDIX

The proximal gradient method addresses optimization problems of the form

$$\min_{x}\; f(x) + g(x), \tag{25}$$

where f: ℝⁿ → ℝ is closed, proper, convex and differentiable, and g: ℝⁿ → ℝ ∪ {∞} is closed, proper and convex (constraints on x are typically encoded in g). The proximal gradient method for optimizing (25) consists of iteratively applying steps of the form

$$x^{(l+1)} = \operatorname{prox}_{\lambda^{(l)} g}\big(x^{(l)} - \lambda^{(l)}\,\nabla f(x^{(l)})\big),$$

where λ^{(l)} is a step size chosen at each step, and the proximal operator, $\operatorname{prox}_{\lambda g}(v)$: ℝⁿ → ℝⁿ, is defined as

$$\operatorname{prox}_{\lambda g}(v) = \operatorname*{arg\,min}_{x}\Big(g(x) + \frac{1}{2\lambda}\,\lVert x - v\rVert_{2}^{2}\Big).$$

The proximal operator $\operatorname{prox}_{\lambda g}(v)$ can be seen as a compromise between minimizing g and staying near v. The step size is typically chosen with a line search such as the backtracking rule proposed in [45], which we reproduce in Algorithm 2 with the notation used in [25]. For the stopping condition in Algorithm 2, the following upper bound of f, $\hat{f}_\lambda$, was introduced:

$$\hat{f}_{\lambda}(z, x) = f(x) + \nabla f(x)^{\top}(z - x) + \frac{1}{2\lambda}\,\lVert z - x\rVert_{2}^{2}.$$

Algorithm 2 Backtracking Rule for the Determination of λ^{(l)}
1: given x^{(l)}, λ^{(l−1)} and parameter β ∈ (0, 1).
2: Let λ = λ^{(l−1)}.
3: Let z = prox_{λg}(x^{(l)} − λ∇f(x^{(l)})).
4: while f(z) > f̂_λ(z, x^{(l)}) do
5:    Update λ = βλ.
6:    Let z = prox_{λg}(x^{(l)} − λ∇f(x^{(l)})).
7: end while
8: return λ^{(l)} = λ.

To see why the proximal gradient algorithm is suitable for our purposes, let us rewrite the objective function in (14) (which we precede here by a minus sign, since the proximal gradient algorithm is typically framed as a minimization) as

$$\min_{\theta_d}\; f(\theta_d) + \delta_S(\theta_d), \qquad f(\theta_d) = -\sum_{w,k} n_{dw}\,z_{dwk}\,\log\theta_{dk} - \sum_{k}(\alpha_k - 1)\,\log\theta_{dk} + \frac{\tau N_C}{1000}\sum_{t}\Big(\hat{y}_t - \sum_{k}\mu_k\,\hat{\theta}_{tk}\Big)^{2}, \tag{31}$$

and we have encoded the constraints of (14) into (31) through the function δ_S(θ), which is zero in the region of allowed parameters, S, and infinite elsewhere, that is,

$$\delta_S(\theta) = \begin{cases} 0 & \text{if } \theta \in S, \\ +\infty & \text{otherwise.} \end{cases}$$

Within each CM-step of the ECM algorithm, the proximal operator will be iteratively applied until convergence, so we will denote by l each step of the proximal gradient algorithm to avoid notational ambiguities.
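For illustration, the following sketch (ours) implements the proximal gradient iteration with the backtracking rule of Algorithm 2, specialized to a simplex constraint like the one on θ_d, for which prox_{λg} reduces to the Euclidean projection onto the simplex:

import numpy as np

def project_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[idx] / (idx + 1.0), 0.0)

def proximal_gradient(f, grad_f, x0, lam=1.0, beta=0.5, n_iter=100):
    x = project_simplex(np.asarray(x0, dtype=float))
    for _ in range(n_iter):
        grad = grad_f(x)
        while True:                          # backtracking (Algorithm 2)
            z = project_simplex(x - lam * grad)
            f_hat = (f(x) + grad @ (z - x)
                     + np.sum((z - x) ** 2) / (2.0 * lam))
            if f(z) <= f_hat:                # upper bound satisfied
                break
            lam *= beta                      # shrink the step size
        x = z
    return x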

Also notice that, to simplify, we have written θ_dk, z_dwk and µ_k to refer to the values $\theta_{dk}^{(n)}$, $z_{dwk}^{(n)}$ and $\mu_k^{(n)}$ at the current ECM iteration. By taking derivatives of (42) with respect to θ_dk and setting them equal to zero, we finally obtain (43), the final form of each of the steps that must be taken in the proximal gradient algorithm. However, there are two parameters that remain to be determined in (43): the set of Lagrange multipliers µ_d and the step size λ^{(l)}.

To determine the value of the Lagrange multipliers, recall that the solution (43) must satisfy the condition $\sum_{k=1}^{K} \theta_{dk} = 1$; imposing this normalization yields equation (45). From (45), µ_d can be numerically determined with a standard one-dimensional root-finding procedure; we omit a detailed description of this solution here.