Feature Engineering for Mid-Price Prediction with Deep Learning

Mid-price movement prediction based on limit order book (LOB) data is a challenging task due to the complexity and dynamics of the LOB. So far, there have been very few attempts at extracting relevant features from LOB data. In this paper, we address this problem by designing a new set of handcrafted features and performing an extensive experimental evaluation on both liquid and illiquid stocks. More specifically, we implement a new set of econometric features that capture statistical properties of the underlying securities for the task of mid-price prediction. Moreover, we develop a new experimental protocol for online learning that treats the task as a multi-objective optimization problem and predicts i) the direction of the next price movement and ii) the number of order book events that occur until the change takes place. In order to predict the mid-price movement, the features are fed into nine different deep learning models based on multi-layer perceptrons (MLP), convolutional neural networks (CNN) and long short-term memory (LSTM) neural networks. The performance of the proposed method is then evaluated on liquid and illiquid stocks based on TotalView-ITCH US and Nordic data, respectively. For some stocks, the results suggest that the correct choice of feature set and model can lead to successful prediction of how long it takes for a stock price movement to occur.


Introduction
The automation of financial markets has increased the complexity of information analysis. This complexity can be managed effectively by the use of ordered trading universes such as the limit order book (LOB). The LOB is a structure that organizes the daily unexecuted trading activity into price levels according to the type of orders (i.e., the bid and ask sides). The daily trading activity is a big data problem, since millions of trading events take place within a single trading session. Information extraction and digital signal (i.e., time series) analysis of every trading session provide the machine learning (ML) trader with useful guidance for orders, executions, and cancellations of trades.
Traditional time series analysis methods have failed to adequately capture the complexity of contemporary trading markets. For instance, the works by Qian (2017) and Siami-Namini & Namin (2018) suggest that classical machine learning and deep learning methods for financial metric prediction achieve better results than ARIMA and GARCH models. Machine and deep learning methods have proved to be very effective for time series analysis and prediction (e.g., Chen et al. (2018), Nousi et al. (2018), Sirignano & Cont (2018)). The main advantage of these methods is their ability to capture non-linearities of the input data and filter them consecutively by creating new weighted features that are more relevant to the problem at hand.
Despite their efficacy in predicting time series, machine and deep learning methods are developed mainly through empirical testing. The majority of the literature based on deep learning frameworks (e.g., Velay & Daniel (2018), Dash & Dash (2016), Gudelek et al. (2017)) relies only on raw data or on a limited number of features. So far, very little attention has been paid to the information that a neural network should analyze for the mid-price prediction task. In this paper, we shed light on the information that the ML trader should consider useful for the task of mid-price movement prediction. For this reason, we introduce to the literature features based on econometrics and make a head-to-head comparison with indicators derived from technical and quantitative analysis (i.e., Ntakaris et al. (2018a)), and with time-sensitive and time-insensitive features (i.e., Kercheval & Zhang (2015) and Ntakaris et al. (2018b)).
We choose econometrics as motivation for our handcrafted features since it is the field of financial engineering that can mathematically capture the empirical evidence of microstructure noise and the causal structure of the data. Our data comes with variation in prices, known in the financial literature as volatility, a measure that we incorporate into our handcrafted features. Despite the general perception in the academic literature that volatility itself is not a factor that affects stock returns, considerable evidence exists to support the opposite. For instance, Guo (2004) finds that volatility, together with other proxies that are not directly observable in the data, such as the liquidity premium, affects stock returns. In the same direction, Lettau & Ludvigson (2001) provide evidence that the consumption-to-wealth ratio offers information for excess stock market returns, where volatility explains a significant portion of these returns. Another example is the work by Chung & Chuwonganant (2018), where the authors find strong evidence that market volatility affects individual stock returns and that this effect depends on how the liquidity of individual stocks responds to unanticipated changes in market volatility. As a result, we regard these findings as reliable indications that econometric measures are worth considering as features for the task of mid-price movement prediction.
We perform our analysis based on deep learning models that have recently been proposed for financial time series analysis, ranging from multi-layer perceptrons (MLP) to convolutional neural networks (CNN) and recurrent neural networks (RNN) such as long short-term memory (LSTM) networks. For our experiments, we use two TotalView-ITCH datasets from the US and Nordic stock markets. We formulate these experiments based on two protocols. The first one (i.e., "Protocol I" in our experiments) is introduced here for the first time and is based on online learning: the mid-price movement is predicted at every event, and the task is treated as a multi-objective optimization problem since we predict both when and in which direction the mid-price movement will happen. The second protocol (i.e., "Protocol II" in our experiments) is an existing protocol based on the work of Tsantekidis et al. (2018), where the mid-price movement prediction is treated as a three-class classification problem (i.e., up, down or stationary mid-price states) for every 10th future event.
The main contribution of our work rests on three pillars. The first pillar is the introduction of econometric features, for the first time in the literature, as inputs to deep learning models for mid-price movement prediction. The second pillar is an extensive evaluation of the newly introduced features against two other feature sets from the literature. We conduct a fair evaluation of these feature sets since we use the same nine deep learning models for liquid and illiquid stocks and for unbalanced and balanced feature sets, and test them not only on the newly introduced experimental protocol but also with a protocol suggested in the literature for the Nordic dataset that we also utilize here. The third pillar is a new experimental protocol that takes every trading event into consideration and is unaffected by the time irregularities inherent in high-frequency data. Our work suggests that feature extraction should be customized according to stock and model selection.
The remainder of the paper is organized as follows. We provide a comprehensive literature review in Section 2. The problem statement is provided in Section 3. The list of handcrafted features follows in Section 4. In Section 5, we describe the various deep learning models adopted in our analysis, while in Section 6 we describe the details of the datasets and the experimental protocols. In Section 7 we provide the empirical results, and Section 8 concludes the paper. A detailed description of the econometric features used in our experiments is provided in the Appendix, together with results for Protocol II.

Literature Review
High-frequency LOB data analysis has captured the interest of the machine learning community. The complex and chaotic behavior of the data inflow has opened the way for non-linear methods such as those found in machine and deep learning. For instance, Zhang et al. (2019) utilize neural networks for the prediction of the Baltic Dry Index and provide a head-to-head comparison with econometric models. Sirignano (2016) develops a new type of deep neural network which captures the local behavior of a LOB for spatial distribution modeling. An RNN is applied by Dixon (2018) to S&P500 E-mini futures data for price change forecasting. An RNN architecture is also proposed by Minh et al. (2018) for short-term stock prediction by successfully utilizing financial news and a sentiment dictionary. Zhang et al. (2018) apply a combined neural network model based on CNN and RNN for mid-price prediction.
Metric prediction, such as mid-price prediction, can be facilitated by the use of handcrafted features. Handcrafted features reveal hidden information as they are capable of translating time series signals into meaningful trading instructions for the ML trader. Several authors have worked in this direction, such as Kercheval & Zhang (2015), Passalis et al. (2017), Ntakaris et al. (2018b), Tran et al. (2017), Tran et al. (2018), Zheng et al. (2012) and Sirignano (2016). These works presented limited sets of features, varying from raw LOB data to price-change densities and volume imbalance metrics. Another work that provides a wider range of features is the one presented by Ntakaris et al. (2018a). The authors there extracted handcrafted features based on the majority of the technical indicators and developed a new quantitative feature based on logistic regression which outperformed the suggested feature list.
Handcrafted features represent only one part of the experimental protocol in the quest for mid-price prediction. Classification, via deep learning methods, is the continuation of a machine learning protocol. Many authors have used deep learning in the financial literature for several problems. For example, Alberg & Lipton (2017) use MLPs and RNNs for forecasting companies' future fundamentals. Qian (2017) utilizes machine and deep learning methods, such as support vector machines (SVM), MLPs, denoising auto-encoders (DAE), and an assembled DAE-SVM model, in order to predict future trends of stock index prices. These machine and deep learning models outperformed traditional time series models like ARIMA and generalized autoregressive conditional heteroskedasticity (GARCH). Sezer et al. (2017) use MLPs and the three most commonly used technical indicators as inputs for stock price movement prediction.
Many authors utilize LOB data as input to their models. For instance, Nousi et al. (2018) examine the performance of several machine learning methods, such as autoencoders (AE), the bag-of-features algorithm, single hidden layer feedforward neural networks (SLFN) and MLPs, for mid-price prediction. Han et al. (2015) apply decision trees to LOB data and outperform support vector machines (SVM) for the problem of mid-price prediction. In the same direction, Kanagal et al. (2017) apply similar methods to market order book data for market movement prediction. Doering et al. (2017) utilize event flow and limit order datasets for price-trend and price-volatility prediction based on a deep learning architecture. Mäkinen et al. (2018) predict price jumps with the use of an LSTM where the input is based on LOB data. A similar work, in terms of the neural model, was conducted by Tsantekidis et al. (2017b) in order to forecast the LOB mid-price.
To the best of our knowledge, this is the first time that econometric features based on high-frequency LOB data have been proposed as input to several neural networks for mid-price prediction. A head-to-head comparison with an extensive list of other features is conducted, and results for both balanced and unbalanced sets are reported based on two massive high-frequency datasets with US and Nordic stocks.

Problem Statement
The problem under consideration is mid-price movement prediction based on high-frequency LOB data. More specifically, we use message and limit order books as input for the suggested handcrafted features. The message book (MB), as can be seen in Table 1, contains the flow of information which takes place at every event occurrence. The information displayed by every incoming event includes the timestamp of the order, execution or cancellation, the id of the trade, the price, the volume, the type of the event (i.e., order, execution or cancellation) and the side of the event (i.e., ask or bid).
The LOB (Table 2) operates under specific rules determined by the trading exchange. The main advantage of an order book is that it accepts both limit orders and market orders. The former is the case where the trader/broker is willing to buy or sell a financial instrument at a specific price or better, while the latter refers to the action of buying or selling a stock at the current price. LOBs accept orders from liquidity providers, who submit limit orders, and liquidity takers, who submit market orders. These limit orders, which represent the unexecuted trading activity until a market order arrives or a cancellation takes place, construct the LOB, which is divided into levels.
The best level consists of the highest bid and the lowest ask price orders, and their average price defines the so-called mid-price, the metric whose movement we try to predict. We treat mid-price movement prediction as a multi-objective optimization problem with two outputs, one related to classification and the other to regression. The first part of our objective is to classify whether the mid-price will go up or down, and the second part, the regression part, is to predict in how many events this movement will take place. To further explain this, we extract the intraday labels by measuring, from time t_k, in how many events the mid-price will change and in which direction (i.e., up or down). For instance, if the mid-price will change 10 events from now and it will go up, then our label at time t_k is {1, 10}, where 1 is the direction of the mid-price and 10 is the number of events that must pass before that movement takes place.
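As a minimal illustration of this labeling scheme, the following Python sketch scans forward from each event until the mid-price changes and records the direction and the number of events until that change; the function name and the toy mid-price series are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_labels(mid_prices):
    """For each event, return (direction, horizon): the sign of the next
    mid-price change (1 up, -1 down) and the number of events until it occurs.
    Events after the last change receive no label (None). Illustrative only."""
    mid_prices = np.asarray(mid_prices, dtype=float)
    labels = []
    n = len(mid_prices)
    for k in range(n):
        label = None
        # scan forward until the mid-price differs from its value at event k
        for j in range(k + 1, n):
            if mid_prices[j] != mid_prices[k]:
                direction = 1 if mid_prices[j] > mid_prices[k] else -1
                label = (direction, j - k)   # e.g. (1, 10): up in 10 events
                break
        labels.append(label)
    return labels

# Toy example: the mid-price moves up after 3 events, then down after 2 more.
print(extract_labels([10.0, 10.0, 10.0, 10.5, 10.5, 10.2]))
```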
This labeling system is the basis for answering the critical question of whether handcrafted features derived from econometrics can boost deep learning classification and regression performance. We conduct extensive experiments based on nine neural topologies (i.e., five MLPs, two CNNs, and two LSTMs) and two TotalView-ITCH datasets, and compare the performance of the econometric features with two other feature sets from the literature. The first set is based on time-sensitive and time-insensitive features as presented by Kercheval & Zhang (2015) and Ntakaris et al. (2018b), and the second feature set, based on technical and quantitative analysis, was introduced by Ntakaris et al. (2018a).

Feature Pool
In this section we provide in Table 3 the nominal list of the newly introduced econometric features for mid-price prediction, together with two other state-of-the-art feature sets from the literature, which are based on technical and quantitative analysis and on time-insensitive and time-sensitive indicators. The description of the econometric features can be found in the Appendix, whereas the descriptions of the technical and quantitative feature set and of the time-sensitive and time-insensitive LOB set can be found in the work by Ntakaris et al. (2018a). We extract our econometric features from both the MB and the LOB and divide them into four main categories: statistical features, volatility measures, noise measures, and price discovery features. The first category comprises basic statistical features which are widely used in the literature (e.g., Kercheval & Zhang (2015), Sirignano (2016)). The primary driver in the choice of the volatility measures feature set is the intimate relation between the volatility of the price process and the price movement itself. As such, we deem the volatility measures included in the present article to retain information useful for real-time price prediction. This is particularly true when the predicted objective is the next price movement. Additionally, the econometric literature widely documents the significant detrimental impact of the so-called microstructure noise on the measurement of fundamental quantities when working at the highest frequencies. Furthermore, the noise process directly affects the underlying price process itself and as such contributes to the observed price movements. For these reasons we implement a number of estimates of the characteristics of the noise process, which we identify as the noise measures feature set. The last group of features includes all those related to the price discovery process, i.e., those that take into account the interaction of the two sides of the LOB. Several articles in the literature (e.g., Ntakaris et al. (2018a), Mäkinen et al. (2018)) have focused on and demonstrated the importance of accounting for the differences between the ask and bid sides in order to improve mid-price forecasting accuracy.

Deep Learning
The goal of this paper is to forecast the movement of the mid-price. The predicted output carries dual information: the direction of the mid-price movement and the number of events it takes for the mid-price to move up or down. A very efficient way to do this is by using deep learning architectures. We consider and run separately three different neural network types (i.e., MLPs, CNNs, and LSTMs) and examine their suitability for our optimization problem.

MLP for Classification and Regression
An MLP is a type of neural network that shows a high degree of connectivity among its components (i.e., neurons). The strength of this connectivity is determined by the synaptic weights of the neural network, which are combined with a differentiable nonlinear activation function. These basic characteristics are the reason why analyzing an MLP's behavior can be complicated. As a result, several MLP architectures have to be examined in order to see whether the input data (i.e., handcrafted features) plays a role in the outcome/prediction. An MLP is trained via a sequential data-feeding process called batch learning, where the neural network adjusts its synaptic weights after the presentation of all the samples in the training set. Denoting by $\mathbf{x}(i)$ the multi-dimensional input vector and by $\mathbf{d}(i)$ the response vector of the supervised problem at instance $i$, the error at instance $i$ is $e(i) = d(i) - y(i)$, where $d(i)$ is the desired response and $y(i)$ the produced output at instance $i$. The loss function that we use for our experiments is bespoke to our supervised problem and its components are based on the binary cross-entropy (for the classification task) and the mean squared error (for the regression task), as follows: $$L_{all}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y^{(i)}\log \hat{y}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{y}^{(i)}\big)\Big] + \lambda\, \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)}-\hat{y}^{(i)}\big)^2, \qquad (2)$$ where the two terms are balanced by the parameter $\lambda$, and $y^{(i)}$ and $\hat{y}^{(i)}$ are the ground truth and the predicted values of the $i$-th training sample, which belongs to $\mathbb{R}^N$, respectively. This customized function is part of the backpropagation algorithm that helps the neural network (e.g., the MLP) to correct the synaptic weights in order to optimize Eq. 2. Backpropagation in our case follows the reverse mode of automatic differentiation (AD) (i.e., Baydin et al. (2015)). Reverse AD facilitates the process of correcting the synaptic weights and can be described as follows. Initially we define the input variables $v_{i-n} = x_i$, $i = 1, \ldots, n$, the intermediate variables $v_i$, $i = 1, \ldots, l$, and the output variables $y_{m-i} = v_{l-i}$, $i = m-1, \ldots, 0$. The derivative calculation is a two-step process: during the first phase the intermediate variables $v_i$ are populated and create the graph trace, while during the second phase derivatives are calculated by propagating the adjoints $\bar{v}_i = \frac{\partial y_l}{\partial v_i}$. In general, reverse AD performs the calculations from the output to the input, starting from the output as seed, $\bar{v}_l = \frac{\partial y_l}{\partial v_l} = 1$, and moving towards the inputs via the intermediate states based on the calculation $$\bar{v}_j = \sum_{i \in Pa(j)} \bar{v}_i\, \frac{\partial g_i}{\partial v_j},$$ where $Pa(j)$ denotes the parent formation of node $j$ and $g_i$ the intermediate functions of the graph. The next part of MLP training is the learning process, defined as the method through which the loss function reaches the optimal solution via proper parameter updates. For this reason we choose the Nesterov accelerated gradient (NAG) method incorporated into the adaptive moment estimation (Adam) method, named Nadam by Dozat (2016). Nadam applies the momentum step only once and also takes into consideration the current momentum vector (rather than the previous momentum vector). This gives the Nadam parameter update rule: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\Big(\beta_1 \hat{m}_t + \frac{(1-\beta_1)\, g_t}{1-\beta_1^{t}}\Big),$$ where the first (i.e., mean) and second (i.e., variance) moment estimates for the current momentum vector are, respectively, $\hat{m}_t = \frac{m_t}{1-\beta_1^{t+1}}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^{t}}$, with $g_t = \nabla_{\theta_t} L_{all}(\theta_t)$ and learning rate $\eta = 0.002$.
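To make the combined objective concrete, the following Keras sketch wires a feature-vector input to the dual output described above (a two-unit softmax head trained with binary cross-entropy and a one-unit linear head trained with mean squared error) and compiles it with the Nadam optimizer at learning rate 0.002. The hidden-layer sizes, variable names, and the 0.99/0.01 loss weighting (taken from the training setup described in Section 7) are illustrative assumptions rather than the exact topologies of Table 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dual_output_mlp(n_features, hidden=(128, 64)):
    """Illustrative dual-output MLP: a two-unit softmax head for the up/down
    direction and a one-unit linear head for the number of events until the
    next move. Hidden-layer sizes are assumptions, not the paper's topologies."""
    inp = layers.Input(shape=(n_features,))
    x = inp
    for units in hidden:
        x = layers.Dense(units, activation="relu")(x)
    direction = layers.Dense(2, activation="softmax", name="direction")(x)
    horizon = layers.Dense(1, activation="linear", name="horizon")(x)
    model = Model(inp, [direction, horizon])
    model.compile(
        optimizer=tf.keras.optimizers.Nadam(learning_rate=0.002),
        loss={"direction": "binary_crossentropy", "horizon": "mse"},
        loss_weights={"direction": 0.99, "horizon": 0.01},  # weighting assumed from Sec. 7
    )
    return model

model = build_dual_output_mlp(n_features=40)
model.summary()
```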

CNN for Classification and Regression
A CNN, as described in Goodfellow et al. (2016), is a type of neural network that can handle time series of multi-dimensional data for metric prediction. The main motivation for choosing this type of neural network is its capability for sparse connectivity between neural layers, for sharing the so-called tied weights, and for its equivariant representation properties. More specifically, sparse connectivity can be achieved by using a kernel smaller than the sample input, which reduces the amount of memory required for the training process. The second advantage of the CNN is the use of tied weights: the same set of weights is shared across (i.e., applied to) the inputs. A CNN has three main parts: the convolution layer, the pooling layer, and the fully connected layer. The convolution layer extracts features from the input multi-dimensional signal, usually expressed as a tensor or matrix. This process creates linear activations which are passed through a non-linear activation function such as the rectified linear unit (ReLU) or the Leaky ReLU. The pooling layer then converts the local outputs based on a summary statistic of the neighboring outputs (e.g., max pooling). The last step of the process is the connection to the fully connected layers that perform the classification and regression tasks. These tasks are based on discrete time series events which formulate the (forward) convolution layer calculation as follows: $$y_{i^{l+1},\, j^{l+1},\, d'} = \sum_{i=0}^{H-1}\sum_{j=0}^{W-1}\sum_{d=0}^{D^{l}-1} f_{i,j,d,d'}\; x^{l}_{\,i^{l+1}+i,\; j^{l+1}+j,\; d}\,,$$ where $H$, $W$, and $D^{l}$ are the row, column, and depth dimensions of the input tensor $x^{l} \in \mathbb{R}^{H^{l} \times W^{l} \times D^{l}}$, respectively, $f$ is the filter bank, and the indexing $(i^{l+1}+i,\, j^{l+1}+j,\, d)$ refers to the iterative local convolution of the filter bank on the suggested input for layer $l$. Pooling is performed right after convolution, where for our experiments we choose max pooling. The last step is the use of fully connected layers, whose structure is the same as in Sec. 5.1. The process that we follow in order to train our CNN parameters (i.e., filter banks and synaptic/tied weights) is based on batch learning combined with reverse AD (i.e., backpropagation), as in the MLP case.
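The following numpy sketch implements the forward convolution sum above with naive loops, assuming a single unpadded layer with stride one and no bias; array names and sizes are illustrative only.

```python
import numpy as np

def conv_forward(x, f):
    """Naive forward convolution for one layer, mirroring the equation above.
    x: input tensor of shape (H_l, W_l, D_l); f: filter bank of shape
    (H, W, D_l, D_next). Returns y of shape (H_l-H+1, W_l-W+1, D_next).
    Illustrative only: no padding, stride 1, no bias."""
    H_l, W_l, D_l = x.shape
    H, W, _, D_next = f.shape
    y = np.zeros((H_l - H + 1, W_l - W + 1, D_next))
    for i_out in range(y.shape[0]):
        for j_out in range(y.shape[1]):
            for d_out in range(D_next):
                # sum over the local receptive field and all input channels
                patch = x[i_out:i_out + H, j_out:j_out + W, :]
                y[i_out, j_out, d_out] = np.sum(patch * f[:, :, :, d_out])
    return y

x = np.random.randn(10, 40, 1)      # e.g. 10 events x 40 features x 1 channel
f = np.random.randn(3, 3, 1, 8)     # 8 filters of size 3x3
print(conv_forward(x, f).shape)      # (8, 38, 8)
```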

LSTM for Classification and Regression
Time series require the ML trader to take their temporal behaviour into consideration. The events that we deal with in the LOB universe arrive in such a sequential manner. Sequential systems, like RNNs, are based on computational graphs and are ideal for time series analysis. RNNs provide much flexibility in terms of architecture formation, where the basic idea can be described as $$h^{(t)} = f\big(h^{(t-1)}, x^{(t)}; \theta\big), \qquad (6)$$ where $h^{(t)}$ and $x^{(t)}$ are the state and the input at time $t$ and $\theta$ are the shared parameters of the transition function $f$. Since we use RNNs for empirical calculations, we choose to forecast the mid-price by using gated RNNs (named LSTM) as presented by Hochreiter & Schmidhuber (1997). The motivation for choosing this type of gated RNN is its ability to create connections through time and to account for the problem of vanishing (or exploding) gradients. Instead of applying just element-wise non-linear input transformations, LSTM units contain processes which take into consideration the sequential nature of time series. More specifically, an LSTM cell is equipped with gates which filter the information flow by applying weights internally. The first pass is the forget gate vector, $$f_i^{(t)} = \sigma\big(W_f x^{(t)} + U_f h^{(t-1)} + b_f\big)_i,$$ where $x^{(t)}$ and $h^{(t-1)}$ are the current input and previous hidden state vectors of cell $i$ at time $t$, respectively. The weight matrices attached to these vectors are $W_f$ and $U_f$ for the forget gate, with $b_f$ the bias term. The next pass is related to the information that is going to be saved to the so-called "cell state". The cell state update can be divided in two parts, the input gate and a tanh layer, as follows: $$c_i^{(t)} = f_i^{(t)}\, c_i^{(t-1)} + g_i^{(t)}\, \tanh\big(W_c x^{(t)} + U_c h^{(t-1)} + b_c\big)_i,$$ where $g^{(t)}$ is the input gate: $$g_i^{(t)} = \sigma\big(W_g x^{(t)} + U_g h^{(t-1)} + b_g\big)_i.$$ The last remaining part is the filtered output. More specifically, the LSTM output/hidden state is formulated by the output gate vector $o_i^{(t)}$, which is calculated as $$o_i^{(t)} = \sigma\big(W_o x^{(t)} + U_o h^{(t-1)} + b_o\big)_i,$$ and the final output $h_i^{(t)}$ is equal to $$h_i^{(t)} = o_i^{(t)} \tanh\big(c_i^{(t)}\big).$$ The formulation above refers to the case of a typical LSTM neural network, which we implement in Section 6. We also apply an attention mechanism to the LSTM architecture in order to weight/measure the significance of the input sequence. We follow the implementation in Zhou et al. (2016) and Mäkinen et al. (2018), where the sequential LSTM outputs (i.e., hidden states $H = [h^{(1)}, \ldots, h^{(T)}]$) are filtered via the following steps for a $K$-dimensional vector $w$: $$M = \tanh(H), \qquad \alpha = \mathrm{softmax}\big(w^{T} M\big), \qquad r = H\alpha^{T},$$ and the final LSTM-with-attention output is $$h^{*} = \tanh(r).$$ Again, here we use the same backpropagation mechanism as for the MLPs.
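A minimal Keras sketch of the attention scheme above, assuming the Zhou et al. (2016) formulation of scoring each hidden state and taking the softmax-weighted sum; layer sizes and the head design are illustrative assumptions, not the exact LSTM 2 topology.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_lstm_attention(seq_len, n_features, units=64):
    """Illustrative LSTM with attention: score each hidden state, softmax the
    scores over time, and output the attention-weighted sum of hidden states."""
    inp = layers.Input(shape=(seq_len, n_features))
    H = layers.LSTM(units, return_sequences=True)(inp)        # (batch, T, units)
    M = layers.Activation("tanh")(H)
    scores = layers.Dense(1)(M)                                # w^T M, (batch, T, 1)
    alpha = layers.Softmax(axis=1)(scores)                     # attention weights over time
    r = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([H, alpha])
    h_star = layers.Activation("tanh")(r)
    direction = layers.Dense(2, activation="softmax", name="direction")(h_star)
    horizon = layers.Dense(1, activation="linear", name="horizon")(h_star)
    return Model(inp, [direction, horizon])

model = build_lstm_attention(seq_len=10, n_features=40)
model.summary()
```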

Data Description and Experimental Protocols
Our objective is to provide informative handcrafted features to ML traders and market makers for the task of mid-price movement prediction. Prediction of this movement requires in-depth analysis in terms of data selection (e.g., liquid or illiquid stocks) and experimental protocol development. For these reasons, our analysis is based on two TotalView-ITCH datasets from the US and Nordic stock markets and two experimental protocols. The first protocol, named Protocol I, is based on online prediction and is introduced here for the first time, while the second protocol, named Protocol II, is derived from the literature (i.e., Tsantekidis et al. (2018)) and is based on mid-price movement prediction with a 10-event lag.

Data
We utilize two massive TotalView-ITCH datasets based on the US and Nordic stock markets. The time resolution of the datasets is in milliseconds. For the US dataset, we use two stocks, Amazon and Google, while for the Nordic dataset we use Kesko Oyj, Outokumpu Oyj, Sampo Oyj, Rautaruukki, and Wartsila Oyj. We use ten business days for both datasets, covering the period from 22.09.15 to 05.10.15 for the US dataset and from 01.06.10 to 14.06.10 for the Nordic dataset. The trading activity for these ten business days is 13,000,000 events for the US dataset and 4,000,000 events for the Nordic dataset. We use the MBs in order to create the relevant LOBs. We utilize the computational power of an HP Apollo 6000 XL230a/SL230s supercluster to convert the MBs to LOBs (of depth 10 on both sides). We follow several pre-processing steps before we start training the deep learning models. A general description of the pre-processing can be seen in Fig. 1.
Figure 1: A higher-level description of the steps that we follow for the present analysis. From left to right: the first step is to obtain the datasets for the US and Nordic stock markets. The second step is to convert the flow of trading activity (i.e., message books) into LOB format with a depth of 10 using Python scripts. The third step is to use the LOB datasets twice in order to create the three feature representation sets for the two experimental protocols (i.e., Protocol I for online forecasting and Protocol II for forecasting with a lag of 10 events) based on MATLAB scripts. The fourth step is the data preparation for the server based on the HDF5 format. The fifth step is to submit the deep learning Python scripts to the GPU servers at the CSC superclusters, and the sixth step is to adjust the deep learning Python scripts according to the suggested protocols.
Protocol I
MBs and LOBs are the inputs for the creation of the econometric features, the technical and quantitative features, and the time-sensitive and time-insensitive LOB indicators. Both datasets convey asynchronous information, varying from events taking place within the same millisecond to events several minutes apart from each other. In order to address this issue, we develop Protocol I, which utilizes all the given events in an online manner. More specifically, our protocol extracts a feature representation every ten events with an overlap of nine events for every next feature representation. A visual description of our protocol can be seen in plot (a) in Fig. 2. The problem under consideration in Protocol I is to predict the movement of the mid-price (i.e., classification: up or down) together with the number of events it takes for that movement to occur in the future (i.e., regression: number of events until the next mid-price movement). For testing performance evaluation we utilize the f1 score for the classification task and the RMSE (i.e., root mean square error) for the regression task. The f1 score is defined as $$F1 = 2\,\frac{Precision \times Recall}{Precision + Recall}, \qquad Recall = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP},$$ where $TP$, $FN$, and $FP$ are the true positives, false negatives, and false positives, respectively, and the RMSE is defined as $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(P_i - O_i\big)^2},$$ where $P_i$ and $O_i$ are the predicted and observed values of the $n$ samples, respectively.
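As a small illustration of these two evaluation metrics, the following Python snippet computes the f1 score and the RMSE on hypothetical predictions; the toy values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.metrics import f1_score

def rmse(predicted, observed):
    """Root mean squared error as defined above."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    return np.sqrt(np.mean((predicted - observed) ** 2))

# Toy illustration with hypothetical direction and horizon predictions.
y_true_dir, y_pred_dir = [1, -1, 1, 1], [1, 1, 1, -1]
y_true_steps, y_pred_steps = [10, 3, 7, 2], [8, 4, 7, 5]
print(f1_score(y_true_dir, y_pred_dir), rmse(y_pred_steps, y_true_steps))
```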
Our labeling system requires both classification and regression. The first part of the dual label contains the binary information 1 and -1 for the upward and downward mid-price movement, respectively. The second part of the label represents the discretization of the numeric data expressed as the number of events until the next mid-price change. A pictorial example of the above labeling system is given in Fig. 3. The label extraction can be described as follows: with $N$ being the number of mid-price (MP) samples, each target $d(i)$ counts the events until the next mid-price change and is assigned to a bin according to $L(p) \leq d(i) < L(p+1)$, $1 \leq p < Q$, where $L(p)$ is a vector which contains the bin limits in monotonically increasing order and $Q$ is the number of bins, which is equal to the total number of non-zero elements in the vector of mid-price differences.

Protocol II
On the other hand, Protocol II is based on independent 10-event blocks for the creation of the feature representations, as can be seen in plot (b) in Fig. 2. More specifically, feature representations are based on the information that can be extracted from 10 events at a time, with these 10-event blocks being independent of each other. Protocol II treats the problem of mid-price movement prediction as a three-class classification problem, where the three states are up, down, and stationary mid-price movement. These changes in the mid-price are defined based on the following calculation: $$l_t = \frac{m_a(t) - MP(t)}{MP(t)},$$ where a sample is labeled up if $l_t > \alpha$, down if $l_t < -\alpha$, and stationary otherwise, $MP(t)$ is the mid-price at time $t$, $m_a(t) = \frac{1}{r}\sum_{i=1}^{r} MP(t+i)$ is the average of the future mid-price events with window size $r = 10$, and $\alpha$, which determines the significance of the mid-price movement, is equal to $2 \times 10^{-5}$.
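A minimal Python sketch of this three-class labeling rule, assuming a plain array of mid-prices and the stated values r = 10 and alpha = 2e-5; the function name and toy series are illustrative.

```python
import numpy as np

def protocol2_labels(mid_prices, r=10, alpha=2e-5):
    """Three-class labels: compare the average of the next r mid-prices with
    the current mid-price; relative changes beyond +/-alpha are labeled
    up (1) / down (-1), otherwise stationary (0). Illustrative sketch only."""
    mp = np.asarray(mid_prices, dtype=float)
    labels = np.zeros(len(mp) - r, dtype=int)
    for t in range(len(mp) - r):
        m_a = mp[t + 1:t + r + 1].mean()        # average of the next r events
        rel_change = (m_a - mp[t]) / mp[t]
        if rel_change > alpha:
            labels[t] = 1
        elif rel_change < -alpha:
            labels[t] = -1
    return labels

print(protocol2_labels(np.linspace(10.0, 10.01, 50)))
```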

Data Normalization and Filtering
The next step of the pre-processing is data normalization. We apply two different normalization methods during feature extraction and training. The first is a statistical filtering method, while the second is based on MinMax scaling. More specifically, the first normalization setup is applied directly on the raw MB data. Before proceeding with any estimation, we eliminate any observation which does not reflect market activity. In the financial econometrics literature this is often referred to as data cleaning, and its importance has been widely discussed (e.g., Dacorogna et al. (2001), Brownlees & Gallo (2006), and Barndorff-Nielsen et al. (2009)). We filter the raw data for outliers following a multi-step procedure. We initially remove all transactions recorded outside official trading time and misrecorded transactions. As the last step of the cleaning procedure, we implement a more elaborate filtering algorithm: we take into account the statistical properties of the series and assess the validity of each observation according to its likelihood of being an outlier. Specifically, given a window size $k$, we identify a set of (centered) neighboring observations for each data point. To avoid including prices too distant in time, the window size $k$ should be chosen according to the trading intensity of the series. We then compute the trimmed mean of the neighboring set and mark the considered observation as an outlier if its distance from the neighbors' mean exceeds $\alpha$ standard deviations plus a granularity parameter $\gamma$, which should be chosen as a multiple of the tick size. The idea behind $\gamma$ is to create a lower positive bound for the admissible price variation. This is particularly important for the cleaning procedure, as it is not uncommon to observe a sequence of equal mid-prices in the LOB, which would lead to a zero variance and a consequent rejection of every price different from the mean value. The second normalization setup is based on MinMax scaling of the handcrafted features, as follows: $$X^{(i)}_{MinMax} = \frac{X^{(i)} - \min(X)}{\max(X) - \min(X)}, \qquad i = 1, \ldots, N,$$ where $N$ is the total sample size for every feature vector $X$ and $X^{(i)}$ is the $i$-th element of $X$.
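The following Python sketch illustrates the neighbor-based outlier rule described above (trimmed mean of k centered neighbors, threshold of alpha standard deviations plus the granularity term gamma); all parameter values and the toy price series are assumptions, not the exact settings used in the paper.

```python
import numpy as np

def clean_prices(prices, k=20, alpha=3.0, gamma=0.01, trim=0.1):
    """Flag outliers: an observation is discarded if it lies more than alpha
    trimmed standard deviations plus gamma away from the trimmed mean of its
    k centered neighbors. Returns a boolean mask of observations to keep."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    keep = np.ones(n, dtype=bool)
    half = k // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neighbors = np.delete(prices[lo:hi], i - lo)      # exclude the point itself
        neighbors = np.sort(neighbors)
        cut = int(len(neighbors) * trim)
        trimmed = neighbors[cut:len(neighbors) - cut] if cut > 0 else neighbors
        mean, std = trimmed.mean(), trimmed.std()
        if abs(prices[i] - mean) > alpha * std + gamma:
            keep[i] = False
    return keep

prices = np.concatenate([np.full(50, 100.0), [150.0], np.full(50, 100.05)])
print(np.where(~clean_prices(prices))[0])   # flags the spike at index 50
```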

Results & Discussion
In this section, we provide results for the conducted experiments, which are based on two massive LOB datasets from the US (i.e., two stocks: Amazon and Google) and Nordic (i.e., five stocks: Kesko Oyj, Outokumpu Oyj, Sampo Oyj, Rautaruukki, Wartsila Oyj) stock markets. The main objective of this section is to shed light on the handcrafted feature extraction universe for mid-price movement prediction. We make a head-to-head comparison of three feature sets: the first one, named "Limit Order Book (LOB):L", is based on the works of Kercheval & Zhang (2015) and Ntakaris et al. (2018b); the second one, named "Tech-Quant:T-Q", is based on Ntakaris et al. (2018a); and the last feature set, named "Econ:E", uses econometric features and is presented here for the first time in the machine learning literature.
In order to scrutinize the effectiveness of the handcrafted features, we use two experimental protocols and nine deep learning models, and present results based on unbalanced and balanced inputs. In particular, we test the three feature sets under two protocols: the newly introduced experimental protocol (i.e., Protocol I) for online learning, as explained in Section 6, and Protocol II, which is based on Tsantekidis et al. (2018). Protocol I is suitable for online learning, where the main objective is to predict when a change in the mid-price will happen (i.e., a regression problem) and in which direction, up or down (i.e., a two-class classification problem). On the other hand, Protocol II predicts the mid-price movement direction for every 10th future event, where feature representations are extracted based on independent 10-event blocks. The authors there used a joint training set of the five Nordic stocks over seven trading days, with the next three days used for testing, for mid-price movement prediction (i.e., up, down, and stationary movement). We incorporate the same idea here under the name "Joint", and we also use the same 7-3 training and testing proportion for each stock individually, for both the US and Nordic datasets. A general idea of both protocols can be seen in Fig. 4.
Protocol I and Protocol II use three types of deep neural networks as classifiers and regressors. In particular, we utilize five different MLPs, two CNNs, and two LSTMs. The motivation for choosing MLPs is that such a simple neural network can perform extremely well when descriptive handcrafted features are used as input. We report results for all five MLPs because of their versatile nature in terms of performance. The next type of neural network that we use is the CNN. The first CNN, named "CNN 1", is based on Tsantekidis et al. (2017a), while the second one, named "CNN 2", is an improved version of CNN 1 with a deeper topology. The last type of neural network that we utilize is the LSTM. We use two different architectures: the first one, named "LSTM 1", is based on Tsantekidis et al. (2018), and the second one, named "LSTM 2", is an LSTM with an attention mechanism. In total, we train nine deep neural networks independently for each of the two experimental protocols. Details of these nine topologies can be found in Table 4.
The training of these nine neural networks takes place on the CSC supercluster, where we use Pascal P100 and K80 GPUs. We use multiple GPUs, under the Keras framework (i.e., Chollet et al. (2015)), in order to reduce the training time. The models, apart from CNN 1 and LSTM 1, use the Nesterov-Adam optimizer with a learning rate of 0.002, with mean squared error and binary cross-entropy for the dual output of Protocol I (weighted by 0.01 and 0.99, respectively), and categorical cross-entropy as the loss function for Protocol II. Additionally, we train our models for 250 epochs with data shuffling and a validation ratio of 0.2.

Results
We present our results in separate tables for Protocol I (see Table 5 - Table 8) and Protocol II (see Appendix B). For each protocol, we split the results (i.e., f1 score and RMSE for Protocol I and f1 scores for Protocol II) between the Nordic and the US datasets. Each of the tables contains the full head-to-head comparison of the three handcrafted feature sets for each of the nine deep learning models separately. For instance, Table 7 contains f1 scores for the Nordic stocks based on Protocol I. The table has five main columns (i.e., Model, Stock, Econ, Tech-Quant, and LOB) and six subcolumns divided into three pairs (i.e., UnBal. and Bal.). The first main column contains the nine deep neural networks; the second main column contains the five independent Nordic stocks, where the sixth row for every model is the joint training set based on these five stocks; and the third, fourth, and fifth main columns represent the three handcrafted feature sets. Moreover, for every feature set, we present results for the unbalanced and balanced cases, where for the balanced cases we use random undersampling of the majority class. Even though balanced datasets do not project a realistic trading scenario (i.e., trading fees are not applicable), it is important to give an equal opportunity to the minority class, which can be the ML trader's trading position. More specifically, for Protocol I and the classification task, the Nordic dataset exhibits 45% downward and 55% upward movements, whereas the US dataset exhibits 47% downward and 53% upward movements. The undersampling results in an 85% data reduction for the Nordic set and 90% for the US set. For better interpretation of Protocol I, we provide bar plots which show the behavior of every deep learning model and dataset for the unbalanced and balanced cases (see Fig. 5 and Fig. 6). For Protocol II, the Nordic dataset exhibits 75% stationary samples, with the remaining 25% equally divided between upward and downward mid-price movements before undersampling. For the US dataset, 73% of the samples belong to the stationary condition, 20% to the upward movement, and the remaining 7% to the downward movement. The undersampling results in a 30% data reduction for the Nordic dataset and a 10% data reduction for the US dataset.
Figure 4: (a) Protocol I, where we test the three sets of features (i.e., Econ, Tech-Quant, and LOB) via nine deep learning models (i.e., five MLPs, two CNNs, and two LSTMs) for mid-price prediction. The mid-price prediction in this protocol is a combined prediction of when the next mid-price movement will happen and in which direction, and the protocol is based on an online learning architecture. (b) Protocol II, where we test the same three feature sets via the same nine deep learning models. The mid-price prediction in this protocol is a three-class problem with states for up, down, and stationary mid-price movement, and Protocol II predicts every 10th event from the current mid-price state. We test both protocols for both US and Nordic stocks. Plots (a) and (b) show the process for predicting the mid-price movement based on Protocol I and Protocol II, respectively. In both protocols, the first step is the choice of dataset: the ML trader has to choose the US or Nordic stock(s) (e.g., there is the option of making predictions based on a specific stock or choosing the 'Joint' case, where all the stocks from the US or Nordic market are used for training). The second step is the choice of the feature set: one of the three suggested sets, namely the newly introduced econometric set, the set based on technical and quantitative indicators, or the set based on time-sensitive and time-insensitive features. The third step is whether the prediction should be based on a balanced or unbalanced set. The fourth step is the choice of one of the suggested nine deep learning models. The final step is the one that differs between Protocol I and Protocol II: Protocol I is a combined classification and regression optimization problem with zero event lag, while Protocol II is a three-class classification problem based on a 10-event lag.

Discussion
The conducted experiments reveal some interesting results for both experimental protocols and both dataset selections. Both protocols forecast the mid-price movement, with Protocol I forecasting the mid-price movement at every next event and Protocol II with a lag of 10 events. Protocol I provides more information regarding the high-frequency activity since it takes every trading event into consideration. This protocol is appropriate, as we can see from the results, for bigger datasets like the US one (compared to the Nordic dataset, which has a smaller daily trading activity). Every one of the nine neural networks has to perform a dual task simultaneously, that of regression and classification. Starting from the Joint case, where the full range of stocks is used for training: we see that for the Nordic dataset the best f1 performance comes from MLP 3, for both unbalanced and balanced datasets, with 53% under the Econ feature set and 56% under the Tech-Quant set. This MLP did not perform well for the regression task, where the RMSE was above 165.29. For the stock-specific case, we achieve the best classification performance of a 53% f1 score for Outokumpu Oyj under MLP 4 and the Econ feature set, with an RMSE of 98.44. This stock-specific performance of MLP 4 is the best trade-off between classification and regression for the Nordic dataset. If we want to focus on the regression task only, we can choose the more advanced model, LSTM 2, with an RMSE of approximately 24 for both the unbalanced and balanced Tech-Quant feature sets for Kesko Oyj. For the US dataset, the new protocol presents more interesting results. For the Joint case, where both Amazon and Google are used for training, LSTM 2 achieves a 59% f1 score and an RMSE of 89.69, while for the stock-specific case LSTM 1 under the Tech-Quant feature set achieves a 58% f1 score and a high RMSE of 123.36 for Google in the unbalanced case. If we focus only on the regression part, we can choose the entire MLP universe and the Econ feature set for Amazon and the Joint case. The newly introduced Econ feature set also performed very well for the regression task with LSTM 2 across the entire protocol for the unbalanced dataset. One more interesting observation is that the Econ feature set together with the shallower MLP 1 and the balanced set reports very low RMSE for the Amazon, Google, and Joint cases, respectively. That means that the Econ feature set, for the Amazon and Joint cases, was able to predict that the mid-price would change its direction within a millisecond. Here, it is vital to report that the daily trading activity, for the US and Nordic stocks, contains several trades with the same timestamp/millisecond: approximately 30% of the trades in the US dataset occur within the same millisecond, while this percentage for the Nordic dataset is 36%.
Table 4: List of the nine deep learning models used for the two experimental protocols. "Output" in these networks means that for Protocol I the output consists of a dense layer with one unit and a linear activation function for the regression task and a dense layer with two units and a softmax activation function for the classification task, while for Protocol II the output is a dense layer with three units and a softmax activation function.
On the other hand, based on Protocol II and for the Joint case, we achieve the best forecasting performance of a 51% f1 score for the Nordic dataset with MLP 4 (i.e., one of our deeper suggested MLP architectures) under the Tech-Quant feature set and the unbalanced case. For the Joint case in the

Conclusion
In this paper, we extracted handcrafted features based on the econometric literature for mid-price prediction using deep learning techniques. Our work is the first of its kind since we not only introduce a new feature set, based on econometrics, for the mid-price prediction problem, but also provide a fair comparison with two other existing state-of-the-art feature sets. Our extensive experimental setup, based on liquid and illiquid stocks (i.e., stocks from the US and Nordic stock markets), sheds light on the area of deep learning and feature engineering by providing information based on online mid-price predictions. Our findings suggest that extensive analysis of the input signal can lead to high forecasting performance even with simpler neural network architectures, such as shallow MLPs, when advanced features capture the relevant information edge. In particular, econometric features and deep learning predicted that the mid-price would change direction within a millisecond for Amazon and the Joint (i.e., training on both Amazon and Google) cases. Although these results are promising, our study also suggests that the selection of features and models should differ for liquid and illiquid stocks.

Appendices
A. Feature Pool
A.1. Statistical features
• Mid-price is defined as the average of the best ask and best bid prices, $MP_t = \frac{ask^{(1)}_t + bid^{(1)}_t}{2}$.
• Financial duration is defined as the time elapsed between two consecutive events, $d_t = T_t - T_{t-1}$, where $T_t$ denotes the time instance at event $t$.
• Average mid-price financial duration is defined in terms of $\{T_k\}_{k=1}^{N}$ and $\{P_k\}_{k=1}^{N}$, the partial cumulative sums of time and price differences for every LOB level over $N$ samples.
• Mid-price at deeper levels is equal to $MP^{(l)}_t = \frac{ask^{(l)}_t + bid^{(l)}_t}{2}$, where $l$ denotes the depth level of the LOB.
• Log returns are defined as $r_i = X_i - X_{i-1}$, where $X_i$ is the logarithmic price.
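A short numpy sketch of these basic statistical features, assuming arrays of best-level quotes and event timestamps; the array names and toy values are illustrative only.

```python
import numpy as np

# Illustrative computation of the basic statistical features above from
# best-level LOB quotes; the values below are hypothetical.
best_ask = np.array([100.02, 100.03, 100.03, 100.05])
best_bid = np.array([100.00, 100.00, 100.01, 100.02])
timestamps_ms = np.array([0.0, 12.0, 12.0, 37.0])

mid_price = (best_ask + best_bid) / 2.0        # mid-price per event
duration = np.diff(timestamps_ms)               # time elapsed between events
log_price = np.log(mid_price)
log_returns = np.diff(log_price)                 # r_i = X_i - X_{i-1}

print(mid_price, duration, log_returns)
```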

A.2. Volatility measures
The features in this category aim to estimate either the integrated variance (IV), that is, the process $$IV_t = \int_0^t \sigma_s^2\, ds,$$ or, more generally, the quadratic variation (QV), $$QV_t = \int_0^t \sigma_s^2\, ds + \sum_{0 \le s \le t} (\Delta X_s)^2.$$ Here $X$ is the logarithmic price of some given asset. We assume that $X_t$ follows an Itô semimartingale; that is, $$X_t = X_0 + \int_0^t b_s\, ds + \int_0^t \sigma_s\, dW_s + \sum_{i=1}^{N_t} \zeta_i,$$ where $b$ is locally bounded, $\sigma$ is càdlàg and predictable, $W$ is a standard Wiener process, $\zeta$ is a thin (i.e., finite) process mapping the jump size, and $N$ is the counting process associated with the jump times of $X$. We define $\Delta_n$ as the time elapsed between two adjacent observations; specifically, if we assume the observations are equidistant in time we have $\Delta_n = \frac{t}{n}$. As we do not work in calendar time we will have $\Delta_n = \frac{1}{n}$.
• Realized variance. The realized variance (i.e., Andersen & Bollerslev (1998)) is the most natural estimator of the quadratic variation process and is equal to $$RV(X)_t := \sum_{i=1}^{n} r(X)_i^2, \qquad r(X)_i = X_{i\Delta_n} - X_{(i-1)\Delta_n}.$$
• Realized kernel. Realized kernels (i.e., Barndorff-Nielsen et al. (2008)) are used to obtain a noise-robust estimate of the QV as follows: $$RK(X)_t := \sum_{h=-H}^{H} k\Big(\frac{h}{H+1}\Big)\, \gamma_h(X_{\Delta_n}),$$ with $H$ the kernel bandwidth, $\gamma_h(X_{\Delta_n})$ the autocovariation process, and $k$ the kernel function of choice. In particular, we use a non-flat-top Parzen kernel and our implementation follows closely Barndorff-Nielsen et al. (2009).
• Realized pre-averaged variance. The pre-averaged realized variance (i.e., Jacod et al. (2009)) is akin to the realized kernel estimator (in fact, they are asymptotically equivalent). As for the realized kernel, the pre-averaged realized variance is used to retrieve a noise-free measurement of the quadratic variation of the price process. As before, $H$ is the bandwidth and $\theta$ the pre-averaging horizon. Further, we are given a non-zero real-valued function $g : [0, 1] \to \mathbb{R}$ with $g(0) = g(1) = 0$, which is continuous and piecewise continuously differentiable such that its derivative $g'$ is piecewise Lipschitz. Then, we define $$\psi_2 = \int_0^1 \big(g'(s)\big)^2 ds, \qquad \psi_1 = \int_0^1 \big(g(s)\big)^2 ds.$$
• Realized bipower variation. The realized bipower variation (i.e., Barndorff-Nielsen & Shephard (2004)) measures the diffusive component of the price process, isolating it from the variation caused by the jump components, and is equal to $$BV(X)_t := \frac{\pi}{2} \sum_{i=2}^{n} |r(X)_i|\,|r(X)_{i-1}|. \qquad (A.13)$$
• Realized bipower variation (lag 2): $$BV_2(X)_t := \frac{\pi}{2} \sum_{i=3}^{n} |r(X)_i|\,|r(X)_{i-2}|. \qquad (A.14)$$
• Realized bipower semivariance (+, −). Realized bipower semivariances (i.e., Barndorff-Nielsen et al. (2010)) are used to measure the upside and downside risk of the diffusive component. (A.15)
• Jump variation. We use a modified version of the jump variation estimator (i.e., Christensen et al. (2014)) which is both non-negative and consistent. As hinted by the name, the jump variation estimator provides a measure of the discontinuous variability component.
• Spot volatility. We compute the spot volatility (i.e., Barndorff-Nielsen & Shephard (2002) and Andersen et al. (2010)) estimates only on the block. The spot volatility measures the instantaneous volatility, with $h \to 0$ being the time interval upon which the measure is computed; the definition is consistent with the terminology commonly used in the literature on parametric stochastic volatility models in continuous time.
• Average spot volatility. The average spot volatility provides a historical average of the estimated spot volatilities.
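For concreteness, the following numpy sketch computes the realized variance and the (lag-1 and lag-2) realized bipower variation on a hypothetical window of log mid-prices; function names and values are illustrative.

```python
import numpy as np

def realized_variance(log_prices):
    """Realized variance: sum of squared log returns over the window."""
    r = np.diff(np.asarray(log_prices, dtype=float))
    return np.sum(r ** 2)

def bipower_variation(log_prices, lag=1):
    """Realized bipower variation (cf. A.13/A.14): pi/2 times the sum of
    products of absolute returns lag steps apart; robust to the jump part."""
    r = np.abs(np.diff(np.asarray(log_prices, dtype=float)))
    return (np.pi / 2.0) * np.sum(r[lag:] * r[:-lag])

# Illustrative window of log mid-prices (hypothetical values).
x = np.log(np.array([100.00, 100.02, 100.01, 100.05, 100.04, 100.06]))
print(realized_variance(x), bipower_variation(x), bipower_variation(x, lag=2))
```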

A.3. Noise and uncertainty measures
In this category, we incorporate two kinds of measures which are intimately linked to each other. We provide three different estimates of the integrated quarticity and two different estimates of the variance of the contaminating noise process. The integrated quarticity measures the degree of estimation error in the realized variance and can be consistently estimated through the realized quarticity estimators presented below for a fixed window size of 2000 events, while the noise variance estimates provide a measure of the intensity of the noise process affecting the underlying price:
• Realized quarticity (i.e., Barndorff-Nielsen & Shephard (2006)): $$RQ(X)_t := \frac{n}{3} \sum_{i=1}^{n} \big(X_i - X_{i-1}\big)^4. \qquad (A.20)$$
• Realized tri-power quarticity. The tri-power quarticity (i.e., Barndorff-Nielsen & Shephard (2006)) is a generalization of the realized bipower variation and is a consistent estimator of the integrated quarticity in the presence of jumps: $$TQ(X)_t := n\, \mu_{4/3}^{-3} \sum_{i=3}^{n} |r(X)_i|^{4/3}\, |r(X)_{i-1}|^{4/3}\, |r(X)_{i-2}|^{4/3}, \qquad (A.21)$$ with $\mu_p = E(|Z|^p)$, where $Z$ denotes a standard normally distributed random variable.