A Hybrid Prediction Model Integrating GARCH Models With a Distribution Manipulation Strategy Based on LSTM Networks for Stock Market Volatility

Accurate prediction of volatility is one of the most important tasks in financial decision making. Recently, the hybrid models integrating artificial neural networks with GARCH-type models have been developed, and performance gains from the models have been found to be outstanding. However, there have been few studies of hybrid models considering the nature of the distribution of financial data. Distribution of volatility time-series is highly concentrated near zero, and such aspect can cause low prediction performance on the whole domain of probability density function because weights in the networks can be trained to obtain accurate prediction only for the high frequency region, that is, near zero. To overcome the challenge, we propose a new hybrid model with GARCH-type models based on a novel non-linear filtering method to mitigate concentration property of volatility. For the filtering, we utilize root-type functions that transform extremely left-biased and pointed distribution of original volatility to a volume-upped (VU) distribution shifted to the right. Long short-term memory (LSTM) is employed as the basic implementation model, and the realized volatility of S&P 500 is predicted using the proposed models. It is found that the proposed hybrid model (VU-GARCH-LSTM) obtains 21.03% performance gain with respect to the root mean square error (RMSE) against the mean performances of the existing hybrid models integrating LSTM with GARCH-type models. Furthermore, the proposed model improves prediction performance in the right domain region of label probability density by making the prediction distribution comparable to the label distribution.


I. INTRODUCTION
In the financial market, one of the most important issues is an exact prediction of stock market volatility. Volatility in finance presents generally the amount of fluctuation of financial time-series data and is used as risk measure since investment risk depends on volatility of underlying asset. Therefore, an accurate prediction of volatility is clearly an important issue for financial practitioners, such as investors, traders and risk managers, who want to gain stable profits.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano .
However, the accurate prediction of volatility has been still a challenging task because the volatility is an unobserved latent variable and has complex features as heavy tail and non-stationary behavior. Motivated by these, we study a new novel model to improve the accuracy of prediction of stock market volatility.
There have been statistical models for predicting volatility that has the time-varying characteristics. As the parametric model to describe the conditional heteroscedasticity of financial volatility, the ARCH (Autoregressive Conditional Heteroskedasticity) model was suggested by Engle [1]. Bollerslev [2] generalized the ARCH model with the VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ moving average terms and introduced the GARCH (Generalized ARCH) model. Since GARCH model was introduced, GARCH-type models have been improved by many researchers. GARCH model of Bollerslev has the limitation that can not capture the leverage effect. To overcome this limitation and to provide more accurate prediction of volatility, Nelson [3] and Zakoian [4] proposed the EGARCH (Exponential GARCH) and the TGARCH (Threshold GARCH), respectively. In addition, there have been various extensions of the GARCH model in modeling stochastic volatility, and various GARCH-type models have been developed by many researchers [4]- [8]. However, GARCH-type models are difficult to capture complex fluctuation and nonlinear correlation of financial time-series data. In this study, we overcome the shortcomings using long short-term memory (LSTM) approach which is one of non-parametric models and improve in accuracy of volatility forecasts. ANN is one of non-parametric models and can capture complex non-linear relationships between predictors and responses. Since ANN does not need the strict assumptions to predict time-series data, ANN has been used widely for the prediction of financial time-series. Hamid and Iqbal [9] predicted implied volatility from S&P 500 Index futures options using ANN approach and found that forecasts from ANN outperform implied volatility forecasts. Sermpinis et al. [10] used higher order neural network (HONN) approach to improve a forecasting performance of the realized volatility of the FTSE 100 futures. Xu et al. [11] suggested a novel UMIDAS-SVQR model combing unrestricted mixed data sampling (UMIDAS) and support vector quantile regression (SVQR) and showed that the model is superior to several models in terms of accuracy. Ramos-Pérez et al. [12] introduced a new model with the stacked algorithms based on a set of machine learning techniques and predicted S&P 500 volatility. In addition, Idrees et al. [13] predicted efficiently the Indian stock market volatility to construct an appropriate ARIMA model. Wang et al. [14] proposed a hybrid time-series predictive neural network (HTPNN) by learning the fusion feature of news and time-series to promote the prediction accuracy. Furthermore, there have been various hybrid models based on machine learning approach for predicting of financial time-series [15]- [17].
In recent years, the hybrid models that incorporate GARCH-type models into the ANN models have been proposed for better prediction of financial time-series. In general, outputs of GARCH-type models have been used as input of the ANN models, and the hybrid models have been called 'ANN-GARCH' models. More specifically, ANN-GARCH hybrid models have shown the improvements of prediction accuracy of financial time-series such as stock market indices, gold price, oil price, metals, exchange rate and cryptocurrency, etc [18]- [27]. That is, the results with ANN-GARCH hybrid models have indicated that the models improve the traditional GARCH-type models and ANN models in predictive ability.
Various combinations of GARCH and ANN models have been used to improve the prediction performance of stock market volatility. However, the performance on the right region of the target distribution (i.e. abnormal events) is still lacking, and the improvement of a breakthrough methodology for the challenge is still meager. The volatility has an extremely biased distribution to the left, which is a major impediment to improve the prediction performance. To address this, we propose a methodology for reducing the extremeness by manipulating the volatility distribution artificially. More specifically, for the prediction of realized volatility, we propose a hybrid model that combines GARCH-types models and LSTM with a novel data filtering methodology that manipulates distributions of input data. The followings are our contributions: • A novel data filtering methodology is proposed to alleviate the extreme bias of volatility distribution.
• The proposed method outperforms the benchmarked strategies (various combinations of GARCH-type models and LSTM) in whole domain region, with exceptional performance in the right domain region of the target distribution.
• The proposed method makes the distribution of prediction similar to the target distribution.
The remainder of this paper is structured as follows. In Section II, we review the basic models used in this paper. In Section III, we introduce the data for the prediction and propose new hybrid models by integrating LSTM with the GARCH-type models and the proposed strategy based on the distribution manipulation. In Section IV, we present the experimental results of the proposed hybrid models with some meaningful figures and tables. Finally, the conclusion is given in Section V.

II. MODELS
A. GARCH-TYPE MODELS 1) GARCH The ARCH model [1] is a method that describes explicitly the change in variance over time in stochastic time-series. It estimates current volatility of a time-series using a linear combination of error terms which involve the Gaussian random noise and past information of volatilities. ARCH (q) with order q is defined by where y t is a given stochastic time-series, and µ t is its drift. σ t , F t and N (0, 1) denote the volatility at time t, information up to time t and the standard Gaussian distribution, respectively, and all coefficients are set to be nonnegative. GARCH [2] is an extension of the ARCH model that supplements a moving average component to autoregressive component. GARCH (p, q) is defined by where all coefficients are positive. In (6), the volatility is predicted by taking the weighted sum of the predicted variance (second term) and observed volatilities (third term).

2) EGARCH
Unlike GARCH model, EGARCH allows the coefficients to be negative [3]. Consequently, the model can reflect the leverage effect that is an asymmetric information effects. Namely, a negative shock has greater effect on the volatility rather than a positive shock with the same magnitude. EGARCH (p, q) is defined by ln(σ 2 where α i , β i , θ and ω are parameters to be estimated. The function g(z t ) is linear in z t with slope θ + ω (or θ − ω) if z t is positive (or negative).

3) TGARCH
TGARCH has utilized flexible lag structure in the volatility keeping the GARCH tractability, and improved the complexity of EGARCH [4]. TGARCH (p, q) is defined by and TGARCH have some similarities since the models can detect asymmetries in volatility. However, EGARCH imposes a constant structure at all lags while different lags may yield opposite contributions in TGARCH.

B. LSTM
Recurrent neural network (RNN) is a class of artificial neural networks, which is designed for the prediction of sequential data including time-series data [28]. RNN consists of input, hidden, and output layers, and can unfold with a certain depth suitable to input dataset. Classical RNN has a defect called the vanishing gradient problem [29], and LSTM overcame the problem by modifying and improving its structure [30].
The feed-forwarding process of LSTM for the input data x t and hidden state h t at time-step t can be formulated as where W i and b i are weights and bias terms, respectively, . Functions σ and tanh are defined by respectively. The symbol * means the elementwise product of two vectors. Net flow in a LSTM block is illustrated in Fig. 1. It can be shown that the total number of parameters in LSTM is 4(K I K H + K 2 H ) where K I and K H indicate the dimension of x t and h t , respectively.

III. DATA AND PROPOSED STRATEGY A. EXPLANATORY VARIABLES
In this study, 16 financial indices are employed as the input explanatory variables. The variables are listed in Table 1. For each variable, the common daily closed values from 1st-Jan-2004 to 30th-Nov-2020 (the total size of data is 3179 × 19) are used. The variables are selected based on the works of [31]- [33], and downloaded from Yahoo Finance website. Except for VIX, we change each original price (closed price) into the square of log return. That is, we use y t defined by y t = (ln(R t /R t−1 )) 2 , where R t is closed price at time t. We employ realized volatility of S&P 500 as the label (target value). The realized volatility represents how much the stock price has changed during a certain period, and defined by where p, L t , and L are the period, the log return rate of S&P 500 at time t, and the averaged value of the log return rate during p days, respectively. In this study, we set p = 5. This means that RV t is estimated from the data of one-week trading days. S&P 500 price is used not only as an input variable but also as the target value, to take full use of its historical information. However, the realized volatility is only used in the label.

B. VOLUME-UP STRATEGY
In this subsection, we propose a novel method termed Volume-Up (VU) as a non-linear filtering of input data. The distribution of the volatility time-series is extremely concentrated close to zero, but the low-frequency events distributed to the right domain of the label probability density function (PDF) have a practical significance. However, because the general loss function in a neural network only measures the average of total errors and the network is learned to decrease the average, such network properties can degrade the prediction performance of the (low-frequency) events. VU has been developed to mitigate the nature of biased input distribution, and to shift the distribution to the right in the label PDF domain. In this study, we evaluate the relationship between the artificial mitigation of the biased aspect of volatility distribution using VU strategy and the improvement in prediction performances in terms of various error metrics. Let X and Y be two random variables, and let their corresponding PDFs and cumulative distribution functions (CDFs) be f X , f Y and F X , F Y , respectively. Let g(x) be an increasing function such that Y = g(X ). Then, we have where g(x) is a monotonic function. Differentiation on both side of (18) yields where F and g −1 denote derivative of F and inverse of g, respectively. The equation (19) implies that PDF of Y can be expressed explicitly in terms of PDF of X and inverse of g.
In this study, we apply a root type function to the function g in above calculations. That is, we define the function g as g(x) = x α , 0 < α < 1, where the number α is the hyperparameter of VU to be analyzed. In this study, the hyperparameter is tested with nine values α = i/10, i = 1, 2, · · · , 9. By direct calculation using (19), we can obtain In (20), for example, if α = 1/2, then f Y (y) = f X (y 2 ) 2 y . We state some remarks on VU. First, the function root-type function is intentionally chosen to make it similar trait of F X (x). It can be shown from (18) and (19) that if g(x) = F X (x), then the random variable Y has an uniform distribution. That is, applying CDF of X to g(x) makes PDF of Y to be flattened. Functions with CDF-like shaped would alleviate the extremely biased property of input-data distribution. Second, the function g(x) maps a narrower interval near zero into a wider interval near zero, and a wider interval far from zero into a narrower interval far from zero. Third, the distribution of Y = g(X ) is pulled to the right direction. Fourth, as the hyperparameter α approaches to 1 (from the left side), g(x) approaches to the identity filtering. Fifth, the decoding process is simple. The final prediction is obtainable by applying g −1 (x) = x 1/α to the output. Distribution changes of S&P 500 volatility by g(x) are illustrated in Fig. 2.

C. PROPOSED HYBRID MODELS: INTEGRATION OF GARCH MODELS, LSTM WITH VU STRATEGY
Numerous studies have shown that the incorporation of neural networks (NN) and GARCH-type models improves prediction performances of stock market volatility compared to applying GARCH-type models only [25], [31], [34]. Also, Kim and Won [35] reported that employing information extracted by multiple GARCH-type models as inputs yields better performance than by a single GARCH model. In this study, we propose a new hybrid model combining LSTM, GARCH-type models and VU strategy. In particular, to clarify the performance gains from the VU, we divide the experiments into the VU-applied cases and the native (non-VU) cases.  The rolling window method is employed for the experiments. The method continuously incorporates relatively recent observations to the model fitting and prediction, which can help the networks adapt well to the current economic situation. Fig. 3 shows the method in detail. We apply the method to outputs of GARCH-type models (as well as other explanatory variables) to make them the input of the proposed model. Based on works of [31], [36], we employ 9 models GARCH (1,1), GARCH (2,2), GARCH (3,3), EGARCH (1,1), EGARCH (2,2), EGARCH (3,3), TGARCH (1,1), TGARCH (2,2), and TGARCH (3,3) as the GARCH-type models to be tested. For the simplicity, we denote the predictions obtained from the models as G1, G2, G3, E1, E2, E3, T1, T2, and T3, respectively. We also present the residuals of all invited GARCH-type models for S&P 500 in Fig. 4.
In GET-LSTM, if we employ all predictions obtained by three GARCH models as input data, the input dimension would be different compared to other single GARCH based models. Also, the experiments have concerns about overfitting caused by a large input dimension. In order to avoid the problem, we use Bayesian information criterion (BIC) to select only three of the nine inputs. BIC is an estimator of prediction error for statistical model selection, and defined by BIC = k ln(n) − 2 ln(L) where k is the number of estimated parameters in the model,L is the maximum value of the likelihood function for the model, and n indicates the number of sample size. For a fixed window size W , we select three of the nine input candidates with the lowest BIC values, and use them as the input data of GET-LSTM.

IV. EXPERIMENTS AND RESULTS
A. EXPERIMENTS 1) SETUP As described in Section II-B, the total number of parameters in LSTM is 4(K I K H + K 2 H ) where K I and K H denote input dimension and hidden dimension (dimension of each gate), respectively. The input dimension K I for each experiment is presented in Table 2. The hidden dimension is set to be K H = 4 which has been found as the best number in between over-fitting and under-fitting conditions. The RELU function [37] defined by RELU(x) = max(x, 0) is set to be located in the output layer as an activation function. Batch size and epochs are set to be 512 and 200, respectively. Linear decreasing learning rate (from 0.005 to 0.0005 as . Residuals analysis for GARCH models. The first, second, and third columns represent residuals, the quantile-quantile plots (Q-Q plots), and the residual distributions for all invited GARCH models. epochs increase) is employed. The ratio of train, validation, test of dataset is chosen to 6:2:2. The Adam optimizer [38] is applied as an optimization tool, which combines the gradient descent with momentum [39], moving average, and RMSprop [40]. Mean square error loss function defined by 2 is used as the loss function where N , y,ŷ denote number of samples, label, predictions, respectively. We used the uniform normalization method, which assigns maximum and minimum of data to be 1 and 0. Also, the uniform parameter initialization on [-0.2, 0.2] is chosen. For the hyperparameters (hidden dimension for each gate, depth of LSTM, learning rate, epochs, and batch size) selection, we use a mixture of random search method and grid search method. The learning rate, batch size, and epochs are firstly extracted using the random search method. And then, using the grid search method, we extract the other hyperparameters whose candidates are selected near values that performed well in the random search step. All hyperparameters are selected so that the error induced from the validation dataset can be minimized.

2) ERROR METRICS
The performances are measured with four error metrics. Root mean square error (RMSE) (defined by 1 where N , y,ŷ denote number of samples, label, predictions, respectively) and correlation coefficient (CC) (defined by cov(y,ŷ)/(σ y σˆy) where cov(x, y) indicates the covariance between two variables and σ X is a standard deviation of X ) are employed as basic tools for error measurement. In addition, we adopt the area of the overlapping section of two distributions (those of the label and predicted value) as an error metric. Although a larger common section is not a necessary and sufficient condition for a better prediction, we assume that the overlapping distribution likely leads to performance  gains in this study. RMSE on the extreme region is employed as another error metric to evaluate the prediction performance of rapid changes in financial indices. The extreme region is defined by [l, ∞) where P(l < x) = 0.2. In other words, the extreme region corresponds to the rightmost 20% of the probability.

1) DETERMINATION OF WINDOW SIZE
In this study, the performances are tested using 16 values of window size W (W = 30 × i, i = 1, 2, · · · , 16) and evaluated using four error metrics. Also, to determine input data in GET-LSTM model, we investigate BIC values for each of 9 GARCH models (Table 3), and choose lowest three of them for each window size.
As mentioned in Section III-B, nine values of the hyperparameter α (α = i/10, i = 1, 2, · · · , 9) are tested to evaluate the VU strategy for each experiment. We firstly define a 'positive section' as the set of values of α at which the performance of VU is better than the native (non-VU) strategy. According to the results in the next section, the positive section accounts for about 70% of the possible α-region, that is, the interval (0,1) for each implementation model and error metric. To determine the window size, we present the prediction performances by comparing the proposed model and native experiments for each implementation model and error metric in Fig. 5.
According to Fig. 5, the performance at W = 180 (trading days) outperforms the other cases in general. This aspect means that variations of log-return for the S&P 500 dataset are more influenced by the relatively recent observation than by historical information longer than about 180 trading days. We present the best performance (for each model and error metric) and corresponding window size in Table 4. For a more detailed evaluation of the proposed model, we focus on the case of W = 180 in the next subsection.

2) ANALYSIS FOR WINDOW SIZE OF 180 TRADING DAYS
In this section, fixing the window size to be 180 trading days, we evaluate the proposed hybrid model against the native model in terms of three measurements: 1) Mean performance over all values of hyperparameter α, 2) Mean performance for α in the positive section, 3) Best performance corresponding to a specific α.
In Fig. 6, we compare the performances according to GARCH models, error metrics and hyperparameter α, 0 < α < 1. For RMSE, the performance gains of G-LSTM, E-LSTM and T-LSTM against P-LSTM are 6.9%, 7.3% and 10.8%, respectively. Furthermore, the performance gain of GET-LSTM against P-LSTM is 12.5%. This result shows that the addition of informations obtained from multiple GARCH models to the explanatory input data improves performances even more compared to the case of adding single GARCH model information. In the VU-applied case, we find that the performance gain is outstanding compared to the existing model (non-VU), the corresponding results are presented in Table 5 and Tabel 6. In general, VU-GARCH models based LSTM (VU-G-LSTM, VU-E-LSTM, VU-T-LSTM and VU-GET-LSTM) show better performance in the positive section and lower performance outside the positive section compared VOLUME 10, 2022 FIGURE 5. Intercomparison of performances according to GARCH models, error metrics, and window size (x-axis in each panel). Black indicates the performance obtained by the native strategy (Non-VU). Solid pink, dotted pink, and red denote VU-performances on the whole α-regions, on the positive section, and for a specific α value (best case), respectively. Gray shaded regions indicate the vicinity of window size 180.

TABLE 4.
Window size corresponding to the best performance for each error metric. to VU-P-LSTM ( Fig. 6 and Table 6). The performances in terms of other error metrics (CC, overlapping area, and RMSE on extreme region) show similar trends to RMSE. The performance gains of VU-GET-LSTM against other models are presented in Table 7.  implies performance gain from VU strategy is greater than the gain from GARCH model for all error metrics. In addition, it is not able to find that VU-G-LSTM, VU-E-LSTM, and VU-T-LSTM have the model's superiority. However, the proposed VU-GET-LSTM model outperforms the other models.
Besides the evaluation with respect to the overlapping area (over all region of label PDF domain), it is necessary to investigate the overlapping region near the region because most prediction and label are concentrated near zero. Fig. 8 shows comparative distribution (near zero) of P-LSTM, GET-LSTM, and VU-GET-LSTM. It can be confirmed that the distribution of prediction obtained from VU-GET-LSTM model become closer to that of label (realized volatility) compared to other native models.
The proposed hybrid models do not show better prediction performance for the hyperparameter α near zero (about 0 < α < 0.3) compared to models with native strategies. However, based on the facts that the positive sections occupy VOLUME 10, 2022   wider domain than the non-positive sections for all error metrics, we conclude that the models with VU outperform the native models. Based on these results, we now summarize and discuss our experimental results. First, the positive section occupies more than 70% of interval (0, 1) for each error metric and GARCH integrated model. Also the intersection of  Fig. 6 contains (0.3, 1). That is, excluding root type functions that increase rapidly near zero (the functions correspond to g(x) = x α in (18) for 0 < α < 0.3), the hybrid models with VU outperform the native models for most root type functions. Second, improvements in RMSE on extreme region show that VU is an appropriate strategy for predicting time-series that involve large volatilities. It is expected that VU can be well applied in regression analysis to detect abnormal events occurred in flood, financial crisis prediction, and so on. Third, the positive section in error metric CC (0.1 < α < 1 in Fig. 6) occupies the widest region compared to other error metrics. It can be estimated that VU enhances prediction performance by making similar moving trend of label and prediction. Meanwhile, it is found that VU increases the sum of all correlation coefficients between all input indices ( Fig. 9 and Table 8). However, higher correlation coefficients of input do not guarantee better performances. For example, correlation coefficients correspond to α=0.2 and α=0.6 are similar (128.7 and 128.6, respectively in Table 8), however, according to the Fig. 6, the performance at α=0.6 is better than at α=0.2 for all error metrics. It is noteworthy that α=1 in Table 8 corresponds to the native (Non-VU) case. Fourth, as α approaches to 1 from the left side, the performance of proposed algorithm is reduced to that of the native strategy. It is speculated that a filtering approach using concave-up functions (instead of convex functions) is an appropriate filtering strategy. Fifth, Table 9 shows that when the VU strategy is applied, the length of the confidence interval of the prediction is shorter (rather than non-VU case), and it is also shorter on the positive section than on all αregions. This aspect demonstrates that the results obtained on the positive section are more reliable. Sixth, in Fig. 4, it is found that input data induced by various GARCH models have outliers which usually affect model performance. However, the LSTM is used as the implementation model in this study, and the sigmoid function and the tangent hyperbolic function are located in the LSTM cell (from (12) to (17)). The functions have the squashing effect, and prevent outliers of input data from having an unduly large and disturbing effect on the learning [41].

V. CONCLUSION
The distributions of various financial data including the volatility of S&P 500 index are extremely biased to the left and are concentrated near zero. This aspect can cause low prediction performance on the right part of distribution (abnormal events) as well as on whole probability density domain. Meanwhile, incorporation of various explanatory variables with GARCH-type models has been reported to improve performance by adding reliable information to input data in neural networks. To overcome the challenge of the former with the advantage of the latter, we develop the hybrid model using VU strategy to alleviate the biased property of financial data and to obtain better performance gain. Specifically, we use the outputs of GARCH-type models as input as well as the explanatory variables and combine VU strategy. The concave function x α , 0 < α < 1 is employed in VU strategy, and it plays a crucial role in mitigating biased property of volatility.
In general, it is known that the current S&P500 index is highly affected by relatively recent observations rather than historical data far from the present time. Thus, we adopt the rolling window method to obtain input of sequential structure. From the proposed model, we obtain several meaningful results. Firstly, performances at a specific window size (180 trading days in this experiment) are superior compared to performances at other window size for each error metric. Secondly, performance gains obtained from the proposed model are conspicuous compared to the gains obtained from GARCH based models as well as the native strategy. Especially, performance gains are noteworthy when the hyperparameter α is restricted on the positive section that occupies more than 70% of the interval (0, 1). Thirdly, the proposed model enhances performance on the right region as well as other region of probability density. This shows that the proposed algorithm is an appropriate approach to predict abnormal events in financial market. Finally, we expect that modifications of the concave function will improve prediction performance further, and this will be considered in the future work.