Explainable Time-Series Prediction Using a Residual Network and Gradient-Based Methods

Researchers are employing deep learning (DL) in many fields, and the scope of its application is expanding. However, because understanding the rationale and validity of DL decisions is difficult, a DL model is often called a black-box model. Here, we focus on a DL-based explainable time-series prediction model. We propose a model based on long short-term memory (LSTM) followed by a convolutional neural network (CNN) with a residual connection, referred to as the LSTM-resCNN. In comparison with one-dimensional CNN, bidirectional LSTM, CNN-LSTM, LSTM-CNN, and MTEX-CNN models, the proposed LSTM-resCNN performs best on the three datasets of fine dust (PM2.5), bike-sharing, and Bitcoin. Additionally, we tested three gradient-based approaches for model explainability: Grad-CAM, Integrated Gradients, and Gradients. These gradient-based techniques combine well with the LSTM-resCNN model. Variables and time lags that considerably influence the prediction can be identified and visualized using Gradients and Integrated Gradients.


I. INTRODUCTION
Researchers are employing time-series analysis in various fields such as the prediction of energy [1], [2], stock prices [3], [4], and temperature [5]. Nowadays, the timely prediction of the number of COVID-19 patients is also gaining importance [6], [7]. Conventional time-series methods, such as the auto-regressive integrated moving average (ARIMA), are built on stochastic assumptions [8]. However, such stochastic assumptions are inapplicable to large, real-world time-series data. Based on six forecasting competitions from Kaggle, ensemble models using cross-learning tend to outperform traditional time-series models, and both machine learning (ML) and deep learning (DL) models showed promising performance [9]. ML and DL models can overcome the shortcomings of conventional models because of their additional degrees of freedom with more estimable parameters. In particular, DL models employ numerous parameters, and such models require extensive data to compensate for those parameters and achieve enhanced performance. However, DL models also show disadvantages in terms of explainability: they are considered black-box models because elucidating their predictions is challenging. (The associate editor coordinating the review of this manuscript and approving it for publication was Tingwen Huang.)
ML methods have long been studied for addressing challenging time-series problems [1], [10]. Decision tree regression (DTR) and support vector regression (SVR) are ML algorithms that can be applied to time-series analysis [11]. A decision tree is a supervised learning model that classifies data based on a set of decision rules and can perform both classification and regression. The support vector machine is a supervised learning model primarily used for classification tasks; SVR extends it to regression problems based on kernel functions and is actively studied for time-series prediction [12]. The SVR model achieved excellent prediction performance among various ML methods on KEPCO data.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

In a recurrent neural network (RNN), past information is stored in the hidden layer. However, RNN models suffer from a long-term dependency problem in which long-term memory cannot be transferred to deep layers [13]. The long short-term memory (LSTM) model [14] modifies the RNN structure to improve the learning of long-term dependence: information from the previous stage is stored in a memory cell, enabling the transfer of long-term past information. Furthermore, bidirectional LSTM (BiLSTM) is an extension of the conventional LSTM model that additionally considers the reverse direction. Because two directions are considered in BiLSTM, the number of parameters is twice that of unidirectional LSTM. Gated recurrent units (GRUs) [15] simplify the LSTM structure and offer the advantages of faster learning and fewer parameters compared with LSTM.
However, GRU-based models and BiLSTM do not always achieve better performance than LSTM; hence, an appropriate model must be selected based on the requirements.
Similar to RNN models, convolutional neural network (CNN) models can handle time-series data. Typically, a one-dimensional (1D) CNN extracts features by moving along only the time axis, unlike two-dimensional (2D) CNNs, which extract features by moving along two directions using a square filter. The dilated causal convolution of WaveNet [16] combines causal and dilated convolutions; with dilation, the model can achieve a wide receptive field.
In a CNN followed by an RNN (CNN-RNN) model, the strengths of an RNN and a CNN are combined. CNN-RNN models have been employed widely because CNN simplifies the feature space, and the long-term dependency is modeled using LSTM or a GRU [17], [18], [19]. In an RNN followed by a CNN (RNN-CNN) model, an RNN and a CNN are combined in the opposite direction [20], [21], [22].
Deep learning models use numerous parameters and are challenging to analyze. They are therefore called ''black-box models,'' and many studies on interpretable AI attempt to add explanatory power. Integrated Gradients (IG) [23] accumulates gradient values along a path from a baseline to the given input to identify significant variables. Grad-CAM [24] also utilizes gradients to produce class activation maps. MTEX-CNN [25] was suggested for time-series analysis, and Grad-CAM was applied to the 1D CNN layers of the MTEX-CNN model. However, the MTEX-CNN design uses only CNN layers, and CNN layers cannot model long-term dependencies.
In this study, we propose LSTM-resCNN, an LSTM-CNN with an additional residual connection on the 1D CNN layers. We included the residual connection to ensure that no critical information is lost across the three consecutive 1D convolution layers. To add explanatory power, we experimented with the IG and Grad-CAM methods. The proposed LSTM-resCNN achieves better prediction performance than the 1D CNN, BiLSTM, LSTM-CNN, and CNN-LSTM models on three datasets. Moreover, applying Grad-CAM to LSTM-resCNN adds explanatory power, as the Grad-CAM activation visualizes temporal features that significantly influence the prediction. The proposed LSTM-resCNN achieves high predictive performance with additional interpretability on three datasets: fine dust (PM2.5), bike-sharing, and Bitcoin. The key contributions of this study are summarized below.
• The Grad-CAM endows the proposed LSTM-resCNN with gradient-based explainability. We can identify time zones that play an essential role in prediction.
• To determine important variables in predictions, Gradients and Integrated Gradients are employed.
Related studies are introduced in Section II. The proposed model is established in Section III. In Section IV, the proposed model is tested on three different datasets, and the proposed LSTM-resCNN is compared against other available techniques. Finally, the paper ends with the discussion and conclusion (Sections V and VI).

II. RELATED WORKS
A. HYBRID MODEL FOR FORECASTING
Many studies have shown that CNN-LSTM, which combines a CNN layer followed by an LSTM layer, achieves better prediction performance than single CNN or LSTM models [17], [18], [19]. Using PM2.5 data for multistep prediction experiments, CNN-LSTM showed better prediction performance than LSTM [17]. In gold price prediction, CNN-LSTM outperformed support vector regression (SVR) and LSTM in terms of prediction accuracy [18]. In predicting household energy consumption, CNN-LSTM achieved the best prediction accuracy among the ML techniques of linear regression, DTR, random forest regression, and multilayer perceptron [19]. In COVID-19 hotspot prediction, CNN-LSTM outperformed SVR, ARIMA, and LSTM [26]; compared with DL techniques such as LSTM, GRU, BiLSTM, and attention-LSTM, CNN-LSTM also achieved the best prediction accuracy. In the prediction of solar power generation, an LSTM-CNN model was used to extract temporal and spatial feature information from the data; in a comparison with LSTM, CNN, CNN-LSTM, and LSTM-CNN, LSTM-CNN had the best predictive performance and required less training time than CNN-LSTM [21]. Moreover, based on this finding, research on gold price prediction revealed that the LSTM-CNN model has an advantage over the CNN-LSTM model (in which a CNN layer is applied as the first layer) [20]. This is because LSTM, the first layer of the LSTM-CNN model, preserves the order of gold prices, and the subsequent CNN layer extracts local features from the LSTM output. In the CNN-LSTM model, the CNN layer can confuse the order of gold prices to some extent when extracting patterns and thus cannot provide the advantage of LSTM when disorganized data are provided as inputs.

B. TIME-SERIES INTERPRETABILITY WITH GRAD-CAM
In MTEX-CNN, a two-stage model structure is used to explain both time dimensions and variables [25]. In stage 1, a 2D CNN is employed. To determine the influence of variables on each prediction, an asymmetric CNN is applied with a kernel size of k × 1. In stage 2, a 1D CNN is used, which considers only the time axis. Thus, the interaction between time and predictor can be explained by focusing on the temporal aspect of the data. Hence, MTEX-CNN can provide spatiotemporal explanations for multivariate time-series data based on the combination of Grad-CAM outputs.

III. METHODOLOGY
In this section, the proposed LSTM-resCNN model is explained. Further, other DL models used for experimental evaluations and comparisons are discussed.

A. OVERVIEW
The LSTM-resCNN model is a time-series prediction model that integrates an LSTM and a 1D CNN with a skip connection. It offers the advantage of model explainability through the application of Grad-CAM to the last CNN layer. Fig. 1 presents the structure of the LSTM-resCNN model. The model's input is a fixed-size time-series sequence, and the output is a future value (one-step prediction) or a set of future values (multi-step prediction). The temporal and spatial features are extracted by the LSTM module and the two CNN blocks, respectively. The advantage of this architecture is that the Grad-CAM method can be applied directly to the last 1D CNN layer, identifying the time zones that are instrumental in the predictions. Moreover, the variables and time points that significantly impact the predictions can be determined using the Gradients (G) or IG techniques.

B. LSTM
Existing RNNs show considerably degraded learning ability for long-range relationships due to the gradual reduction of gradients during backpropagation. This phenomenon is called the vanishing gradient problem, and LSTM is designed to overcome it [14]. LSTM can learn unknown and persistent patterns between important events in time-series data.
The LSTM module comprises a cell (c_t), an input gate (i_t), an output gate (o_t), and a forget gate (f_t). The mathematical expression and graphical explanation of the LSTM module are given in the following and Fig. 2, respectively:

f_t = σ(W_xf^T x_t + W_hf^T h_{t−1} + b_f)
i_t = σ(W_xi^T x_t + W_hi^T h_{t−1} + b_i)
o_t = σ(W_xo^T x_t + W_ho^T h_{t−1} + b_o)
g_t = tanh(W_xg^T x_t + W_hg^T h_{t−1} + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ denotes the sigmoid function, ⊙ the element-wise product, x_t an input vector, h_t an output vector, and W_xf^T, W_hf^T, W_xi^T, W_hi^T, W_xo^T, W_ho^T, W_xg^T, and W_hg^T (with the biases b_f, b_i, b_o, and b_g) are the estimable parameters of the LSTM module.
The cell (c_t) memorizes values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. The forget gate, controlled by f_t, determines which part of the long-term state c_t is deleted. The input gate, controlled by i_t, determines which part of g_t is added to the long-term state c_t. The output gate o_t determines which part of the long-term state c_t is read and output to h_t and y_t. The three gates, called gate controllers, are based on the sigmoid activation function; a gate closes when its controller outputs 0 and opens when it outputs 1.
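As a concrete reference, the gate computations above can be sketched in a few lines of NumPy. The weight names mirror those in the text, while the dimensions, biases, and random values are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps gate names to weight matrices; b maps to biases."""
    f_t = sigmoid(W["xf"].T @ x_t + W["hf"].T @ h_prev + b["f"])  # forget gate
    i_t = sigmoid(W["xi"].T @ x_t + W["hi"].T @ h_prev + b["i"])  # input gate
    o_t = sigmoid(W["xo"].T @ x_t + W["ho"].T @ h_prev + b["o"])  # output gate
    g_t = np.tanh(W["xg"].T @ x_t + W["hg"].T @ h_prev + b["g"])  # candidate
    c_t = f_t * c_prev + i_t * g_t      # update the long-term (cell) state
    h_t = o_t * np.tanh(c_t)            # short-term output
    return h_t, c_t

# Toy dimensions: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) if k.startswith("x") else rng.normal(size=(4, 4))
     for k in ["xf", "hf", "xi", "hi", "xo", "ho", "xg", "hg"]}
b = {k: np.zeros(4) for k in "fiog"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
```

Because h_t = o_t ⊙ tanh(c_t) with o_t ∈ (0, 1), every entry of the output vector is bounded in (−1, 1) regardless of the cell state.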

C. 1D-CNN
The most commonly known CNN is the 2D CNN, mainly employed in image processing. Alternatively, a 1D CNN is used in natural language processing and time-series analysis; it extracts features by moving the kernel along the time axis [15]. The 1D CNN layer extracts features inherent in the data, and the number of filters determines the number of traits to be learned, which corresponds to the dimension of the output space.
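A minimal sketch of what a single 1D convolution filter computes on a multivariate series (illustrative names; "valid" convolution without padding):

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1D convolution (cross-correlation) along the time axis.
    x: (T, J) multivariate series; kernel: (k, J) one filter."""
    T, J = x.shape
    k = kernel.shape[0]
    return np.array([np.sum(x[t:t + k] * kernel) for t in range(T - k + 1)])

x = np.arange(10, dtype=float).reshape(10, 1)   # simple ramp, one feature
edge = conv1d(x, np.array([[1.0], [-1.0]]))     # kernel size 2: a difference filter
```

Sliding a kernel of size k over T time steps yields T − k + 1 outputs; on the ramp above, the difference filter returns a constant −1 at every position, illustrating how one filter learns one local temporal trait.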

D. PROPOSED LSTM-resCNN WITH GRADIENT-BASED APPROACHES
In this paper, we propose LSTM-resCNN, an LSTM-CNN with an additional residual connection on the 1D CNN layers. We chose the LSTM-CNN order rather than CNN-LSTM to place the convolution layer, to which Grad-CAM can be applied, in the latter part of the model, where it represents more complex features. We included a residual connection to avoid losing important information while applying three successive 1D convolution layers [27]. Moreover, residual networks can be viewed as a collection of many paths of different lengths, and these paths exhibit ensemble-like behavior [28].

1) MODEL DESIGN
The proposed LSTM-resCNN model is described in Table 1 and Fig. 1. Each CNN block comprises three 1D CNN layers with 64 filters each and kernel sizes of 3, 2, and 1. The filter size of 64 was selected from among 32, 64, and 128 based on a hyperparameter search. A residual connection is added to each CNN block.
After the two CNN blocks, a flatten layer converts the features into one-dimensional vectors. Next, a dropout layer is used to prevent overfitting. Finally, the result is transferred to the dense layer, whose output size is 1 for the one-step model and 24 for the multi-step model (predictions of 1 and 24 h, respectively).
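As a rough sketch of one CNN block with its residual connection, the following pure-NumPy code mirrors the structure just described (kernel sizes 3, 2, 1 and a skip connection); the shapes, random untrained weights, and function names are illustrative, not the paper's Keras implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_same(x, kernels):
    """'Same'-padded 1D convolution. x: (T, C_in); kernels: (k, C_in, C_out)."""
    k, c_in, c_out = kernels.shape
    pad_l = (k - 1) // 2
    xp = np.pad(x, ((pad_l, k - 1 - pad_l), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, c_out))
    for t in range(T):
        # Sum over kernel position and input channel for each output channel.
        out[t] = np.tensordot(xp[t:t + k], kernels, axes=([0, 1], [0, 1]))
    return out

def res_cnn_block(x, kernel_sizes=(3, 2, 1), channels=4):
    """Three stacked 1D convolutions plus a residual (skip) connection."""
    h = x
    for k in kernel_sizes:
        W = rng.normal(scale=0.1, size=(k, channels, channels))
        h = np.maximum(conv1d_same(h, W), 0.0)   # ReLU activation
    return h + x                                  # residual connection

x = rng.normal(size=(24, 4))        # e.g. 24 time steps, 4 channels from the LSTM
y = res_cnn_block(x)
```

Because "same" padding preserves the temporal length and the channel count is kept constant, the block output has the input's shape, which is what makes the element-wise residual addition possible.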

2) GRAD-CAM APPLICATION TO LSTM-resCNN MODEL
Another advantage of using the RNN-CNN structure instead of the CNN-RNN structure is the application of Grad-CAM. The Grad-CAM technique was proposed to visualize the parts of the input that the CNN model relies on [24]. Thus, we use the activation gradient of the last CNN layer, which models abstract and complex high-level features [29]. Structurally, the CNN layers in the RNN-CNN structure are deeper than those in the CNN-RNN structure. This implies that we can visualize the parts related to important high-level features for time-series predictions.
The proposed LSTM-resCNN model is defined as f(X), where X denotes the time-series matrix with T × J inputs (T and J represent the time window size and the number of features, respectively). The result of the model, y = f(X), is the value predicted from the input time-series matrix X. To apply Grad-CAM to a specific CNN layer, we decompose the LSTM-resCNN model f(·) into h(·) and g(·), f(X) = h(g(X)), where A = g(X) is the 2D activation matrix of the chosen convolution layer. The activation A has temporal and filter dimensions. Let A_i^k denote the activation value of the i-th temporal component and the k-th filter. Each filter can capture complex time-series patterns, and the gradient ∂y/∂A_i^k quantifies the importance of the given position. The overall importance weight q_k for the k-th filter is obtained by averaging these gradients over the temporal dimension:

q_k = (1/T′) Σ_i ∂y/∂A_i^k,

where T′ denotes the temporal length of the activation. Taking q_k as the weight for the k-th channel activation matrix (A^k), we estimate the Grad-CAM as:

L_1D-Grad-CAM = ReLU(Σ_k q_k A^k),

where ReLU denotes the rectified linear unit and A^k represents the k-th channel activation from the activation tensor A. L_1D-Grad-CAM denotes the weighted average activation of the activated parts from each filter. Thus, plotting L_1D-Grad-CAM as a function of the input time series x reveals the parts important for the prediction. Herein, the specific CNN layer is the activated last 1D CNN layer for the LSTM-resCNN, LSTM-CNN, and CNN-LSTM models. Note that Grad-CAM is applied to a deeper layer in the LSTM-resCNN and LSTM-CNN models, which embeds more complex features.
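Given the activation matrix A of the last convolution layer and the gradient of the prediction with respect to A (both obtainable from any autodiff framework), the weighting and ReLU steps reduce to a few NumPy lines. Random arrays stand in for real activations and gradients here; all names are illustrative:

```python
import numpy as np

def grad_cam_1d(A, dy_dA):
    """1D Grad-CAM. A: (T', K) activations of the last conv layer;
    dy_dA: gradient of the prediction w.r.t. A, same shape."""
    q = dy_dA.mean(axis=0)            # q_k: gradient averaged over time
    cam = np.maximum(A @ q, 0.0)      # ReLU(sum_k q_k * A^k), length T'
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1] for plotting
    return cam

rng = np.random.default_rng(2)
A = rng.random((24, 64))              # 24 temporal positions, 64 filters
cam = grad_cam_1d(A, rng.normal(size=(24, 64)))
```

The result is one importance value per temporal position, which is exactly why Grad-CAM alone cannot resolve feature-level importance (addressed by G and IG below).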

3) IG AND G FOR LSTM-resCNN MODEL
A disadvantage of Grad-CAM (L_1D-Grad-CAM) is that it can only visualize temporal importance. To obtain feature-level importance, a naive approach is to estimate ∂y/∂X. Because the input matrix X includes both temporal and feature dimensions, we can quantify both temporal- and feature-level importance by visualizing G.

G = ∂y/∂X
A large |∂y/∂X| indicates high importance. However, this naive gradient breaks the sensitivity axiom [23]. Sundararajan et al. [23] proposed IG to overcome this drawback of the G method. We employed the integrated gradients technique to visually explain the feature-level importance of time-series models such as CNN-LSTM, LSTM-CNN, and LSTM-resCNN. Let X be the time-series matrix with T × J inputs and X′ be the baseline input matrix. The IG for time-series data can be defined as

IG_ij(X) = (X_ij − X′_ij) × ∫_0^1 [∂f(X′ + α(X − X′))/∂X_ij] dα.

Computationally, it can be approximated as

IG_ij(X) ≈ (X_ij − X′_ij) × (1/m) Σ_{l=1}^{m} ∂f(X′ + (l/m)(X − X′))/∂X_ij,

where m denotes the number of steps in the Riemann approximation of the integral. Note that the resulting IG matrix is a T × J matrix, similar to the input time-series matrix.
We can approximately estimate the feature- and temporal-level importance using the IG matrix.
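The Riemann approximation above translates into a short NumPy routine. The sketch below uses a toy linear model whose gradient is known in closed form, so the completeness property of IG (the entries sum to f(X) − f(X′)) can be checked exactly; all names are illustrative:

```python
import numpy as np

def integrated_gradients(grad_f, X, X_base, m=50):
    """Riemann approximation of IG: (X - X') times the average gradient of f
    along the straight path from the baseline X' to the input X (m steps)."""
    total = np.zeros_like(X)
    for l in range(1, m + 1):
        total += grad_f(X_base + (l / m) * (X - X_base))
    return (X - X_base) * total / m

# Toy linear 'model' y = sum(W * X): its gradient is the constant matrix W.
rng = np.random.default_rng(3)
W = rng.normal(size=(24, 3))                 # T = 24 time steps, J = 3 features
f = lambda X: np.sum(W * X)
X, X_base = rng.normal(size=(24, 3)), np.zeros((24, 3))
ig = integrated_gradients(lambda X: W, X, X_base)
```

The output is a T × J matrix like the input, so each entry can be read as the importance of one feature at one time point; for a real model, grad_f would be supplied by the framework's automatic differentiation.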

IV. EXPERIMENTS
A. DATASET
1) PM2.5
The PM2.5 dataset [30] is used herein to predict the fine dust concentration in Beijing using seven variables: dew point, temperature, pressure, wind direction, wind speed, snow cover, and precipitation. The variables were recorded every hour from January 1, 2010 to December 31, 2014. The total data length is 43,800, in which missing values are replaced by zero. The training, validation, and test data are divided at a ratio of 8:1:1 [31]. The test data are used only for model performance evaluation.

2) BIKE-SHARING
A bike-sharing dataset comprising data collected from Seoul, Korea [32], [33] is employed to predict the bike-sharing demand in Seoul. Since 2017, bike sharing has been increasing in Seoul because of convenient accessibility and near-zero waiting time. For a stable bike-sharing supply, the supply requirement must be predicted for each hour. The hourly usage data are recorded from December 1, 2017 to November 30, 2018, and the total data length is 8,760. The temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall are used as explanatory variables. Missing values are replaced with the data of the same time one week earlier, and the demand for bike sharing is the dependent variable. The training, validation, and test data are divided at a ratio of 8:1:1.

3) BITCOIN
The Bitcoin dataset [34], [35] is used to predict its closing price. Bitcoin has been garnering considerable attention, and the number of people interested in Bitcoin investment is increasing; consequently, attempts to predict Bitcoin prices are also increasing. The Bitcoin dataset comprises hourly data from December 12, 2019 to February 11, 2021, and the total data length is 10,005. The Bitcoin closing price is the dependent variable. The opening, high, and low prices as well as volume-to and volume-from are explanatory variables. Volume-to refers to the volume of the currency being traded, and volume-from indicates the volume of the base currency being traded from. The training, validation, and test data are divided at an 8:1:1 ratio.

B. PREPROCESSING
In DL, if the difference in the data scales of features is large, the model yields poor performance. Herein, min-max normalization to the range [0, 1] is performed using statistics from the training data.
Thereafter, the data are reshaped by cutting them based on the length of time to be predicted. In one-step prediction, the next hour is predicted using the data of the previous 24 h (Fig. 6(a)). In multistep prediction, multiple time points are simultaneously predicted from the previous data; the next day is predicted using the data of the previous 168 h [36], i.e., one week of data (Fig. 6(b)).
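The two preprocessing steps can be illustrated with a minimal NumPy sketch (the function names and toy series are ours, not the paper's code): min-max scaling fitted on training data, followed by one-step window construction with a 24-h lookback:

```python
import numpy as np

def minmax_fit_transform(train):
    """Min-max normalization to [0, 1]; statistics come from the training data."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (train - lo) / scale, (lo, scale)

def one_step_windows(series, window=24):
    """Inputs: the previous `window` hours; target: the next hour."""
    X = np.array([series[t:t + window] for t in range(len(series) - window)])
    y = series[window:]
    return X, y

data = np.sin(np.linspace(0, 20, 200))        # toy hourly series
scaled, _ = minmax_fit_transform(data)
X, y = one_step_windows(scaled, window=24)
```

In practice the (lo, scale) statistics returned by the fit would also be applied unchanged to the validation and test splits, so that no test information leaks into training.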

C. MODELS
We compare the 1D CNN, BiLSTM, LSTM-CNN, CNN-LSTM, and MTEX-CNN models with the proposed LSTM-resCNN model.

1) 1D CNN
Fig. 3(a) presents the architecture of the studied 1D CNN model. The input sequence is transferred to six sequential 1D CNN layers with a filter size of 64 and kernel sizes of 3, 2, 1, 3, 2, and 1. Then, 1D max-pooling with a size of two is applied, followed by a flatten layer. One output dense layer is applied for the one-step (with a filter size of 1) or multi-step (with a filter size of 24) predictions.

2) BiLSTM
In BiLSTM, an additional LSTM layer is introduced that processes the sequence in the direction opposite to that of the existing LSTM layer. The final hidden state is a vector concatenating the hidden states of the two LSTM layers; accordingly, the number of parameters is twice that of unidirectional LSTM. Fig. 3(b) presents the BiLSTM architecture used herein. The input sequence is transferred to one BiLSTM layer with a filter size of 64. Then, an output dense layer is applied for the one-step (filter size of 1) or multistep (filter size of 24) predictions.

3) MTEX-CNN
The MTEX-CNN model is taken from Assaf et al. [25] and shown in Fig. 3(c). The input sequence passes through three sequential 2D CNN layers with filter sizes of 16, 32, and 1 and kernel sizes of (8,1), (4,1), and 1. After that, a 1D CNN with a filter size of 64 and a kernel size of 3 is applied, followed by a flatten layer. We used one output dense layer for the one-step (filter size of 1) or multi-step (filter size of 24) predictions.

4) CNN-LSTM
Fig. 4(d) shows the CNN-LSTM architecture used herein. The input sequence is transferred to six sequential 1D CNN layers with a filter size of 64 and kernel sizes of 3, 2, 1, 3, 2, and 1. Then, 1D max-pooling with a size of two is applied, and the result is transferred to one LSTM layer with a filter size of 64. One output dense layer is applied for the one-step (filter size of 1) or multistep (filter size of 24) predictions.

5) LSTM-CNN
Fig. 4(e) shows the LSTM-CNN architecture used herein. The input sequence is transferred to one LSTM layer with a filter size of 64 and six sequential 1D CNN layers with a filter size of 64 and kernel sizes of 3, 2, 1, 3, 2, and 1. Then, a flatten layer and one dense layer are added to obtain the results.

D. METRIC
The metrics of root-mean-square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used for evaluating prediction performance; these metrics are commonly used for analyzing the difference between predicted and actual values. When the values of the three metrics are small, the difference between the predicted and actual values is small, indicating enhanced prediction performance. A dropout layer with a rate of 0.5 is applied before the last dense layer in all models. We train for 100 and 30 epochs for the one-step and multistep predictions, respectively. The Adam optimizer [37] is used.
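The three metrics can be written directly in NumPy (the function names and toy values below are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes y_true contains no zeros."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])
```

Note that RMSE penalizes large errors more heavily than MAE (squaring before averaging), while MAPE is scale-free and therefore comparable across datasets but undefined when actual values are zero.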

1) ONE-STEP PREDICTION
A commonly known prediction technique is one-step prediction [36], in which the next time point is predicted. To quantitatively evaluate the performance of the proposed LSTM-resCNN model, its prediction performance was compared with those of the BiLSTM, 1D CNN, CNN-LSTM, LSTM-CNN, and MTEX-CNN models. Furthermore, the same number of parameters is used in each model for a fair performance comparison. Table 2 presents the training time, evaluation time, and number of parameters used in each model. The difference between the actual and predicted values for each model is measured on the testing data, and the results are shown in Table 3.

2) MULTI-STEP PREDICTION
Multistep prediction is useful because it can be used to establish countermeasures based on predictions for the next day, month, or year. Among the various methods available for multistep prediction, the recursive, direct, and multi-input multi-output (MIMO) methods [17], [36], [38] are most commonly used. In the recursive method, each one-step prediction is fed back into the inputs for the next step. In the direct method, multiple time points are predicted using independent models for each step. MIMO is a single model that takes multiple inputs and predicts multiple outputs at once. The MIMO method is adopted in this study because it prevents the accumulation of prediction errors. Fig. 7 shows a comparison of the actual and predicted values for 24-h data using the proposed LSTM-resCNN model. The black line represents the input data for predicting the next 24-h data. Red and blue points indicate the actual and predicted values, respectively. The bike-sharing dataset shows periodic weekly patterns. CNN-LSTM failed to capture such peaks; however, LSTM-CNN and LSTM-resCNN could capture them, with the LSTM-resCNN model outperforming the LSTM-CNN model. The PM2.5 and Bitcoin datasets do not show periodic patterns; here too, LSTM-resCNN shows better performance than CNN-LSTM and LSTM-CNN.
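Under the MIMO framing, input/output pairs can be constructed as in the following sketch. The 168-h history and 24-h horizon follow the paper's setup; the toy series and function name are illustrative:

```python
import numpy as np

def mimo_windows(series, history=168, horizon=24):
    """MIMO framing: one week (168 h) of inputs predicts the next 24 h at once,
    so prediction errors do not accumulate as in the recursive strategy."""
    X, Y = [], []
    for t in range(len(series) - history - horizon + 1):
        X.append(series[t:t + history])
        Y.append(series[t + history:t + history + horizon])
    return np.array(X), np.array(Y)

series = np.arange(400, dtype=float)   # toy hourly series
X, Y = mimo_windows(series)
```

Each target row Y[t] holds the 24 values immediately following the corresponding 168-h input window X[t], matching a dense output layer of size 24.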
For the quantitative evaluation of the multi-step prediction models, 15 non-overlapping random time points are selected. Table 4 shows the average MAE, RMSE, and MAPE values with standard deviations for each model. The best-performing values are indicated in bold. The proposed LSTM-resCNN model showed the best performance in all cases except for the MAPE value on the bike-sharing dataset; however, no statistically significant differences were observed between the proposed and second-best-performing models (two-sample t-test with Bonferroni adjustment; all p values > 0.1).

G. INTERPRETABILITY USING GRAD-CAM
1) GRAD-CAM
To evaluate the performance of Grad-CAM with LSTM-resCNN, we considered four one-step prediction models: MTEX-CNN, CNN-LSTM, LSTM-CNN, and LSTM-resCNN. Grad-CAM is applied to the last CNN layer of each of these models. We extracted the Grad-CAM activation features and normalized them to values between 0 and 1. Fig. 8 presents the Grad-CAM activation of randomly selected periods on the bike-sharing, PM2.5, and Bitcoin datasets using the MTEX-CNN, CNN-LSTM, LSTM-CNN, and LSTM-resCNN models. Activation values closer to 1 are represented by lighter colors and imply more influence on the prediction. The results obtained using MTEX-CNN, CNN-LSTM, LSTM-CNN, and LSTM-resCNN are presented in the first, second, third, and fourth columns, respectively; the results for the bike-sharing, PM2.5, and Bitcoin datasets are presented in the first, second, and third rows, respectively. For the bike-sharing dataset, note that the model predicts the 24-h data points (predictions for a day) using weekly data. In the bike-sharing data, the first five similar peak patterns correspond to Monday-Friday and the following two patterns to Saturday and Sunday. The data for Monday-Friday exhibited more similar patterns than those for Saturday and Sunday. The two weekday peaks were estimated to reflect bike commuting: going to work and going back home. The Grad-CAM visualization of the CNN-LSTM model yields two highly activated positions; however, elucidating the reason for these activations is difficult. Alternatively, the LSTM-CNN and LSTM-resCNN models focus more on weekday information than on weekend information, which is more reasonable because the patterns for the next Monday are being predicted. In particular, the LSTM-resCNN model focuses more on Monday data than the MTEX-CNN, CNN-LSTM, and LSTM-CNN models.
It is reasonable to concentrate more on Monday's information when predicting a future Monday. On the PM2.5 dataset, CNN-LSTM focuses on past peak information, which may not be reasonable. The LSTM-CNN and LSTM-resCNN models focus on plateau areas; in particular, the LSTM-resCNN model focuses on the large plateau areas for the next multistep prediction. LSTM-resCNN also focuses on the earliest areas; there must be a reason, but the model does not explicitly explain it, so some guessing is required. On the Bitcoin dataset, all models focus strongly on the most recent periods for the next one-step prediction. This is reasonable because the Bitcoin price data may not show any striking pattern, and the most recent data carry the most information for predicting future values.

2) VISUALIZING FEATURE-LEVEL IMPORTANCE
One disadvantage of Grad-CAM is that the result can only visualize temporal-level importance. The feature-level importance can be visualized using G and IG. Figs. 9(a) and (b) show the G and IG activations of the periods previously selected in Fig. 8 from the bike-sharing, PM2.5, and Bitcoin datasets using the CNN-LSTM, LSTM-CNN, and LSTM-resCNN models. Activation values closer to 1 are indicated by darker colors than those closer to 0, implying more influence on the prediction. The results obtained using CNN-LSTM, LSTM-CNN, and LSTM-resCNN are depicted in the first, second, and third columns, respectively; the results for the bike-sharing, PM2.5, and Bitcoin datasets are presented in the first, second, and third rows, respectively. Nine variables of the bike-sharing data are used: rented bike count, temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall. The most crucial variable found by all three models with G and IG for predicting the rented bike count is the rented bike count itself from the most recent past, which is logical. The CNN-LSTM model was most activated on Sunday, whereas the LSTM-CNN and LSTM-resCNN models were activated throughout the week. On Monday, the LSTM-resCNN model with G achieved more activation than the CNN-LSTM and LSTM-CNN models; this is logical because the model predicts the data points of the following Monday. Eight variables of the PM2.5 data are employed: PM2.5, dew point, temperature, pressure, wind direction, wind speed, snow cover, and precipitation. For the prediction of PM2.5, the most important variable found by all three models with G and IG was the PM2.5 value from the most recent past. Six variables of the Bitcoin data are employed: closing price, opening price, high price, low price, volume-to, and volume-from.
On the Bitcoin dataset, the most important variables found using the three models with the G and IG are the closing, opening, high, and low prices from the most recent past. This is logical because the target variable (close price) is highly correlated with the previous day's closing, opening, high, and low prices.

V. DISCUSSION
In this study, explainability is introduced into the model using the principle of Grad-CAM. In a previous study, Grad-CAM was also applied to a time-series prediction model, MTEX-CNN [25]. However, because Grad-CAM can be applied only to CNN layers, MTEX-CNN comprises only CNN layers, and its long-term dependencies are poorly modeled. We attempted to reflect the long-term dependency of LSTM by applying Grad-CAM to the LSTM-CNN architecture. Furthermore, we experimented with adding a residual connection to the 1D CNN layers. The LSTM-resCNN model achieved slightly better performance and visualization results than the LSTM-CNN model on the three datasets.

VI. CONCLUSION
Deep learning is widely used for time-series problems. However, studies on explainable DL methods in the time-series field are lacking. In this paper, we propose an explainable time-series prediction DL model, LSTM-resCNN, with a residual connection added to the LSTM-CNN model. The proposed model showed better performance than the one-dimensional CNN, bidirectional LSTM, CNN-LSTM, LSTM-CNN, and MTEX-CNN models on the three different datasets of fine dust (PM2.5), bike-sharing, and Bitcoin. The LSTM-resCNN model achieved the best performance on six of nine evaluation criteria in the one-step prediction scenarios and on eight of nine evaluation criteria in the multi-step prediction scenarios. Furthermore, we added explanatory power to the proposed LSTM-resCNN model by applying gradient-based techniques: Grad-CAM, Integrated Gradients, and Gradients. The proposed LSTM-resCNN model worked well with these gradient-based approaches on the three experimental datasets.
As a future study, an interpretable AI model incorporating the Transformer [39] would be interesting. Due to its strong performance, the Transformer model is garnering interest in a number of domains, including language [40], speech [41], and vision [42]. In time-series studies, attempts to include a Transformer are also growing [43]. Conformer [44] is a model with a 1D convolution added before the multi-head attention layer of the Transformer encoder. Experiments such as applying Grad-CAM to the 1D CNN layer or placing the 1D CNN layer further back would be an interesting future study.