A Convolutional Transformer Model for Multivariate Time Series Prediction

This paper presents a multivariate time series prediction framework based on a transformer model that incorporates convolutional neural networks (CNNs). The proposed model extracts temporal features of the input data through CNNs and interprets correlations between variables through an attention mechanism. This framework addresses a limitation of the forecasting models presented in existing studies: the inability to simultaneously analyze the temporal features of the input data and the correlations between its variables. To evaluate the proposed model precisely, we designed forecasting experiments using several time series datasets with diverse characteristics. In addition, we performed comparative experiments between the proposed model and several predictive models from recent studies. Furthermore, we conducted ablation studies that substitute specific layers of the model to measure how much the proposed CNN structure affects the forecasting results. The experiments showed that the proposed model performs well in predicting time series data with a clear cycle and high correlation between variables, improving accuracy by approximately 3% to 5% over the time series prediction models of previous studies.


I. INTRODUCTION
Time series forecasting is used in numerous applications such as manufacturing [1], the medical field [2], tourism [3], and transportation [4]. Therefore, it is crucial to design a framework that efficiently and accurately forecasts future variables.

With the development of AI-based forecasting methods, recent studies have presented time series prediction frameworks built on deep learning models. Research on designing time series predictive models using recursive models such as recurrent neural networks (RNNs) [5], long short-term memory (LSTM) [6], and gated recurrent units (GRUs) [7] is predominant. Recent studies have proposed the transformer model [8] for time series prediction tasks to solve the long-term dependency problem and the parallel-operation limitations of recursive models. These deep learning models are superior to conventional regression models and machine learning methods.

Although deep learning-based models show good prediction results, there are structural limitations to applying them directly to multivariate time series prediction. First, the neural network models used in previous studies have specialized structures that enable them to process a single sequence. Early studies using deep learning models mainly composed a multivariate time series prediction framework in which identical neural networks are collocated in parallel, as many as the number of variables. Recent studies have designed models to deal with these problems, but they […] one-dimensional CNN layer. The encoder extracts the compressed spatiotemporal features from the given multivariate […] [30], [31], [32].

Existing multivariate prediction studies have demonstrated good prediction performance through extensive experiments. However, the aforementioned models do not provide a precise way to implicitly condense the information in the entire multivariate dataset. Owing to the nature of time series forecasting, in which a longer time dimension is advantageous for prediction [33], condensing such extensive data is crucial. To address this issue, we designed a multivariate time series prediction model with an improved transformer structure.

This section introduces the terms and notations used in this study, and defines the multivariate time series prediction problem before explaining the proposed model.

Let X_t be the multivariate time series data at time point t to be input into the time series prediction framework. This is defined by (1):

$$X_t = \{x_{t-m+1}, x_{t-m+2}, \dots, x_t\} \in \mathbb{R}^{m \times n} \quad (1)$$
Here, m is the length of the input multivariate time series data, and x_t is the multivariate vector at time point t. A multivariate vector x_t contains n variables for a single time point; the value of n depends on the dataset used. This can be expressed by (2):

$$x_t = \left[x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(n)}\right] \in \mathbb{R}^n \quad (2)$$
Note that the numbers in parentheses represent enumerated variables that are not based on any specific criteria.

FIGURE 1. Description of the input data and target data of the multivariate time series prediction problem. A multivariate vector x_{t+r} ∈ R^n is inferred from the real time series data X_t ∈ R^{m×n}.
In summary, the multivariate time series input X_t ∈ R^{m×n} represents the historical multivariate data for a certain period from (t − m + 1) to t.

In this paper, we aim to predict a multivariate vector x_{t+r} with the given X_t = {x_{t−m+1}, …, x_t} as input. Note that r indicates the output interval, which specifies a particular future point. The inputs and targets of the defined problem are shown in Fig. 1.
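As a concrete illustration of these shapes, the following minimal NumPy sketch builds a hypothetical input window and target vector; the variable count n = 8 is an arbitrary choice for illustration:

```python
import numpy as np

# Hypothetical sizes: m historical steps, n variables, output interval r
# (the paper's defaults are m = 90 and r = 90; n = 8 is arbitrary here).
m, n, r = 90, 8, 90

# X_t in R^{m x n}: the historical multivariate window from (t - m + 1) to t.
X_t = np.random.randn(m, n)

# x_{t+r} in R^n: the multivariate vector to be predicted r steps ahead.
x_target = np.random.randn(n)

print(X_t.shape)       # (90, 8)
print(x_target.shape)  # (8,)
```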

We propose a novel transformer model for the defined multivariate time series prediction problem. The overall structure of the model is illustrated in Fig. 2. […] Note that x_i indicates the multivariate vector of X_t at the ith time point. This procedure inserts positional information of multiple dimensions (d) for a single instance, with the given input data X_t shaped n × m.
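Since the model follows the transformer of [8], the positional information is presumably injected with the standard sinusoidal encoding; the sketch below shows that scheme under this assumption (that the paper uses exactly this encoding is not confirmed by the text):

```python
import math
import torch

def positional_encoding(seq_len: int, d: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding from the transformer paper [8].

    Returns a (seq_len, d) tensor. Assumes d is even, as with the paper's
    stated encoding dimension d = 64.
    """
    pe = torch.zeros(seq_len, d)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Frequencies decay geometrically across the encoding dimensions.
    div_term = torch.exp(
        torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```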
A 1D CNN is a neural network that performs a convolution operation with a 1D filter. While the conventional (two-dimensional) CNN is mainly used for image processing, the 1D CNN is used for 1D data tasks such as electronic signal processing and audio data analysis.

The 1D CNN is also widely used in time series prediction research. However, the time series forecasting problem differs significantly from other signal processing tasks in that future data cannot be referenced for prediction, and long-term dependencies arise owing to periodic features. For these reasons, variations of the 1D CNN are utilized in designing time series forecasting models. In this study, a variation of the 1D CNN is applied to interpret the temporal features of the multivariate time series input.

The dilated convolution method can be used to handle sequential data. This convolution is a variation of the 1D convolutional operation that compresses long sequences of information. By taking only a certain portion of the features from the previous layer, the amount of computation over the input values calculated at each layer can be reduced. The concrete formula for the dilated convolution operation is given by (4):

$$(F *_l k)(p) = \sum_{s + l \cdot t = p} F(s)\, k(t) \quad (4)$$
Here, F(·) indicates the input feature, k(·) is the convolutional filter, and l is the dilation rate. Note that p, s, and t refer to the positions of the features. The number of output features is reduced to 1/l because only 1/l of the total calculation is performed. With deeper dilation layers, the neural network can cover a wider receptive field, leading to the analysis of a longer sequence input.
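The following minimal PyTorch sketch illustrates the effect of the dilation rate; the channel count, kernel size, and dilation value are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A dilated 1D convolution: with dilation rate l, the filter taps are spaced
# l steps apart, so stacked layers widen the receptive field cheaply.
dilated_conv = nn.Conv1d(
    in_channels=8,   # n variables (hypothetical)
    out_channels=8,
    kernel_size=3,
    dilation=2,      # dilation rate l
)

x = torch.randn(1, 8, 90)  # (batch, variables, time), m = 90
out = dilated_conv(x)
print(out.shape)           # torch.Size([1, 8, 86]); each output sees a span of 5 steps
```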

The causal convolution is a 1D CNN technique reflecting the fact that current data are affected only by past data. Whereas the normal convolutional operation computes over values adjacent to each other in both filter directions, the causal convolution operation proceeds over values adjacent to each other in one direction (the past) for each filter. The causal convolution operation through the convolution filter is given by (5):

$$(F * k)(p) = \sum_{\substack{s + t = p \\ s \le p}} F(s)\, k(t) \quad (5)$$
The notations in (5) are the same as those in (4).

The attention mechanism plays a key role in the transformer model. Determining the correlation between elements of a single input is called self-attention. This mechanism can determine the relevance of features that are far from each other in the time domain, which recurrent models are structurally unable to extract; it is therefore mainly used in natural language processing to determine the relevance between elements. This study utilizes the self-attention mechanism to analyze the correlation between variables in multivariate time series data.

The initial input of the attention layer is converted into a Query, Key, and Value. A Query indicates the data affected by a particular value, and a Key represents the data that affects a particular data value; the Value expresses the weight of the influence. The Query, Key, and Value are computed by multiplying the same initial input by independent weight matrices for each output. With the positional-encoded input data X_t, the procedure for gathering the corresponding Query, Key, and Value is given by (6):

$$Q_t = X_t W^Q, \quad K_t = X_t W^K, \quad V_t = X_t W^V \quad (6)$$

The Query, Key, and Value gathered using (6) are transmitted to the attention layer. The attention layer obtains the attention score using the input Query and Key. The attention score, which reflects which variable has a significant effect on another variable in the multivariate input, is computed from the Query and Key. Note that the dimensions of Q_t, K_t, and V_t are n × m, equal to those of the initial input. In this study, the dot-product method proposed by Luong et al. [34] was used for the attention score computation. We gathered the attention score Score(Q_t, K_t) through Q_t and K_t using (7):

$$\mathrm{Score}(Q_t, K_t) = \mathrm{Tanh}\left((Q_t W_q)(K_t W_k)^{\top} + b\right) \quad (7)$$
Note that W_q and W_k are weight matrices that are additionally applied to Q_t and K_t, respectively; b is the bias, and Tanh(·) is the activation function (the hyperbolic tangent). The obtained Score(Q_t, K_t) has dimension n × n.

We then obtain the Content C_t by applying the attention score to V_t, as given by (8):

$$C_t = \mathrm{Score}(Q_t, K_t)\, V_t \quad (8)$$
The final result of the attention layer, C_t, has a matrix form of n × m.
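To make the procedure concrete, here is a minimal PyTorch sketch of the variable-wise self-attention with the stated n × m shapes; the exact placement of the weight matrices W_q, W_k and the bias b in (7) is an assumption consistent with the dimensions given above, not the paper's verified code:

```python
import torch
import torch.nn as nn

class VariableSelfAttention(nn.Module):
    """Sketch of variable-wise attention: input X_t has shape (n, m)."""

    def __init__(self, m: int):
        super().__init__()
        # Projections producing Q_t, K_t, V_t, each of shape (n, m) (eq. (6)).
        self.w_query = nn.Linear(m, m, bias=False)
        self.w_key = nn.Linear(m, m, bias=False)
        self.w_value = nn.Linear(m, m, bias=False)
        # Additional W_q, W_k, and bias b used inside the score (eq. (7)).
        self.score_proj_q = nn.Linear(m, m, bias=True)
        self.score_proj_k = nn.Linear(m, m, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n, m)
        q, k, v = self.w_query(x), self.w_key(x), self.w_value(x)
        # Tanh of a dot-product score, following the paper's use of [34];
        # the resulting score matrix is n x n.
        score = torch.tanh(self.score_proj_q(q) @ self.score_proj_k(k).T)
        return score @ v  # content C_t: (n, m) (eq. (8))
```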

The difference between a fully-connected neural network and the PCNN is shown in Fig. 4.

The key role of the encoder layer is to extract spatiotemporal features […] while maintaining spatial information. Finally, the two-layer DCCNN (dilated causal CNN) compresses the overall feature size while extracting the temporal features from the processed information. By stacking multiple deep encoder layers, the entire encoder block can better interpret the complex patterns of a given multivariate time series input.

Additionally, we considered the Add and Norm process between two neighboring component layers. The Add and Norm process consists of a residual connection and a layer normalization operation. The residual connection links the results of each component layer with the features before that layer. Layer normalization computes the mean and variance of the features within each individual instance rather than across a batch. Unlike other normalization techniques such as batch normalization and weight normalization, layer normalization is advantageous for handling sequential data features. These two methods prevent the vanishing gradient issue in the stacked encoder networks.
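A minimal sketch of the Add and Norm step (residual connection followed by layer normalization) might look as follows; the feature size is a placeholder:

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization over the features."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(feature_dim)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Add the component layer's output back to its input, then normalize.
        return self.norm(x + sublayer_out)
```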

The final output of the entire encoder block is a spatiotemporal feature whose size is much smaller than that of the initial input. This feature is passed to the decoder block for the prediction-generation procedure.

The decoder generates the final prediction result using the spatiotemporal features extracted by the encoder. In our transformer model, a single decoder layer consists of components in a different order from those of an encoder layer. This asymmetric structure is empirically designed to output accurate prediction results from a given input using one-way analysis. One decoder layer consists of the following components: a DCCNN, self-attention, and a PCNN. Similar to the encoder block, the decoder block is stacked with multiple decoder layers to generate the final prediction result from the initial input. The detailed specifications of the decoder layer are listed in Table 2.
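The following structural sketch illustrates the stated component order (DCCNN, then self-attention, then PCNN), reusing the VariableSelfAttention sketch from above. All modules are placeholders; in particular, the PCNN is approximated here as a position-wise (kernel-size-1) convolution, which is an assumption rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayer(nn.Module):
    """Structural sketch of one decoder layer: DCCNN -> self-attention -> PCNN."""

    def __init__(self, n_vars: int, m: int):
        super().__init__()
        self.dccnn = nn.Conv1d(n_vars, n_vars, kernel_size=3, dilation=2)
        self.attention = VariableSelfAttention(m)  # from the earlier sketch
        self.pcnn = nn.Conv1d(n_vars, n_vars, kernel_size=1)  # assumed PCNN stand-in
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(m) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_vars, m)
        # Dilated causal CNN stage: left-pad so outputs only see the past.
        h = F.pad(x.unsqueeze(0), (4, 0))                 # (1, n_vars, m + 4)
        h = self.norm1(x + self.dccnn(h).squeeze(0))      # Add & Norm
        h = self.norm2(h + self.attention(h))             # Add & Norm
        h = self.norm3(h + self.pcnn(h.unsqueeze(0)).squeeze(0))
        return h
```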

The first decoder layer receives the positional-encoded multivariate time series data as the input. The 2-layer […]

We standardized the datasets. The multivariate time series data were standardized for every time series of a single variable. Model learning was stabilized by fixing variables with different numerical scales to a constant scale (mean 0 and variance 1). Along the time axis of the preprocessed dataset, the front 80% was used as the training data, and the rear 20% was used as the test data.

The proposed framework forecasts the expected multivariate vector from the given historical multivariate time series, as mentioned in Section III-B. We prepared the training and test data by splitting the given dataset into data segments. The data segments, consisting of the input matrix and output vector, were created using the sliding window method, which slides a window one time unit at a time over the entire dataset period. For example, with the length of the given data L and a window of size (m + r), a total of L − (m + r) + 1 data segments are created. Note that the first m multivariate vectors of an (m + r)-sized window form the input (X_t), and the (m + r)-th multivariate vector is the ground truth (x_{t+r}). The process of creating data segments in a sliding-window manner for a given dataset is shown in Fig. 5.
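A minimal sketch of this preprocessing pipeline (per-variable standardization, chronological 80/20 split, and sliding-window segmentation) might look as follows; the dataset itself is synthetic for illustration:

```python
import numpy as np

def make_segments(data: np.ndarray, m: int, r: int):
    """Standardize per variable, then cut sliding-window segments.

    data: full series of shape (L, n). Returns inputs of shape
    (L - (m + r) + 1, m, n) and targets of shape (L - (m + r) + 1, n).
    """
    # Per-variable standardization along the time axis: mean 0, variance 1.
    data = (data - data.mean(axis=0)) / data.std(axis=0)

    inputs, targets = [], []
    for start in range(len(data) - (m + r) + 1):
        window = data[start : start + m + r]   # one (m + r)-sized window
        inputs.append(window[:m])              # first m rows -> X_t
        targets.append(window[m + r - 1])      # (m + r)-th row -> x_{t+r}
    return np.stack(inputs), np.stack(targets)

# Chronological 80/20 split, as in the paper's setup.
series = np.random.randn(1000, 8)              # hypothetical dataset (L = 1000, n = 8)
split = int(len(series) * 0.8)
train_x, train_y = make_segments(series[:split], m=90, r=90)
test_x, test_y = make_segments(series[split:], m=90, r=90)
```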

As in a previous study [8], each of the encoder and decoder blocks consists of six layers. The number of encoding dimensions mentioned in Section IV-A1 was set to 64. We define the default input length (m) and output interval (r) as 90 and 90, respectively. The batch size of the input data was set to 32, and the number of training epochs was set to 100. We utilize the Adam optimizer [36] with a learning rate of 0.001. The loss function of the model is the mean squared error (MSE).
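These stated settings translate into a training loop along the following lines; the model here is a stand-in for the proposed transformer, and train_x/train_y come from the segmentation sketch above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Batches of size 32, as stated in the setup.
dataset = TensorDataset(torch.as_tensor(train_x, dtype=torch.float32),
                        torch.as_tensor(train_y, dtype=torch.float32))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = torch.nn.Linear(90 * 8, 8)  # placeholder for the proposed transformer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam [36], lr 0.001
loss_fn = torch.nn.MSELoss()        # MSE loss, as stated

for epoch in range(100):            # 100 training epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb.flatten(1)), yb)  # stand-in forward pass
        loss.backward()
        optimizer.step()
```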

In this study, the root-mean-square error (RMSE), root relative squared error (RRSE), and correlation coefficient (CORR) were used as evaluation metrics. These three indicators are the evaluation metrics mainly used in existing time series prediction and regression studies. Their definitions are discussed below.

FIGURE 5. Illustration of the process of creating data segments. A data segment consists of an input matrix X_t and an output vector x_{t+r}. Note that X_t ∈ R^{m×n} and x_{t+r} ∈ R^n.
The RMSE is the positive square root of the mean squared error, as given by (9):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j,t}\left(\hat{x}_t^{(j)} - x_t^{(j)}\right)^2} \quad (9)$$

The RRSE differs from the RMSE in that the squared error is divided not by the total amount of test data but by the statistic of the squared deviation from the mean, as given by (10):

$$\mathrm{RRSE} = \sqrt{\frac{\sum_{j,t}\left(\hat{x}_t^{(j)} - x_t^{(j)}\right)^2}{\sum_{j,t}\left(x_t^{(j)} - \bar{x}_j\right)^2}} \quad (10)$$

Here, N is the total amount of test data, and \bar{x}_j is the average value of the jth variable.

The CORR is the correlation coefficient between the actual and predicted values; it indicates whether the overall trend of the test data is predicted well. This metric is less sensitive than the RMSE and RRSE. The formula for calculating the CORR is shown in (11):

$$\mathrm{CORR} = \frac{1}{n}\sum_{j=1}^{n} \frac{\sum_{t}\left(x_t^{(j)} - \bar{x}_j\right)\left(\hat{x}_t^{(j)} - \bar{\hat{x}}_j\right)}{\sqrt{\sum_{t}\left(x_t^{(j)} - \bar{x}_j\right)^2 \sum_{t}\left(\hat{x}_t^{(j)} - \bar{\hat{x}}_j\right)^2}} \quad (11)$$

Note that \bar{\hat{x}}_j stands for the average value of the predicted jth variable.
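Under standard definitions of these three metrics (the exact aggregation used in the paper is assumed), they can be computed as follows for prediction and ground-truth arrays of shape (T, n):

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Positive square root of the mean squared error.
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rrse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Squared error normalized by the squared deviation from the mean.
    num = np.sum((y_pred - y_true) ** 2)
    den = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return float(np.sqrt(num / den))

def corr(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean per-variable correlation between predictions and ground truth.
    t = y_true - y_true.mean(axis=0)
    p = y_pred - y_pred.mean(axis=0)
    num = (t * p).sum(axis=0)
    den = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    return float(np.mean(num / den))
```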
A comparison between the predicted and ground-truth data is shown in Fig. 6.

We observed the prediction results based on the dataset features. For the exchange rate dataset, the model predicted the overall trend with a small error. In addition, it predicted periodic data with high accuracy, as seen in the Electricity_var301 case. The proposed model showed accurate prediction results even for data in which specific values repeat at regular intervals (the solar energy dataset). However, the model forecasted the expected value slightly inaccurately when dealing with data with large variance or rapidly changing data, such as Traffic_var175 and Electricity_var57.

We also determined that the forecasting model predicted similar values with a slight lag from the actual values, regardless of the data characteristics. This error occurs when a distant time point is inferred from the input data. Nevertheless, the proposed model predicted the result with a time lag smaller than the defined output interval (r = 90).

a: COMPARISON WITH OTHER WORKS

We designed a comparative experiment with existing prediction models to evaluate the objective performance of the proposed forecasting model. The comparison models are the general transformer [8] and the latest time series prediction models, namely the LogSparse Transformer [33], Informer [37], LSTNet [35], and SpringNet [38]. The comparative test results were evaluated based on the three evaluation metrics mentioned in Section V-A4. The input/output shapes of all models were the same, m = 90 and r = 90. The experimental results are listed in Table 4.

As shown in Table 4, the prediction performance of the proposed model is almost the same as that of Informer and SpringNet. The result on the solar energy dataset is slightly inferior to those of other recent models because the proposed model struggles to predict accurate (non-zero) values for data in which zero and non-zero values appear periodically. However, we note that our prediction model showed slightly better accuracy than the other prediction models on the traffic and electricity datasets. As described in Table 3 and Fig. 6, these two datasets contain many variables, and the patterns of the variables are more similar to each other than in the other two datasets (exchange rate, solar energy). Therefore, considering the precise forecasting results for this dataset type, it is apparent that the designed prediction model can interpret multivariate data with a higher correlation between variables better than the other existing forecasting models.

b: COMPUTATION COSTS

We checked how much computation the designed model requires. We additionally report the FLOPs (floating-point operations) of the proposed model and the other comparative models, measured using exactly the same computing resources. The results are shown in Table 5. We observe that the proposed model has more FLOPs than the general transformer model; still, its computational cost is slightly lower than that of the prediction models of the latest studies.

[…] The hyperparameter settings were the same as those of the originally proposed model. The prediction results of this study are shown in Fig. 7. We observed that the prediction performance of the redesigned model degraded, regardless of the dataset used. In particular, the prediction accuracy for the exchange rate dataset, which has a trend rather than periodicity, drastically deteriorated (by approximately 13-15%). From this result, it is apparent that the PCNN layers of the proposed model, which preserve the spatiality of the hidden states, have a substantial effect on predicting the overall trend.

b: ENCODER-DECODER STRUCTURE

We proposed an asymmetric transformer model whose layer orders in the encoder and decoder are different. The performance […] according to the structural transformation of the model is shown in Fig. 7.

The results on the CORR metric in the referenced figure were slightly worse than those of the original model. However, for the RMSE and RRSE metrics, the performance was significantly lower than that of the original model (by approximately 10-12%). Thus, the proposed asymmetric model predicted individual values better than the symmetric-structured model.

FIGURE 9. Evaluation metric changes for each dataset according to the output interval.

In addition, we evaluated how the performance of the proposed model changes with the input length. In this experiment, we performed the test while the output interval was fixed and the input length was varied: r was fixed at 30, and the experiment was performed while changing m to 10, 30, 60, and 90. The experimental results are shown in Fig. 8.

There was no significant difference in accuracy between the prediction experiments with input lengths of 60 and 90, except for the exchange rate dataset. We observed that the performance improved with a longer input, especially for data with a long-term trend, such as the exchange rate dataset. For the electricity and solar energy datasets, which have a distinct 24-hour cycle, the prediction outcome with m = 30 showed a significant performance improvement in all metrics compared with the result with m = 10.

The results of this experiment show that the proposed model is significantly affected by the periodicity of the input data. In addition, we observed that the input length and performance are directly proportional for data showing a long-term trend.

As in the previous experiment, we checked how the performance of the designed model changed as the output interval increased. m was fixed at 30, and r was changed to 10, 30, 60, and 90 to compare the forecasting performance for the near future and the far future. The results of the prediction tests are shown in Fig. 9.

As shown in Fig. 9, the model performs well in predicting a point in the near future (r = 10) relative to the input length, while sharp decreases in performance on the CORR metric are observed at r = 30. However, except for the RRSE on the traffic dataset, there was no significant performance deterioration between r = 60 and r = 90, which are tasks that predict a future point farther away than the input length.

We observed that the performance of the proposed model decreased as the prediction interval increased, particularly in terms of the overall trend. Furthermore, compared with the prediction results in Section V-B1, a longer input length is required to forecast distant future points.

VI. CONCLUSION

We presented a multivariate time series prediction model that combines a convolutional neural network with a transformer structure. The proposed model simultaneously analyzes the correlation between input variables and the temporal features of a given multivariate time series in one model, which existing methodologies have difficulty doing. In addition, we performed experiments using several time series datasets with different data characteristics to demonstrate the superior predictive performance of the designed model. The results of the extensive experiments proved that the proposed model enables accurate multivariate prediction at a future time point for many variables and long sequences.