Umformer: A Transformer Dedicated to Univariate Multistep Prediction

Univariate multi-step time series forecasting (UMTF) has many applications, such as forecasting access traffic. A solution to the UMTF problem needs to efficiently capture the key information in univariate data and improve the accuracy of multi-step forecasting. The advent of deep learning (DL) enables multi-level, high-performance prediction from complex multivariate inputs, but research on the UMTF problem remains extremely scarce, and existing methods cannot satisfy current univariate forecasting tasks in terms of accuracy and efficiency. This paper proposes a Transformer-based univariate multi-step forecasting model: Umformer. The contributions are: (1) To maximize the information obtained from a single variable, we propose a Prophet-based method for variable extraction that additionally derives correlated variables for accurate prediction. (2) A gated linear unit variant with three weight matrices (GLUV3) is designed as a gate that improves selective memory over long sequences, thereby obtaining more helpful information from the limited number of univariate-derived variables and improving prediction accuracy. (3) A Shared Double-head ProbSparse Attention (SDHPA) mechanism reduces memory footprint and improves attention awareness. We combine recent DL research results to achieve high-precision prediction for UMTF. Extensive experiments on public datasets from five different domains, evaluated on five metrics, show that Umformer is significantly better than existing methods. We thus offer a more efficient solution for UMTF.

methods usually use the TSD from the previous steps as model input and the data from the next step as labels. Due to the minimal number of features in univariate data, traditional models for the UMTF problem learn very little information, resulting in inaccurate results. Moreover, other challenges remain. For example, applying the latest multivariate time series forecasting methods [4], [5], [6] to specific univariate forecasting is hard, including extracting features and improving accuracy.

UMTF currently faces two key challenges: the first is to effectively extract, from the limited data available, the features that influence prediction; the second is to improve the multi-step prediction accuracy of the model to meet the current and growing needs of society. These challenges require researchers to provide:

1) a mechanism for univariate feature extraction that scientifically extracts feature variables from TSD;

... of exogenous variables for prediction [18]. This paper mainly studies methods for univariate sequence prediction.

Autoregression (AR) [19] is a linear regression model that describes a random variable at a future time as a linear combination of random variables at particular times in the preceding period, and it was the standard method in early time series forecasting [20], [21]. However, it places high requirements on data autocorrelation and can only be used to predict scenarios that are heavily influenced by historical factors, not those heavily influenced by social, natural, and other factors. The moving average (MA) model [22], [23] uses a moving average of white noise to simulate TSD and calculates the average of the historical data as the forecast for the next period. When there are many forecasts to make, a large amount of data needs to be stored, and many studies have shown that the forecast accuracy of moving averages is low [24]. The autoregressive moving average (ARMA) model [25], [26] combines the AR and MA models to reduce the number of past parameters, and the autoregressive integrated moving average (ARIMA) model [27], [28], [29], [30] regresses the lagged values of the dependent variable on the present and lagged values of the random error term; both require the TSD to be stationary after differencing and can only capture linear relationships. Exponential smoothing (ES) [31], [32], [33] introduces a smoothing factor, a simplified weighting factor, to obtain a series of averages based on the actual quantity and the forecast quantity for the current period of a particular indicator. It is a particular weighted average method in which historical data closer to the forecast period is given a larger weight, and the weights decrease exponentially. However, it cannot discriminate data turning points and is mainly used for short-term forecasting.

In recent years, models based on deep learning have been used for time series forecasting, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The LSTM [34], [35], a special kind of RNN, is currently used in practical forecasting applications, predicting the future by selectively memorizing sequences [36]. However, one of the main limitations of using LSTM to predict time series is that the model relies heavily on asymptotic forecasting, so long-range forecasting may not be effective. Moreover, it is highly prone to hysteresis [37]. In recent years, the Transformer [38], [39]

To begin with, we need to define the UMTF problem. The model input is X = {x_{t−n+1}, x_{t−n+2}, ..., x_t}, where each x is univariate data observed before the current timestamp t. The prediction model then outputs the predicted values Y = {y_{t+1}, y_{t+2}, ..., y_{t+m}}, where each y in the model output Y is the value of the data at an equal time difference after the current timestamp t. We assume that n observations are input to the model, which predicts values for m future time steps. Specifically, we input n timestamps and the corresponding label values up to the current timestamp t, and the model predicts and outputs the values at m future time steps to complete the prediction.

This paper is dedicated to addressing the challenges of current univariate prediction methods and finding optimal solutions. We propose the Umformer framework (Figure 1), which is unique and novel in terms of feature engineering, model construction, etc. The major components of Umformer are:

(1) Univariate TSD feature engineering: this includes Prophet-based feature decomposition, data pre-processing, data classification, and feature selection. The consideration of multiple variables greatly improves the effectiveness of univariate prediction.

(2) Sequence to sequence: the encoder and decoder are used for the input and output of the time series respectively, predicting the change in TSD based on the preceding inputs.

(3) GLUV3: a gate that improves selective memory over long sequences, thereby obtaining more helpful information from a limited number of variables.

(4) Transformer decoder: the designed Static Variable Enhancement GRN (SVEGRN) considers the effect of seasonal variables, and the proposed SDHPA reduces memory footprint and improves attention awareness, increasing prediction accuracy.

The time series is decomposed into the following parts: the period term s(t) represents the periodicity in weeks or years; the trend term g(t) represents the non-periodic trend of the time series; the holiday term h(t) indicates whether the day contains holidays; and the remaining term ε_t is the error term, also called the residual term. With y(t) = s(t) + g(t) + h(t) + ε_t, the Prophet algorithm fits these terms and then sums them to obtain the predicted value of the time series. We apply the variable processing method of this algorithm to data preprocessing and decompose the data to obtain multivariate time series data.
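To make the decomposition step concrete, the following is a minimal sketch of how Prophet's fitted components could be extracted and attached to the univariate series as extra features. The file name and the component columns (`trend`, `weekly`, `yearly`) are assumptions that hold when the corresponding seasonalities are enabled; a holiday column appears only if a holiday calendar is supplied, so it is omitted here.

```python
import pandas as pd
from prophet import Prophet

# The univariate series: a timestamp column 'ds' and a value column 'y'.
df = pd.read_csv("series.csv", parse_dates=["ds"])

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.fit(df)

# Predicting on the training timestamps yields the fitted components,
# which can serve as additional (pseudo-multivariate) input features.
components = m.predict(df[["ds"]])
features = df.merge(
    components[["ds", "trend", "weekly", "yearly"]], on="ds", how="left"
)
features["residual"] = features["y"] - components["yhat"]
```

The decomposed columns can then be fed to the forecasting model alongside the raw series, which is the spirit of the Prophet-based feature engineering described above.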

In conclusion, we have addressed one of the bottlenecks of univariate time series forecasting: univariate TSD cannot be predicted with existing multivariate time series forecasting models because it has too few correlated variables. We extracted important relevant variables through the Prophet-based feature extraction method and added periodic variables, which improves the predictive ability of the model.

1) Static seasonal variables are seasonally relevant characteristics. As seasonal variables remain constant over a given stretch of continuous-time data, they can be used as static covariates to control the overall situation. The model will then pay more attention to changes in data characteristics within the same season.

GRN_ω(i, c) = LayerNorm(i + GLU_ω(η_1)), η_1 = W_{1,ω} η_2 + b_{1,ω}, η_2 = ELU(W_{2,ω} i + W_{3,ω} c + b_{2,ω})  (2)

where ELU is the exponential linear unit (ELU) activation function, η_1, η_2 ∈ R^{d_model} are intermediate layers, LayerNorm is standard layer normalization, and ω is an index denoting weight sharing. The ELU activation acts as an identity function when W_{2,ω} i + W_{3,ω} c + b_{2,ω} ≫ 0, and produces a constant output when W_{2,ω} i + W_{3,ω} c + b_{2,ω} ≪ 0, resulting in linear layer behaviour. We present a GLUV3-based component gating layer to suppress any part of the structure that is not necessary for a given dataset. Thus the GRN can play the role of variable feature selection.

At each time step, an additional non-linear processing layer is applied: each variable j is fed through its own GRN to obtain p̃^{(j)}_T, the processed feature vector for variable j. Each variable has its own calculation, sharing weights across all time steps T. The processed features are weighted and combined according to their variable selection weights as follows:

p̃_T = Σ_j ω^{(j)}_{vs,T} p̃^{(j)}_T

where ω^{(j)}_{vs,T} is the j-th element of the vector ω_{vs,T}.
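For readers who prefer code, below is a minimal PyTorch sketch of a gated residual network in the spirit of the GRN described above (and of TFT). The layer sizes, the use of a plain sigmoid-gated GLU, and the optional context input are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedResidualNetwork(nn.Module):
    """GRN sketch: ELU feed-forward -> linear -> GLU gate -> residual + LayerNorm."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fc_input = nn.Linear(d_model, d_model)                 # W2 applied to the input i
        self.fc_context = nn.Linear(d_model, d_model, bias=False)   # W3 applied to the context c
        self.fc_hidden = nn.Linear(d_model, d_model)                 # W1 applied to eta_2
        self.gate = nn.Linear(d_model, 2 * d_model)                  # GLU: value and gate halves
        self.norm = nn.LayerNorm(d_model)

    def forward(self, i: torch.Tensor, c: torch.Tensor = None) -> torch.Tensor:
        eta2 = self.fc_input(i) + (self.fc_context(c) if c is not None else 0)
        eta2 = nn.functional.elu(eta2)
        eta1 = self.fc_hidden(eta2)
        value, gate = self.gate(eta1).chunk(2, dim=-1)
        # gated residual connection followed by layer normalization
        return self.norm(i + value * torch.sigmoid(gate))
```

One such GRN per variable, followed by the softmax-weighted combination above, is what lets the network down-weight uninformative inputs.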

The strengths of TFT are mainly reflected in the feature selection performed by the GRN, which is similar to principal component analysis (PCA), and in its explainable multi-head self-attention mechanism. Moreover, its GRN, acting as a threshold device in TFT, is essentially a replacement for the Dense layer. Compared with the Dense layer, however, it extracts the effective components and improves the performance and learning efficiency of the model.

The TFT framework is shown in Figure 3. Different types of variables are fed into the corresponding variable selection networks. After the sequence-to-sequence model, the multi-head self-attention mechanism obtains the weight of each variable, which passes through multiple gate and GRN layers and finally produces the multi-step time series prediction.

Next, in order to make accurate predictions for specific data after univariate feature decomposition, this work improves the GLU layer and the multi-head attention mechanism to increase the accuracy of multi-step prediction. We propose GLUV3 to replace the original GLU, improving the ability to retain important information in chronological order when processing data; the capacity to selectively forget and remember information is also enhanced.
GLUV3(a) = (a * W_1 + χ) ⊗ (a * W_2 + α) ⊗ Gelu(a * W_3 + β)

where a indicates the output of the previous layer and the input of this layer; W_1, W_2, and W_3 are the convolution kernel parameters; α, β, and χ are the bias parameters; and Gelu is the GELU activation function [65]. We replace the Sigmoid activation with a GELU activation and add a one-dimensional product calculation.

To address the vanishing gradient of the Sigmoid function, we selected the GELU function. As an activation function that adjusts the output through a gating mechanism, the stochastic regularization idea behind GELU can more conveniently improve the speed of gradient descent and learning. No matter how large the input value is, its derivative does not tend to 0, which to a certain extent avoids the vanishing gradient problem, and its fitting ability is faster and better than Sigmoid's.

Figure 4 shows the specific structure of GLUV3. The input of this layer is a series of continuous TSD, with a vector representing the original series data. The hidden layer is calculated according to the formula above.

The output of each layer is a linear projection a * W_2 + α modulated by the GELU gate Gelu(a * W_3 + β). Similar to LSTM, these gates multiply the matrix element-wise. In addition, we add an overall linear projection, so each layer passes through three weight matrices, improving the accuracy of the calculation.
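As a concrete illustration, here is a minimal PyTorch sketch of a three-matrix gated unit of the kind described above. The exact composition of the three branches (element-wise product of an overall projection, a value projection, and a GELU-gated projection) is inferred from this description and should be treated as an assumption rather than the paper's definitive implementation.

```python
import torch
import torch.nn as nn

class GLUV3(nn.Module):
    """Gated linear unit variant with three weight matrices (sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.overall = nn.Linear(d_model, d_model)  # overall projection: a*W1 + chi
        self.value = nn.Linear(d_model, d_model)    # value branch:       a*W2 + alpha
        self.gate = nn.Linear(d_model, d_model)     # GELU-gated branch:  a*W3 + beta

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # element-wise product of the three branches, with a GELU gate
        return self.overall(a) * self.value(a) * nn.functional.gelu(self.gate(a))
```

Wrapping this unit in a pre-activated residual block, as described next, keeps the identity path intact when the gate suppresses the branch.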

Our method wraps the convolution and the GLU in a pre-activated residual block and adds the input of the residual block to its output. One of the most effective choices is using the GELU layer and adding a one-dimensional weight matrix, which makes training and testing faster and more accurate.

In general, the attention mechanism is based on the relationship between the keys K ∈ R^{N×d_attn} and the queries Q ∈ R^{N×d_attn}, extended to the values V ∈ R^{N×d_attn}.
Here Q̄ is a sparse matrix of the same size as Q, containing only the u queries selected under the sparsity measurement M(q_i, K). Q_u is the matrix composed of the selected u queries q_i; the unselected q_i are initialized in the original Q_r matrix by taking the mean value after Attention(Q, K, V), and the non-zero values in the Q_u matrix are then updated into the Q_r matrix to obtain the final Q matrix. In fact, in the self-attention computation the input lengths of queries and keys are usually equal, i.e. L_Q = L_K = L_V, making the total time and space complexity of ProbSparse self-attention O(L ln L).
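The following is a simplified sketch of the query-selection step behind ProbSparse attention: a max-minus-mean sparsity measurement per query followed by top-u selection. In the original Informer formulation the measurement is computed on a random sample of keys for efficiency; here the full score matrix is used for clarity, so this is illustrative rather than the paper's implementation.

```python
import torch

def select_active_queries(Q: torch.Tensor, K: torch.Tensor, u: int):
    """Pick the u queries with the largest max-minus-mean attention scores.

    Q: (L_Q, d), K: (L_K, d). Returns the indices of the selected queries.
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5                   # (L_Q, L_K)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)    # M(q_i, K)
    return torch.topk(sparsity, u).indices

# Usage idea: only the selected queries attend normally; the remaining rows
# fall back to the mean of V, which keeps the cost near O(L ln L) when u ~ ln L.
```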

Multi-head attention is used to enable models to jointly focus on information from different representation subspaces at different locations. This is extremely important in the NLP domain for semantic extraction and can also be effective in the UMTF domain when adequately utilized, i.e.

MultiHead(Q, K, V) = [H_1, ..., H_{m_H}] W_H,  H_h = Attention(Q W_h^Q, K W_h^K, V W_h^V)

where W_h^K ∈ R^{d_model×d_attn}, W_h^Q ∈ R^{d_model×d_attn}, and W_h^V ∈ R^{d_model×d_attn} are head-specific weights for keys, queries, and values, and W_H ∈ R^{(m_H·d_V)×d_model} combines the outputs of all heads H_h. As mentioned above, because each head uses different values, the attention weights alone do not guarantee that the importance of a particular feature is reflected and exploited. Therefore, interpretability is enhanced by modifying multi-head attention to share values across heads and additively aggregate all heads, yielding a two-head attention mechanism.

The formula for the double-head attention is as follows:

Doublehead(Q, K, V) = H̃ W_H,  H̃ = (1/2) Σ_{h=1}^{2} Attention(Q W_h^Q, K W_h^K, V W^V)

where W^V ∈ R^{d_model×d_v} are the value weights shared across heads, and W_H ∈ R^{d_attn×d_model} is used for the final linear mapping. Different temporal patterns can thus be learned by the two heads while attending to a common set of input features, which can be interpreted as a simple aggregation of attention.
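Below is a minimal PyTorch sketch of a two-head attention layer with a shared value projection and additive aggregation, in the spirit of the shared-value mechanism described above. Dimensions and the plain scaled-dot-product scoring are assumptions; the paper's SDHPA additionally applies the ProbSparse query selection, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class SharedValueDoubleHeadAttention(nn.Module):
    """Two heads sharing one value projection, averaged before the output map."""
    def __init__(self, d_model: int, d_attn: int):
        super().__init__()
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_attn) for _ in range(2)])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_attn) for _ in range(2)])
        self.v_proj = nn.Linear(d_model, d_attn)   # value weights shared across heads
        self.out = nn.Linear(d_attn, d_model)      # final linear mapping W_H

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        shared_v = self.v_proj(v)
        heads = []
        for h in range(2):
            scores = self.q_proj[h](q) @ self.k_proj[h](k).transpose(-2, -1)
            scores = scores / shared_v.size(-1) ** 0.5
            heads.append(torch.softmax(scores, dim=-1) @ shared_v)
        # additive aggregation of the two heads before the output projection
        return self.out(torch.stack(heads).mean(dim=0))
```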

The final value Doublehead(Q, K, V), output by the attention mechanism, is passed to the next level of computation.

In order to consider the effect of seasonal variables on forecasts globally, the SVEGRN we built integrates the c_e variables and the sequence-to-sequence outputs to improve the efficiency with which the model fits the variables. We also apply a gated residual connection that skips the entire Transformer module, providing a direct path to the seq2seq layer.

The quantile loss is defined as

QL(y, ŷ, q) = q · max(0, y − ŷ) + (1 − q) · max(0, ŷ − y)
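A minimal sketch of the per-quantile (pinball) loss as written above, assuming targets and predictions are tensors of the same shape; the aggregation across multiple quantiles and horizons used for training is not shown.

```python
import torch

def quantile_loss(y: torch.Tensor, y_hat: torch.Tensor, q: float) -> torch.Tensor:
    """Pinball loss for a single quantile q in (0, 1)."""
    diff = y - y_hat
    # max(q * diff, (q - 1) * diff) equals q*(y - y_hat)_+ + (1 - q)*(y_hat - y)_+
    return torch.mean(torch.max(q * diff, (q - 1.0) * diff))

# Example: penalize under-prediction more heavily with q = 0.9
# loss = quantile_loss(targets, predictions, q=0.9)
```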

In order to compare the generalisation performance of different learning algorithms across the board, it is not enough to rely on a performance measure on a single dataset. We need to use hypothesis testing, which provides an important basis for our algorithm comparisons. We also generally need to compare the performance of multiple algorithms on multiple datasets, and here the Friedman test and the Nemenyi test are often used.

We tested the MSE of the results of each model on 185 long-sequence predictions. The hypothesis test rejected the hypothesis that the performance of the six algorithms does not differ across the five datasets. This indicates that the algorithms perform significantly differently, at which point a post-hoc test is required to further distinguish the algorithms. We calculated the critical difference CD = 3.372 for the mean rank values by the Nemenyi test, and the Friedman plot is shown in Figure 6. It demonstrates that the algorithms differ significantly and that Umformer has a clear advantage over the other algorithms.
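As a sketch of how such a comparison could be run (not the paper's exact analysis), the Friedman test is available in SciPy and the Nemenyi post-hoc test in the scikit-posthocs package. The array shape below, six algorithms scored on five datasets, mirrors the setup described above, and the random scores are placeholders only.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# rows = datasets (blocks), columns = algorithms; values are e.g. MSE scores
mse = np.random.rand(5, 6)  # placeholder scores for illustration only

# Friedman test: do the six algorithms rank differently across the datasets?
stat, p_value = friedmanchisquare(*[mse[:, j] for j in range(mse.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

# Nemenyi post-hoc test: pairwise comparison of mean ranks
pairwise_p = sp.posthoc_nemenyi_friedman(mse)
print(pairwise_p)
```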

During the experiments, we specifically analyzed the influence of the various variables. We analyze all the features extracted in feature engineering and examine the weights of the variables affecting predictions. This enables us to analyze in depth the influence of each variable on the prediction results, and to select input variables more flexibly when solving practical problems.

As shown in Figures 7 and 8, the degree of influence of each variable on the prediction is displayed. We use a visualization tool to show the influence weight of different variables on the prediction result; this is the input after removing some variables with little influence. For different datasets, our input variables to the model differ. The results show that known future variables have an impact on the predicted results, especially some cyclical variables, which it is necessary to input into the model.

For the three datasets, we entered 11, 8, and 8 related variables respectively. It can be seen that the effect of periodic terms such as weeks and months is higher than that of other variables. For datasets with strong periodicity, it is necessary to include the periodic terms as inputs.

It is finally demonstrated that SDHPA and GLUV3, as an improved attention mechanism and gated linear unit, yield more significant prediction accuracy after the model is trained to a certain level. It is worth noting that Scaled Dot-Product Attention + GLUV2 may also be helpful in certain situations, which requires further research in the future. SDHPA and GLUV3 in the Umformer model are excellent structures to use for prediction.

Table 3 summarizes the prediction evaluation results for the three datasets across the six methods. As the demand for predictive power increases, we gradually lengthen the prediction time step. The best results are highlighted in bold. Compared with existing models, the results of Umformer for univariate prediction are satisfactory. When