Take an Irregular Route: Enhance the Decoder of Time-Series Forecasting Transformer

With the development of Internet of Things (IoT) systems, precise long-term forecasting methods are requisite for decision makers to evaluate current statuses and formulate future policies. Currently, Transformer and MLP are the two paradigms for deep time-series forecasting, and the former is more prevalent by virtue of its attention mechanism and encoder-decoder architecture. However, data scientists seem more willing to dive into research on the encoder, leaving the decoder neglected. Some researchers even adopt linear projections in lieu of the decoder to reduce complexity. We argue that both extracting the features of the input sequence and seeking the relations between the input and prediction sequences, which are the respective functions of the encoder and decoder, are of paramount significance. Motivated by the success of FPN in the CV field, we propose FPPformer, which utilizes bottom-up and top-down architectures in the encoder and decoder, respectively, to build a full and rational hierarchy. The cutting-edge patch-wise attention is exploited and further developed by combining it with a revamped element-wise attention, in formats that differ between encoder and decoder. Extensive experiments with six state-of-the-art baselines on twelve benchmarks verify the promising performance of FPPformer and the importance of elaborately devising the decoder in time-series forecasting Transformers. The source code is released at https://github.com/OrigamiSL/FPPformer.


I. INTRODUCTION
A. Background
The advent of the Big Data era has brought an immense volume and variety of data in the 21st century, especially in Internet of Things (IoT) systems with tons of sensors [1]. Consequently, long-term time-series forecasting methods with demanding accuracy and efficiency are needed to assist decision makers and engineers in appraising sensor statuses and making future plans. Since traditional forecasting methods based on statistics [2], [3] are no longer sufficient for such sophisticated situations, more and more data scientists are turning their attention to deep time-series forecasting [4]. After decades of development and competition, Time-Series Forecasting MLP (TSFM) [5]-[7] and Time-Series Forecasting Transformer (TSFT) [8]-[11] have become the mainstream.

B. Problems
TSFM and TSFT have different pros and cons. TSFM is known for its parsimonious but efficient architecture, so forecasting models based on TSFM excel in resisting the non-stationarity brought by distribution shifts [12] and concept drifts [13]. Conversely, forecasting models based on TSFT have a more complicated architecture and a better capability of capturing long-term dependencies of time series, at the expense of being more vulnerable to the over-fitting problem caused by non-stationarity [5]. Fortunately, pioneers have striven to get around plenty of the problems of TSFT. The direct forecasting strategy [14] reduces the time complexity and alleviates the error accumulation problem [15]. RevIN [12] addresses the problem of distribution shifts among windows with distinct time spans. The channel-independent [16] forecasting method keeps TSFT from extracting vague inter-relationships among different variables. The patch-wise attention mechanism [10], [16] further reduces the space complexity and brings the capability of local feature extraction to TSFT. Indeed, recent works have shown that TSFT models [9], [17] can also be stable and robust in forecasting. Evidently, the majority of these enhancements focus on improving the encoder architecture and handling input sequence features. These are undeniably important, but they are not the only things that matter. The connections between the input and prediction sequences, manifested by the decoder in TSFT, are also of paramount significance, especially for pursuing precise forecasting in IoT. However, the significance of the decoder is frequently overlooked and the decoder itself is inadequately explored. Normally, the decoder architectures of existing TSFT models are simply duplicates of their encoder architectures, apart from a few indispensable modifications such as changing self-attention to cross-attention [8], [15]. Furthermore, some researchers have gone so far as to substitute the decoder in TSFT with a simple linear projection [16], [18], which is analogous to TSFM, for the sake of efficiency. Now it is time to enhance the decoder of TSFT to fully develop its potential and push its forecasting performance to a new level.

Manuscript received xxxx; revised xxxx. (Corresponding author: Li Shen.) Li Shen, Yuning Wei, Yangzhu Wang and Huaxin Qiu are with Beihang University, Beijing, China (email: shenli@buaa.edu.cn; yuning@buaa.edu.cn; wangyangzhu@buaa.edu.cn; qiuhuaxin@buaa.edu.cn).

C. Contributions
Different from existing TSFT models, we Fully develop the tried-and-tested Patch-wise attention mechanism and Pyramid architecture in both encoder and decoder, and thereby propose FPPformer. Like the FPN [19] and PAN [20] architectures that are prevalent in the CV field, FPPformer hierarchically extracts input sequence features from fine to coarse and constructs the prediction sequence from coarse to fine. To strengthen the feature extraction capability of patch-wise attention, we further insert an element-wise attention block into each patch-wise attention block to extract the fine-grained inner-relations of each patch in the encoder and decoder with merely linear complexity.
A channel-independent and temporal-independent embedding method is utilized to modify the size of feature maps in FPPformer to meet the needs of element-wise attention and patch-wise attention. Within each attention block in the encoder, the diagonal of the query-key matching matrix is masked to ensure the generality of the features extracted from input sequences. The primary contributions of this work are fivefold:
1) We propose a novel time-series forecasting Transformer, i.e., FPPformer, which, unusually among TSFTs, improves the decoder architecture of TSFT to break its fetters and unlock its potential.
2) We renovate the decoder architecture of TSFT and change it into a top-down architecture for the sake of rationally constructing the prediction sequence in a hierarchical manner.
3) Motivated by a pioneering anomaly detection method, we propose diagonal-masked self-attention to mitigate the negative impacts of outliers in input sequences.
4) We propose a new combination of element-wise attention and patch-wise attention to compensate for the weakness of conventional patch-wise attention in extracting the inner-features of each patch, with only additional linear complexity.
5) Extensive experiments under diverse settings validate that FPPformer is capable of reaching state-of-the-art accuracy and robustness on twelve benchmarks.

II. RELATED WORKS
The past few years have witnessed the development and success of deep-learning-based forecasting methods. Thanks to neural networks, long-term multivariate forecasting is no longer a dream, so that even IoT systems with plenty of sensors and explosive data can be predicted [1], [21], [22]. Researchers have developed deep forecasting methods built upon diverse networks, and the Transformer is a hot topic in the corresponding literature.
a) Time-Series Forecasting Transformer: Traditionally, the Time-Series Forecasting Transformer (TSFT) executes forecasting via an encoder-decoder architecture. The Transformer encoder is used to extract the features of the input sequence. The Transformer decoder then constructs the prediction sequence from the features extracted by the encoder and from the prediction sequence itself, which is initialized with a certain value since it is unknown at the beginning. These two processes are completed predominantly by the attention mechanism, so researchers always keep an eye on it. LogSparse Transformer [23] and Informer [15] discover the sparsity of the query-key matching matrix and force the elements of the query to attend to only part of the elements of the key for the sake of reducing complexity. Autoformer [24], FEDformer [25] and ETSformer [26] combine TSFT with seasonal-trend decomposition and signal processing methods, e.g., Fourier analysis, in the attention mechanism to enhance interpretability. Patch-wise attention has recently become more popular and has proven more useful. TSFTs with patch-wise attention, including Triformer [10], Crossformer [8] and PatchTST [16], achieve more promising performance than preceding models. However, every TSFT emphasizes that its modified architecture is intended for more efficient or effective feature extraction from the input sequence. Statements about the benefits of the decoder can hardly ever be found. Indeed, their decoders seem to play the role of requisite appendages in the entire Transformer architecture. Once some parts of the encoders are changed by the proposed methods, mirrored changes are made to the decoders. Some studies [16], [18] even abandon the decoder to circumvent these changes. In contrast, studying and figuring out the correct way of designing decoders in TSFT is exactly what this work sets out to do.
b) Other Miscellaneous Deep Forecasting Methods: Apart from TSFT, there are plenty of other types of deep forecasting methods. Forecasting methods based on RNN and CNN are feasible ones. Their respective representatives, LSTNet [27] and SCINet [28], both achieved strong performance in their time. However, compared with these two types of deep forecasting methods, Time-Series Forecasting MLP (TSFM) receives relatively more attention. These forecasting networks consist solely of linear projection layers, yet they still achieve promising performance. Owing to their simple architectures, it is convenient to combine them with statistical models to improve their interpretability and forecasting capability. NBEATS [29] and DLinear [5] adopt seasonal-trend decomposition methods in their networks more concisely than FEDformer [25] but achieve better results in general. C. Challu et al. [7] further presented N-HiTS, which employs sampling and interpolation strategies on the basis of NBEATS for more precise and hierarchical prediction. T. Zhou et al. [6] take into account a reconstruction method motivated by Legendre polynomials to come up with FiLM. TSMixer, proposed by V. Ekambaram et al. [30], considers temporal patterns, cross-variate information and additional auxiliary information to render TSFM ready for more complicated forecasting cases. They are challenging competitors for TSFT, and we chiefly compare FPPformer with other TSFTs and these TSFMs in the forthcoming experiments.

III. PRELIMINARY
a) Problem Statement: This work primarily concentrates on the multivariate forecasting problem. As the term suggests, a multivariate forecasting problem is to predict a certain window $\{x_{t_2:t_3}\}^{1:V}$ with time duration $(t_3 - t_2)$ and $V$ variables from its anterior window $\{x_{t_1:t_2}\}^{1:V}$. Each $x_t^v \in \mathbb{R}$, where $t \in [t_1 : t_3)$ and $v \in [1, V]$, denotes an element at timestamp $t$ stemming from variable $v$. There are quite a few nomenclature styles for naming the dimensions of $t$ and $v$. In this work, the dimension of $t$ is termed the temporal dimension and the dimension of $v$ is termed the variable dimension. Note that the word 'channel-independent' mentioned above refers to independence along the variable dimension. Moreover, we still discuss several univariate forecasting cases where $V = 1$ in our experiments, since the variable treatment strategies of different forecasting methods can be very distinctive, making multivariate forecasting comparison, albeit prevalent, not persuasive enough.
b) Vanilla TSFT Architecture: The architecture of the vanilla TSFT is chiefly composed of an encoder, a decoder and a projection layer. An example of a vanilla TSFT with a 2-stage encoder and a 2-stage decoder is sketched in Fig. 1. The input sequence $x_{in}$ passes through the encoder embedding layer, then the superimposition of the embedded input sequence $X_{in}$ and its position embedding $P_{enc}$, which covers the input time span, is sent to the encoder. The encoder processes $X_{in} + P_{enc}$ with $M$ ($M = 2$ in Fig. 1) stages, each consisting of a self-attention block and a feed forward layer, and the ultimate output feature map of the encoder is $X_{enc}$. Typically, the canonical self-attention conducts scaled dot-products with the formula of (1):
$$\mathrm{Attention}(q, k, v) = \mathrm{Softmax}\left(\frac{q k^{\top}}{\sqrt{D}}\right) v \tag{1}$$
where $q, k, v \in \mathbb{R}^{L \times D}$ are the linear projections of the identical sequence tensor, $L$ is the number of tokens (or sequence tensor length) and $D$ is the hidden dimension. Readers can refer to [31] for details of the attention mechanism and feed forward layer. Correspondingly, the decoder receives both $X_{enc}$ and the zero-initialized prediction sequence $\mathbf{0}_{pred}$. $\mathbf{0}_{pred}$ propagates through the decoder embedding layer to obtain $X_{pred}$, and its position embedding $P_{dec}$, which covers the prediction time span, is superimposed on it. Afterwards, they are sequentially sent into $N$ ($N = 2$ in Fig. 1) stages, each consisting of a masked self-attention block, a cross-attention block and a feed forward layer. Causality is essential as the prediction sequence is unknown, so masked self-attention, rather than normal self-attention, is utilized in the decoder. The cross-attention block is intended to construct the prediction sequence via the encoder feature map $X_{enc}$. Eventually, a projection layer maps the output feature map of the decoder $X_{dec}$ to the prediction sequence $x_{pred}$.
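As an illustration of the scaled dot-product in (1) and of the causal masking used in the vanilla decoder, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a single head and, for brevity, uses the same toy tensor as `q`, `k` and `v` instead of learned linear projections; the function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    """Attention (1) for q, k, v of shape (L, D).

    causal=True masks the upper triangular part of the query-key
    matching matrix so that a token cannot attend to later tokens.
    """
    L, D = q.shape
    scores = q @ k.T / np.sqrt(D)              # (L, L) query-key matching matrix
    if causal:
        future = np.triu(np.ones((L, L), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v                 # (L, D)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                    # toy sequence: L = 8 tokens, D = 4
out = scaled_dot_product_attention(x, x, x)
out_causal = scaled_dot_product_attention(x, x, x, causal=True)
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector.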
c) Employed Mechanisms: Apart from our proposed methods, which will be introduced in the upcoming section, we also employ several advanced time-series forecasting mechanisms in FPPformer:
1) The direct forecasting method [14], which is widely employed by recent deep forecasting methods, performs the prediction of the entire sequence with only one forward process to alleviate error accumulation.
2) The channel-independent forecasting method, which has been mentioned in the foregoing sections, treats the sequences of different variables as different instances. The sequences of different variables are sent into the network in parallel without interfering with each other, so that the network can seek shared characteristics of different variable sequences without imposing any inductive bias on the correlations of different variables.
3) RevIN [12], which is devised for non-stationary time-series forecasting, normalizes each input sequence with its own statistics before it is sent into the network and restores the original statistics to the prediction sequence via reverse instance normalization, to handle the distribution shifts of real-world long time series.
4) Patch-wise attention [16], which segments the sequence into patches of the same length, treats each single patch, rather than each single element, as a token in (1) and treats the elements inside each patch, or their latent representations, as the hidden features of each token (patch) for better efficiency and generality.
Note that the majority of recent deep forecasting models [5]-[8], [10], [16], [28], [30], [32], including those employed in our experiments, adopt at least two of the above mechanisms, so these mechanisms are not what distinguishes our method from others.
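The data flow implied by mechanisms 2) to 4) can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the learnable affine parameters of RevIN are omitted, the input length is assumed to be divisible by the patch size, and all function names and the toy shapes (batch of 2, length 96, 7 variables) are hypothetical.

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Normalize each instance and variable with its own statistics over time.

    x has shape (batch, length, variables). The learnable affine terms
    of RevIN are left out of this sketch.
    """
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, (mean, std)

def revin_denormalize(y, stats):
    """Restore the stored statistics (reverse instance normalization)."""
    mean, std = stats
    return y * std + mean

def to_channel_independent(x):
    """(batch, length, variables) -> (batch * variables, length)."""
    b, l, v = x.shape
    return x.transpose(0, 2, 1).reshape(b * v, l)

def to_patches(x, patch_size):
    """(instances, length) -> (instances, num_patches, patch_size)."""
    n, l = x.shape
    assert l % patch_size == 0, "length must be divisible by patch size"
    return x.reshape(n, l // patch_size, patch_size)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 96, 7))
x_norm, stats = revin_normalize(x)
instances = to_channel_independent(x_norm)     # each variable becomes an instance
patches = to_patches(instances, patch_size=6)  # each patch becomes a token
restored = revin_denormalize(x_norm, stats)    # round trip back to the input scale
```

In a real model the de-normalization would be applied to the predicted sequence rather than to the normalized input; the round trip here only shows that the two steps are inverses.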

A. Analysis of Decoder in TSFT
Before commencing the introduction to our proposed FPPformer, we point out some deficiencies of current decoder architecture in TSFT to clarify the necessity and rationality of the decoder improvement in FPPformer.
We first identify the redundant self-attention problem in the decoder. To elaborate, the input to the decoder in Fig. 1 is a zero-initialized prediction sequence $\mathbf{0}_{pred}$, owing to the unknown future. Consequently, the first (masked) self-attention is performed only on the position embedding of the prediction sequence. Whether it is fixed [15] or learnable [8], it is completely independent of the input sequence. Keep in mind that time-series forecasting is an auto-regressive problem. It makes no sense to perform (masked) self-attention only on the position embedding in an attempt to deduce some groundless relations. Moreover, the position embedding is always static after training while the input sequence can be dynamic and non-stationary, shattering the last hope that the position embedding can fit the statistics of time-series sequences under some sort of homogeneity assumption [33], [34]. A start token [15] can be a solution; however, a short start token is not enough for long-term forecasting and a longer start token brings about excessive complexity. Besides, the connection between encoder and decoder is unitary, leading to the multi-scale insufficiency problem. As shown in Fig. 1, the encoder with $M$ stages can produce $M$ feature maps of the input sequence, but merely the last one is sent to the decoder. This problem is more noteworthy for modified TSFTs with a hierarchical architecture in the encoder. For instance, Informer [15] employs convolution layers between every two stages in the encoder and FEDformer [25] keeps decomposing input sequences to acquire more precise seasonal features, but neither of them applies the same operations to the decoder, and only the feature map of the last encoder stage is sent to the decoder. Crossformer [8] merges adjacent segments to obtain bigger patches in deeper stages in both encoder and decoder. However, as we claim in the foregoing sections, the architecture of the decoder in Crossformer is merely a replica of the encoder with an additional cross-attention. By merging patches from small to big in the decoder, Crossformer attempts to construct the unknown prediction sequence from fine to coarse, whose irrationality is self-evident.

B. Model Architecture
The overview of our proposed FPPformer is illustrated in Fig. 2, and its major enhancements over the vanilla TSFT concentrate on addressing the preceding two problems of the decoder. Comparing the schematics in Fig. 1 and Fig. 2, the differences in the overall architecture can be readily noticed. To handle the first problem, the redundant self-attention in the decoder, we swap the order of the self-attention and cross-attention in the decoder. Thereby, before any relations within the unknown prediction sequence are deduced, the prediction sequence receives the auto-regressive parts from the deepest encoder feature map. This serves as a better initialization of the prediction sequence before the first self-attention in the decoder than simple zero-initialization with a start token [15], randomly generated parameters [8], the trend decomposition of the raw input sequence [25], and so forth. The initialization formats of these other TSFTs are either relatively simple or inefficient.
We employ hierarchical pyramids in both encoder and decoder, with lateral connections, to tackle the second problem, multi-scale insufficiency. As we adopt patch-wise attention in FPPformer, the patches are merged before being sent to the next stage in the bottom-up architecture of the encoder, and the opposite operation, i.e., splitting, is performed in the top-down architecture of the decoder. The feature map of the input sequence gets deeper and more coarse-grained in later stages, which is also a property shared by the encoders of many TSFTs. However, things are different when we attempt to construct the prediction sequence from the position embedding and encoder feature maps. Recall how we decompose and reconstruct an arbitrary sequence from a certain multiresolution analysis $\{V_j\}_{j \in \mathbb{Z}}$ of $L^2(\mathbb{R})$ and wavelet spaces $\{W_j\}_{j \in \mathbb{Z}}$ in wavelet theory [35], which holds a prominent position in signal processing. When we decompose a certain sequence $f_{i+1}(t) \in V_{j+1}$, we decompose it into the coarser spaces $V_j$ and $W_j$. The reconstruction is the opposite, i.e., we recover the sequence in the finer space, $f_{i+2}(t) \in V_{j+2}$, from $V_{j+1}$ and $W_{j+1}$. Omitting the wavelet space $W_j$, which contains the information of details or noise, we find that the encoder and decoder processes correspond respectively to the decomposition and (re)construction processes in multiresolution analysis. From another perspective, the unknown prediction sequence is initialized with zeros or other parameters that do not pertain to the ground truth. Thereby, when we strive to construct it from input sequence features, it is natural to commence with the most universal features to ensure the exactitude of the general characteristics of the prediction sequence, and then prudently take steps to seek its finer features to avoid over-fitting. The success of FPN [19] in the CV field, which also employs bottom-up and top-down architectures, further supports this idea. Therefore, we keep splitting the patches in the decoder and commence each level of the hierarchical prediction sequence construction with the encoder feature map of identical resolution, via lateral connections. The encoder in FPPformer thus presents a bottom-up architecture while the decoder presents a top-down architecture. This differs from the hierarchical design in Crossformer [8], whose decoder architecture is merely a replication of its encoder. M. A. Shabani et al. [9] also notice something analogous, but they neither expound the reasons for it nor make any change to the decoder.
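The bottom-up and top-down resolution schedule can be traced with a small shape-level sketch. It is only an illustration under assumptions not stated in the text: merging is modeled as concatenating two adjacent patches, splitting as halving each patch, and the toy setting uses a length-96 sequence, an initial patch size of 6 and three stages. The learned layers and attention blocks of each stage are omitted, and the helper names are hypothetical.

```python
import numpy as np

def merge_patches(x):
    """(num_patches, patch_size) -> (num_patches // 2, 2 * patch_size)."""
    n, p = x.shape
    assert n % 2 == 0, "need an even number of patches to merge pairs"
    return x.reshape(n // 2, 2 * p)

def split_patches(x):
    """(num_patches, patch_size) -> (2 * num_patches, patch_size // 2)."""
    n, p = x.shape
    assert p % 2 == 0, "need an even patch size to split in half"
    return x.reshape(2 * n, p // 2)

stages = 3
seq = np.arange(96.0)

# Encoder: fine to coarse (bottom-up).
x = seq.reshape(16, 6)
enc_shapes = []
for stage in range(stages):
    enc_shapes.append(x.shape)
    if stage < stages - 1:
        x = merge_patches(x)

# Decoder: coarse to fine (top-down), starting at the coarsest resolution.
y = np.zeros_like(x)
dec_shapes = []
for stage in range(stages):
    dec_shapes.append(y.shape)
    if stage < stages - 1:
        y = split_patches(y)

# Lateral connections pair encoder and decoder stages of equal resolution.
lateral_pairs = list(zip(enc_shapes[::-1], dec_shapes))
```

In this toy trace each decoder stage meets the encoder feature map with the same number of patches and the same patch size, which is the pairing the lateral connections rely on.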

C. Combined Element-wise Attention and Patch-wise Attentions
The preceding two problems are shared by the majority of TSFTs, and we would like to mention another problem specific to TSFTs with patch-wise attention. We name this type of Transformer PTSFT for brevity. Different from the element-wise attention of the vanilla TSFT, which seeks the correlations of sequence elements, patch-wise attention seeks the correlations of different patches or segments of sequences to improve efficiency and reduce the risk of over-fitting. PatchTST [16] and Crossformer [8] are such PTSFTs, and their experiments have shown the superiority of patch-wise attention. However, they neglect the inner-relations of the elements inside the patches or only employ simple linear projections to mix them up. Therefore, we make different changes to the patch-wise attention employed in the encoder and decoder, for the sake of extracting more fine-grained features in the encoder and pursuing finer prediction sequence construction in the decoder.
As shown in the schematic of Fig. 2, an element-wise attention block is arranged before each patch-wise block in every encoder stage to extract the inner-relationships of all patches before seeking their inter-relationships. This element-wise self-attention is patch-independent, so the additional complexity is $O(P^2 \times L/P) = O(L \times P)$, which is linear in the input sequence length. Here $P$ is the patch size and $L$ is the input sequence length. Element-wise attention requires the information of independent sequence elements to be preserved. Therefore, we cannot directly map the initially segmented patches into the latent space like other PTSFTs; otherwise the element-wise information is no longer preserved and element-wise attention cannot be implemented. To address this issue, we adopt a channel-independent and element-independent embedding method. As illustrated in Fig. 4 and Table I, the input sequence elements of different timestamps do not interfere with each other during the embedding process, and reshaping operations are performed to meet the different tensor shape requirements of element-wise attention and patch-wise attention.
The changes to the decoder's attention are analogous but not identical. Since the decoder already has two attention blocks, we maintain the patch-wise attention in the cross-attention block to ensure the general construction of the prediction sequence via the auto-regressive process. Simultaneously, the masked self-attention block of the vanilla TSFT is transformed into an element-wise self-attention block, which is also patch-independent, in FPPformer. As mentioned in the preceding sections, the prediction sequence is unknown, so we need to first guarantee the correctness of its general characteristics, which is manifested by placing patch-wise attention before element-wise attention in the decoder; then we can pursue the fine-grained features of prediction sequences without over-fitting. As the patch-wise cross-attention treats each patch as a unit, respecting causality within the prediction sequence, i.e., masking the upper triangular part of the query-key matching matrix, is superfluous in the decoder of FPPformer.
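The complexity claim and the reshaping between the two attention views can be checked with a short sketch. The embedding below is a hypothetical stand-in (a shared scalar-to-vector linear map applied to every element independently), and flattening a patch into a $P \times D$ hidden vector for the patch-wise view is likewise an assumption for illustration; the exact layout used by FPPformer is given in its Fig. 4 and Table I, which are not reproduced here.

```python
import numpy as np

L, P, D = 96, 6, 32                  # input length, patch size, embedding dimension
rng = np.random.default_rng(0)
x = rng.normal(size=L)               # one univariate (channel-independent) sequence

# Hypothetical element-independent embedding: every timestamp is mapped
# to D features by the same weights, so timestamps never mix.
w = rng.normal(size=D)
b = rng.normal(size=D)
emb = x[:, None] * w + b             # (L, D)

# Element-wise view: inside each patch, the P elements are the tokens.
elementwise_view = emb.reshape(L // P, P, D)        # (patches, P, D)
# Patch-wise view: each patch is one token with a flattened hidden vector.
patchwise_view = elementwise_view.reshape(L // P, P * D)

def score_entries(length, patch):
    """Number of query-key matching entries for each attention type."""
    num_patches = length // patch
    return {
        "full_elementwise": length * length,                        # O(L^2)
        "patch_independent_elementwise": num_patches * patch ** 2,  # O(L * P)
        "patchwise": num_patches ** 2,                              # O((L / P)^2)
    }

costs = score_entries(L, P)
```

For this setting the patch-independent element-wise attention needs `L * P` matching entries, compared with `L * L` for element-wise attention over the whole sequence.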

D. Diagonal-Masked (DM) Self-Attention
It is known that outliers always occur in real-world systems, especially in IoT systems with immense and diverse data. These anomalies sometimes exist in the form of small patches [36], so patch-wise attention cannot be immune to them. Compared with smoothing with filters [6], [7], which is natural but not flexible enough, it is better to devise mechanisms inside the network to circumvent the negative effects of outliers in the latent space. Representation learning is a fair answer [16] but needs heavy parameter tuning and is not very stable. Directly masking the input sequence [37] gives rise to another problem analogous to the preceding position embedding problem in the decoder, since fixed parameters cannot sufficiently represent dynamic sequence features. Inspired by [38], we mask the diagonal of the query-key matching matrix of both element-wise and patch-wise self-attention blocks in the encoder, as sketched in Fig. 5. Thereby, any element (patch) during the attention can only be expressed by the values of the remaining elements (patches). Those elements or patches whose characteristics conform to the general ones are scarcely affected, but outliers cannot be expressed by normal elements (patches); hence their values are restored to approach the general level and their negative effects are mitigated.

Fig. 5. An example of query-key matching matrices in diagonal-masked element-wise self-attention (the 3rd patch) and diagonal-masked patch-wise self-attention. The total sequence length is 48 and the sequence is divided into 4 patches, each with a length of 12, in this example. The white hollow boxes denote the normal unchanged matrix elements, while the black solid boxes, i.e., the matrix elements on the diagonal, denote the masked matrix elements.
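The diagonal masking can be illustrated with a minimal NumPy sketch, which is not the authors' implementation. It assumes a single head, uses the raw tokens as queries, keys and values without learned projections, and needs at least two tokens; the toy shapes follow the patch-wise example of Fig. 5 (4 patches of length 12), and the injected outlier is hypothetical.

```python
import numpy as np

def diagonal_masked_attention(q, k, v):
    """Self-attention in which no token may attend to itself.

    q, k, v have shape (L, D) with L >= 2. The diagonal of the query-key
    matching matrix is set to -inf before the softmax, so each output row
    is a convex combination of the other tokens' values.
    """
    L, D = q.shape
    scores = q @ k.T / np.sqrt(D)
    scores = np.where(np.eye(L, dtype=bool), -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 12))    # 4 patches, each of length 12
tokens[2] += 50.0                    # turn the 3rd patch into an outlier
out, weights = diagonal_masked_attention(tokens, tokens, tokens)
```

Because the outlier patch cannot use its own value, its output is built only from the normal patches and therefore stays within their value range. The outlier still acts as a key and value for the other patches in this sketch.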

E. Projection
The prediction sequence is acquired by summing the linear projections of the encoder output and the decoder output. The first linear projection is supposed to represent the linear correlations between the input and prediction sequences, while the second linear projection, together with the entire decoder, is supposed to represent the non-linear correlations. The loss function (2) is the sum of the MSE function (3) and the MAE function (4), following [39], [40].
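A minimal sketch of the two-branch projection and the combined loss is given below. The standard mean-reduced definitions of MSE and MAE are assumed, and the feature shapes, weight matrices and function names are hypothetical placeholders rather than the layer sizes used in FPPformer.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))

def mae(pred, target):
    """Mean absolute error over all elements."""
    return float(np.mean(np.abs(pred - target)))

def combined_loss(pred, target):
    """Sum of MSE and MAE, as described for the loss function."""
    return mse(pred, target) + mae(pred, target)

def project(enc_out, dec_out, w_enc, w_dec):
    """Prediction as the sum of two linear projections.

    enc_out: (L_in,) flattened encoder output, w_enc: (L_in, L_pred)
    dec_out: (L_dec,) flattened decoder output, w_dec: (L_dec, L_pred)
    """
    return enc_out @ w_enc + dec_out @ w_dec

rng = np.random.default_rng(0)
enc_out, dec_out = rng.normal(size=8), rng.normal(size=6)
w_enc, w_dec = rng.normal(size=(8, 4)), rng.normal(size=(6, 4))
x_pred = project(enc_out, dec_out, w_enc, w_dec)

# Small worked example: one of four points is off by 4.
toy_loss = combined_loss(np.array([1.0, 2.0, 3.0, 4.0]),
                         np.array([1.0, 2.0, 3.0, 8.0]))
```

In the worked example the squared error averages to 4 and the absolute error averages to 1, so the combined loss is 5.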
V. EXPERIMENTS
We attempt to answer three questions via the experiments on FPPformer:
1) Can FPPformer outperform current state-of-the-art TSFTs and TSFMs on commonly used benchmarks with both short and long input sequence lengths (Section V-C)?
2) Are the unique mechanisms proposed in FPPformer actually effective or useful (Section V-D), and how sensitive are they to their parameters (Section V-E)?
3) Why does FPPformer perform better or worse than other baselines? Can we figure it out via visualization (Section V-G)?

A. Baselines and Datasets
To unveil the empirical forecasting capability of FPPformer, we perform multivariate forecasting experiments on eight benchmarks from four types of IoT systems, including electricity consumption (ETTh$_1$, ETTh$_2$, ETTm$_1$, ETTm$_2$ [15], ECL [41]), traffic flow (Traffic [42]), meteorological conditions (Weather [43]) and solar power production (Solar [44]). Their numerical details are presented in Table II. Six current state-of-the-art forecasting baselines, including four TSFTs (Triformer [10], Crossformer [8], Scaleformer [9], PatchTST [16]) and two TSFMs (FiLM [6] and TSMixer [30]), are employed for comparison with FPPformer. It is worth mentioning that they are all strong forecasting methods proposed in the recent two years. Notably, apart from Scaleformer, the other three of the four TSFTs are PTSFTs, so FPPformer does not have an edge in pure attention mechanism design. Furthermore, these six forecasting baselines have different variable treatment strategies, e.g., some of them are channel-independent while others are not, which means that multivariate forecasting results cannot fully represent their forecasting capabilities. Therefore, we additionally perform univariate forecasting experiments on the M4 dataset [45], which is a competition dataset suited to univariate forecasting, rather than deliberately choosing a variate within the above multivariate forecasting datasets to perform univariate forecasting experiments like many other studies [6], [8]-[10]. Its details are elaborated in Table III.

B. Implementation Details
We would like to present a persuasive and fair comparison of FPPformer and the other baselines; therefore, we set the hyper-parameters of FPPformer identical to the commonly used ones. The input sequence lengths differ between sub-experiments but are kept identical for all baselines. The number of stages is 3 in both the encoder and decoder of FPPformer, and the size of the initial segmented patch is 6, in accordance with Crossformer [8]. The embedding dimension $D$ (in Fig. 4) is 32. As for the hyper-parameters of the training process, FPPformer is trained via an Adam optimizer with a learning rate of 1e-4, which decays by half per epoch, for ten epochs in total with a patience of one. The batch size is 16 and the dropout rate is 0.1. These are all commonly employed settings. All experiments, which are conducted on a single NVIDIA GeForce RTX 3090 24GB GPU, are repeated five times with random seeds and the average results are presented. The source code is implemented with Python 3.8.8 and PyTorch 1.11.0 and is available at https://github.com/OrigamiSL/FPPformer. Correspondingly, the other baselines used in this work also employ only fixed hyper-parameters and settings, which are chosen with reference to their default ones. For those with multiple choices and versions, we choose the one with the best general performance. How we use the other baselines in the experiments can be found in our GitHub repository. The best results in each table are highlighted in bold and italic and the second best are highlighted with underline and italic, except for a special table in Section V-D.
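The training schedule can be made concrete with a short, framework-free sketch. It rests on two assumptions about details the text leaves open: the halving is applied from the second epoch onward, and "patience of one" means training stops at the first epoch whose validation loss fails to improve on the best one so far. The function names and the toy validation losses are hypothetical.

```python
def learning_rates(initial_lr=1e-4, epochs=10, decay=0.5):
    """Learning rate used in each epoch when it is halved every epoch."""
    return [initial_lr * decay ** epoch for epoch in range(epochs)]

def stop_epoch(val_losses, patience=1):
    """Index of the epoch at which early stopping triggers.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss; otherwise it runs through all epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1

schedule = learning_rates()
stopped_at = stop_epoch([0.50, 0.40, 0.45, 0.30])   # toy validation losses
```

With these toy losses, training stops at the third epoch (index 2), because the loss rises from 0.40 to 0.45 and the patience is one.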

C. Quantitative Results
We commence with the multivariate forecasting experiments, whose results are shown in Table IV and Table V. On many real-world occasions, training samples are limited, so a long input sequence is not always available for deep forecasting methods that need it for satisfactory performance. Therefore, we first measure the performance of FPPformer and its six competitors on eight multivariate benchmarks with an input sequence length of 96, a setting that traces back to the well-known Autoformer [24]; the results are shown in Table IV. The prediction lengths are the commonly agreed-upon {96, 192, 336, 720}. Then we evaluate the same seven models on the same datasets using longer input sequence lengths within {192, 384, 576}, whose results are shown in Table V. MSE (3) and MAE (4) are utilized as the evaluation metrics. Only the average results of the eight benchmarks with a prediction length of 720 are given in Table V to avoid tedious data stacking. Full results are given in our released repository provided in Section V-B. '-' indicates that a model runs out of memory (24GB) even when the batch size is set to 1.
As Tables IV and V show, FPPformer outperforms the other baselines in most situations with both short and long input sequence lengths. When the input sequence length is set to 96, FPPformer obtains 31.7%/60.0%/10.5%/6.4%/12.8%/14.7% MSE reduction compared with Triformer/Crossformer/Scaleformer/PatchTST/FiLM/TSMixer, which illustrates the strong forecasting capability of FPPformer with a short input sequence. Though FPPformer does not achieve superior performance on the Solar dataset in this setting, it regains its leading position with longer input lengths on the same dataset (concrete results are available in the GitHub repository provided in Section V-B). Furthermore, if FPPformer is also equipped with cross-variable attention like Crossformer, meaning that the cross-variable attention module proposed by Crossformer is arranged at the end of each stage of the encoder and decoder in FPPformer, the modified FPPformer, denoted by FPPformer-Cross in Table VI, is capable of completely outperforming Crossformer and iTransformer [46], another state-of-the-art model employing cross-variable attention, on the Solar dataset.
Then we compare the univariate forecasting capability of FPPformer with the other six baselines on M4. We omit the first two subsets, with yearly and quarterly sampling frequencies, since many of their instances are too short, and only perform experiments on the remaining four subsets. The prediction lengths, which are regulated by [45], are {18, 13, 14, 48} for {M4-Monthly, M4-Weekly, M4-Daily, M4-Hourly}. The corresponding input sequence lengths are {72, 65, 84, 336} after consulting [29]. We slightly change these four input sequence lengths to {72, 72, 96, 384} so that they fit the patch-wise attention in FPPformer. The M4-specific metrics SMAPE (5) and OWA (7) are used for measurement. m refers to the periodicity of the series and naïve2 refers to the results of a seasonally-adjusted forecast model by [45], used for scaling in OWA.
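The M4 metrics can be sketched as follows. The snippet assumes the commonly used M4 competition definitions, in which OWA averages the SMAPE and MASE of a model after each is divided by the corresponding naïve2 value; the function names and toy numbers are illustrative assumptions rather than code from the released repository.

```python
from typing import Sequence


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Symmetric MAPE in percent; a term with a zero denominator counts as 0."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        total += abs(t - p) / denom if denom > 0 else 0.0
    return 200.0 * total / len(y_true)


def mase(insample: Sequence[float], y_true: Sequence[float],
         y_pred: Sequence[float], m: int) -> float:
    """Forecast MAE scaled by the in-sample seasonal-naive MAE with period m."""
    naive_errors = [abs(insample[i] - insample[i - m])
                    for i in range(m, len(insample))]
    scale = sum(naive_errors) / len(naive_errors)
    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return forecast_mae / scale


def owa(smape_model: float, mase_model: float,
        smape_naive2: float, mase_naive2: float) -> float:
    """Overall weighted average relative to the naive2 reference forecast."""
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)


# Toy series; the values are hypothetical.
history = [1.0, 2.0, 3.0, 4.0, 5.0]
print(smape([100.0, 100.0], [110.0, 90.0]))
print(mase(history, [6.0, 7.0], [5.0, 9.0], m=1))
```

An OWA below 1 then indicates that a model is better than naïve2 on average across the two relative errors.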

D. Ablation Study
We conduct ablation studies to validate the functions of the architecture of FPPformer and its components. All variants are evaluated on the multivariate benchmarks with a prediction sequence length of 720. The results on all eight benchmarks are presented in Table VIII. Each value is the average of four sub-experiment results with input sequence lengths of {96, 192, 384, 576}. As expected, only using point-wise attention like the canonical TSFT gives rise to a loss increase of 84.6% (Average loss: 0.345→0.637). Meanwhile, only using patch-wise attention like other PTSFTs does not suffer such a severe degradation but still performs clearly worse than our proposed combination of patch-wise and point-wise attention (Average loss: 0.380 vs. 0.345). As for the decoder architecture design, FPPformer surpasses the variants that replace its decoder architecture with that of Crossformer (Average loss: 0.345 vs. 0.389) or substitute the entire decoder with a linear projection like PatchTST (Average loss: 0.345 vs. 0.376). Besides, removing the DM mechanisms in the self-attention blocks of the encoder also results in worse performance (Average loss: 0.345→0.379). In conclusion, the effectiveness and necessity of all unique parts and architectures of FPPformer are verified.
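The relative loss increases quoted above follow from the average losses by simple arithmetic, which the short sketch below reproduces. The average losses are those reported in this section; the variant labels and the helper name are ours.

```python
def relative_increase(baseline: float, variant: float) -> float:
    """Relative increase of a variant's loss over the baseline loss."""
    if baseline <= 0:
        raise ValueError("baseline loss must be positive")
    return (variant - baseline) / baseline


FULL_MODEL_LOSS = 0.345  # average loss of the full FPPformer

# Average losses of the ablated variants quoted in the text.
variant_losses = {
    "point-wise attention only": 0.637,
    "patch-wise attention only": 0.380,
    "Crossformer-style decoder": 0.389,
    "linear projection as decoder": 0.376,
    "without DM in encoder": 0.379,
}

for name, loss in variant_losses.items():
    pct = 100.0 * relative_increase(FULL_MODEL_LOSS, loss)
    print(f"{name}: {pct:.1f}% higher loss")
```

The first entry gives the 84.6% figure stated in the text; the other percentages are derived here only for comparison and are not reported in the paper.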

E. Parameter Sensitivity Analysis
It is well known that a TSFT cannot own too many layers or stages; otherwise, the risk of over-fitting rises substantially. Therefore, we perform a parameter analysis on the stage number of FPPformer in this section to check whether FPPformer is capable of handling this problem. The patch size is not analyzed here as it has been well studied by [8], [10], [16]. The input sequence length of FPPformer is chosen as 576 so that more stages can be used, while the other models use their default input sequence lengths, which are supposed to be best for them (96 for {Scaleformer, Triformer}; 512 for PatchTST; 336 for Crossformer). The number of stages is chosen in {1, 2, 3, 4}. The prediction sequence length is set as 720 and the average results (MSEs) of all eight multivariate benchmarks are presented in Table IX. The result of each model with a stage number of one is utilized as the normalization factor to further measure its absolute performance deviations with more stages, as manifested by (8). M_i (i = 1, 2, 3, 4) denote the original average MSE results and (8) yields their normalized counterparts. When comparing the performances and performance deviations of the five TSFTs, it is evident that FPPformer not only keeps its leading position (smaller errors) among all TSFTs with different stage numbers, but also maintains its robustness advantage (smaller deviations) over the other TSFTs.
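The normalization can be illustrated with a small sketch. Since (8) is not reproduced in this section, the snippet assumes one plausible reading: each M_i is divided by M_1, and the deviation is the absolute distance of the normalized value from 1. Both this reading and the input numbers are hypothetical and are not results from Table IX.

```python
from typing import List, Sequence


def normalize_by_first_stage(mse_by_stage: Sequence[float]) -> List[float]:
    """Divide every M_i by M_1 (assumed form of the normalization)."""
    base = mse_by_stage[0]
    if base <= 0:
        raise ValueError("M_1 must be positive")
    return [m / base for m in mse_by_stage]


def max_abs_deviation(normalized: Sequence[float]) -> float:
    """Largest absolute deviation of the normalized results from 1."""
    return max(abs(v - 1.0) for v in normalized)


# Hypothetical average MSEs for stage numbers 1, 2, 3 and 4.
example_mse = [0.40, 0.42, 0.38, 0.44]
normalized = normalize_by_first_stage(example_mse)
print(normalized, max_abs_deviation(normalized))
```

Under this reading, a model whose normalized values stay close to 1 is insensitive to the number of stages, which is the robustness property discussed above.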

F. Complexity Analysis
We compare the training time per epoch, the GPU memory consumption and the inference time per instance on GPU of FPPformer and Crossformer [8], and the same measuring criteria of FPPformer without its decoder and PatchTST [16], during multivariate forecasting on the Solar dataset. The Solar dataset is selected because its number of variables is intermediate among all datasets. Crossformer and PatchTST are chosen as they are also patch-wise attention based models. The decoder of FPPformer is removed when compared with PatchTST since PatchTST does not employ a decoder architecture. The input sequence lengths are chosen within {96, 192, 384, 576} and the prediction sequence length is 96. The other hyper-parameters and settings are identical to those used in the quantitative multivariate results, barring the size of the hidden (embedding) dimension. The size of the hidden dimension, which affects the model complexity the most, is identical for all baselines in this experiment so that the model complexity is determined by the architecture design to the utmost extent. The results of these two experiments are shown in Fig. 6(a) and Fig. 6(b), respectively.
As Fig. 6(a) shows, the full FPPformer owns linear computation and space complexity with respect to the input sequence length. Moreover, Fig. 6(b) illustrates that the encoder of FPPformer (FPPformer-Enc for short) also owns only linear complexity and that the complexities of FPPformer-Enc and PatchTST are analogous, demonstrating that the element-wise attention, which is almost the only difference between PatchTST and FPPformer-Enc, brings merely minuscule additional complexity.
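One simple way to check such a linearity claim from measurements is to fit the slope of the cost against the input length on a log-log scale, where a slope near 1 indicates linear growth and a slope near 2 indicates quadratic growth. The sketch below uses synthetic cost values, not the measurements behind Fig. 6, and the helper name is ours.

```python
import math
from typing import Sequence


def scaling_exponent(lengths: Sequence[float], costs: Sequence[float]) -> float:
    """Least-squares slope of log(cost) against log(length)."""
    if len(lengths) != len(costs) or len(lengths) < 2:
        raise ValueError("need at least two (length, cost) pairs")
    xs = [math.log(v) for v in lengths]
    ys = [math.log(v) for v in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


input_lengths = [96, 192, 384, 576]
# Synthetic costs: one grows linearly with length, the other quadratically.
linear_cost = [0.01 * n for n in input_lengths]
quadratic_cost = [1e-4 * n * n for n in input_lengths]
print(scaling_exponent(input_lengths, linear_cost))
print(scaling_exponent(input_lengths, quadratic_cost))
```

With real measurements, constant overheads would pull the fitted slope below the asymptotic exponent for short inputs, so the estimate should be read as indicative only.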

G. Case Study
To illustrate the outstanding forecasting performance of FPPformer more vividly, we visualize several forecasting windows of FPPformer and other TSFTs from different datasets in Fig. 7. Benefitting from its better decoder and attention mechanism, FPPformer excels in capturing sequence features such as trends (Fig. 7(a) through Fig. 7(c)) so that smaller forecasting errors than those of the others can be obtained. Moreover, its advantages in robustness and immunity against outliers are revealed in Fig. 7(d)(e)(f), where certain distribution shifts occur in some local segments. In addition, we present some visualizations of the feature maps of FPPformer and several competitors in the latent space to validate the functions of its unique modules.
a. We visualize the attention score distributions of the first DM patch-wise and DM element-wise attention in FPPformer with a certain input window of length 96 in the ETTh1 dataset, via heat maps in Fig. 8. As illustrated in Fig. 8(b1), the attention scores are uniformly distributed if the DM patch-wise attention is applied during the training phase. Even substituting the DM patch-wise attention with the normal attention solely in the testing phase (Fig. 8(b2)) does not lead to self-matches with high scores, demonstrating the enhancement of the DM attention mechanism on universal feature extraction. However, it can be observed in Fig. 8(b3) that the highest attention score chiefly lies in the fifth patch, which corresponds to an outlier patch with an exorbitant dip. The visualization (Fig. 8(c  the third stage, indicating that the construction of the prediction sequence features heavily relies on the rear end of the input sequence features and fails to build up the prediction sequence in a universal manner. On the contrary, the cross-attention scores in the decoder are uniformly distributed along the temporal dimension, which implies the advantage of the top-down architecture in the FPPformer decoder. In effect, the instance in Fig. 10 is not a special case. We collect the positions of the highest attention scores in the third stage of Crossformer and the first stage of FPPformer, where the most coarse-grained features lie, over the whole ETTh1 dataset, i.e., over 100,000 instances. The result is shown in Fig. 11. Apparently, the distribution of the highest attention scores of FPPformer is much more uniform than that of Crossformer, indicating the success of the top-down decoder design in FPPformer.
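The statistic behind this kind of comparison can be sketched as follows: for each instance, the key position with the highest cross-attention score is recorded, and the resulting histogram is summarized by a uniformity measure. The normalized entropy used here is our own illustrative choice, and the toy score rows are hypothetical; neither is taken from the paper or its repository.

```python
import math
from typing import List, Sequence


def peak_position_histogram(score_rows: Sequence[Sequence[float]],
                            num_positions: int) -> List[int]:
    """Count how often each key position holds the highest attention score."""
    counts = [0] * num_positions
    for row in score_rows:
        if len(row) != num_positions:
            raise ValueError("every row must have num_positions scores")
        peak = max(range(num_positions), key=lambda j: row[j])
        counts[peak] += 1
    return counts


def normalized_entropy(counts: Sequence[int]) -> float:
    """Entropy of the histogram divided by its maximum; 1 means uniform."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        raise ValueError("need at least two bins and one observation")
    probs = [c / total for c in counts if c > 0]
    entropy = sum(-p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


# Hypothetical attention rows, one per instance, over three key positions.
rows = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
hist = peak_position_histogram(rows, num_positions=3)
print(hist, normalized_entropy(hist))
```

A histogram concentrated on the last positions, as described for Crossformer, would yield a value near 0, whereas an evenly spread histogram would yield a value near 1.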

VI. DISCUSSION
Though FPPformer has achieved state-of-the-art performances, it still has at least two limitations:
1. The hierarchy in FPPformer can be devised more exquisitely. The 'merging' operation in the encoder of FPPformer is too simple to well represent the feature map of the bigger patch via its two smaller constituent patches. The same holds for the 'splitting' operation in the decoder. Cutting-edge methods for combining or splitting patches in the CV field, where patch-wise attention is also prevailing, e.g., Swin-Transformer [48], can be learned, imitated and modified for the time-series forecasting Transformer.
2. Currently, outliers are tackled in FPPformer via DM self-attention, which coarsely masks the entire diagonal of the self-attention score matrix. Notice that outliers should be fewer than the normal segments of time-series sequences, which implies that the majority of masked patches are in fact normal and the masking can negatively influence the feature extraction of input sequences. We believe that applying an anomaly detection method to each input sequence before forecasting and then masking only the detected anomalous patches can be a better way of utilizing the DM self-attention.
Both limitations and their potential solutions will be our future research directions.
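The second limitation can be made concrete with a minimal sketch of diagonal masking applied to a matrix of raw attention scores. By default every diagonal entry is masked, matching the description of DM self-attention above; the optional per-row flag list is a hypothetical illustration of masking only detected anomalous patches and is not part of FPPformer. Query, key and value projections and score scaling are omitted.

```python
import math
from typing import List, Optional, Sequence


def diagonal_masked_softmax(scores: Sequence[Sequence[float]],
                            mask_row: Optional[Sequence[bool]] = None
                            ) -> List[List[float]]:
    """Row-wise softmax in which selected diagonal entries get zero weight."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two tokens when masking the diagonal")
    if mask_row is None:
        mask_row = [True] * n  # default: mask the entire diagonal
    weights = []
    for i, row in enumerate(scores):
        # A masked self-match is set to -inf so that its softmax weight is 0.
        logits = [-math.inf if (j == i and mask_row[i]) else s
                  for j, s in enumerate(row)]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy scores with dominant self-matches on the diagonal.
toy_scores = [[5.0, 1.0, 0.0], [1.0, 5.0, 1.0], [0.0, 1.0, 5.0]]
full = diagonal_masked_softmax(toy_scores)
# Hypothetical selective variant: only the second patch is flagged as anomalous.
selective = diagonal_masked_softmax(toy_scores, mask_row=[False, True, False])
print(full)
print(selective)
```

In the selective variant, unflagged patches keep their self-matches, so only the flagged patch is forced to be reconstructed from its neighbours.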

VII. CONCLUSION
In this work, we attempt to further develop the time-series forecasting Transformer from the perspective of the decoder. We scrutinize the existing decoder designs, point out their drawbacks and propose our solutions. The ultimate product, i.e., FPPformer, achieves state-of-the-art performances on multiple benchmarks, including multivariate and univariate ones, by leveraging the refined attention mechanism and the enhanced encoder-decoder architecture that we propose.

Fig. 1. A schematic of a vanilla TSFT with a two-stage encoder (green dashed box containing green solid boxes on the left) and a two-stage decoder (orange dashed box containing orange solid boxes on the right).

Fig. 2. An overview of FPPformer's hierarchical architecture with a two-stage encoder and a two-stage decoder. Different from the vanilla one in Fig. 1, the encoder owns a bottom-up structure while the decoder owns a top-down structure. Note that the direction of the propagation flow in the decoder is opposite to that in Fig. 1 to highlight the top-down structure. 'DM' in the stages of the encoder means 'Diagonal-Masked'.

Fig. 3. The comparison of the hierarchical architectures of FPPformer (a) and Crossformer (b). The discrepancies are highlighted in red. Obviously, the decoder structure of Crossformer is nearly a duplicate of the encoder's bottom-up structure, whereas the decoder of FPPformer owns a different 'top-down' structure.

Fig. 4. The changes in the size of a single input sequence when propagating through the first encoder stage. The batch size and the variable dimension are omitted. The red and blue letters in the last two sizes refer to the token dimension and its latent representation dimension, respectively. The reshaping operation treats the features of all elements in a single patch as a whole in order to connect the element-wise self-attention and the patch-wise attention.

Fig. 7. Forecasting windows of FPPformer and four other TSFTs from six datasets. The black line is the ground truth, the red line is the forecasting curve of FPPformer and the rest are the forecasting curves of the other TSFTs.

Fig. 8. The visualization of DM patch-wise (b) and element-wise (c) self-attention score distributions via heat maps. All figures are obtained by FPPformer or its variants with the identical input sequence (a). Specifically, the element-wise attention score heat maps stem from the fifth patch, which is supposed to be anomalous and is marked in red in (a).

Fig. 9. The visualization of the feature maps of the patches in different encoder stages (a) in PatchTST (b), Crossformer (c) and FPPformer (d) via t-SNE. The points with different colors denote different patches. Only the feature map of the last encoder stage of PatchTST is shown since it is not hierarchical.

Fig. 10. The visualization of the cross-attention score distributions of different decoder stages in Crossformer (a) and FPPformer (b) via heat maps.