Skip-RCNN: A Cost-Effective Multivariate Time Series Forecasting Model

Multivariate time series (MTS) forecasting is a crucial component of many classification and regression tasks. In recent years, deep learning models have become the mainstream framework for MTS forecasting. Among these methods, the transformer has proved particularly effective due to its ability to capture long- and short-term dependencies. However, the computational complexity of transformer-based models poses an obstacle in resource-constrained scenarios. To address this challenge, we propose a novel and efficient Skip-RCNN network that incorporates Skip-RNN and Skip-CNN modules to split the MTS into multiple frames with various time intervals. Thanks to the skipping process of Skip-RNN and Skip-CNN, the resulting network can jointly process information at different receptive fields and achieves better performance than state-of-the-art networks. We conducted comparative experiments using our proposed method and six baseline models on seven publicly available datasets. The results demonstrate that our model outperforms the baseline methods in accuracy under most conditions and surpasses the transformer-based model by 0.098 for a short interval and 0.068 for a long interval. Our Skip-RCNN network thus presents a promising approach to MTS forecasting that can meet the demands of resource-constrained prediction scenarios.


I. INTRODUCTION
In recent years, there has been a significant increase in the utilization of deep learning techniques for analyzing multivariate time series (MTS) data in industrial scenarios [1], [2], [3], [4]. MTS datasets comprise sequences of multiple measurements from various variables, typically recorded by sensors at uniform time intervals. By analyzing MTS data, researchers can perform a variety of classification or regression tasks, including capacity prediction and anomaly detection [5], [6], [7], [8], and they have shown great interest in MTS forecasting tasks that can be widely applied in industrial systems such as intelligent transportation systems (ITS), AIOps, and heating, ventilation, and air-conditioning (HVAC) systems [9], [10], [11], [12].
MTS forecasting is a powerful tool for predicting the future values of key variables by analyzing historical dynamic data. While it shares similarities with univariate time series forecasting in terms of leveraging historical features from a single sequence, robust MTS forecasting requires consideration of additional correlations and empirical interdependencies among different temporal sequences. Given that observed variables may be interconnected and exhibit non-linear patterns, traditional methods such as the vector autoregressive (VAR) [13] model and the Gaussian process (GP) [14] model may fail to capture these patterns. The same problem also affects support vector regression [15] and XGBoost [16]. Furthermore, these statistical models may suffer from high computational complexity when dealing with larger datasets [17], [18].
Methods based on deep learning have shown great promise in solving MTS forecasting problems by extracting both periodic and dimensional latent features from data. For example, CNN-based models are known for their ability to extract spatial features with convolutional layers and have been successfully applied in computer vision (CV) tasks [21]. However, sophisticated deep-architecture models like InceptionTime [22] and ResNet [23] may suffer from gradient vanishing and exploding [24], or be limited by their convolutional kernels in learning long-term patterns. Such models can even be outperformed by non-deep models like HIVE-COTE [25] and ROCKET [26]. LSTM-based models are capable of capturing both long- and short-term dependencies and have achieved state-of-the-art results in natural language processing (NLP) tasks [19], [20]. The transformer [27] and its variants, meanwhile, utilize multi-head attention mechanisms to learn dependencies in the sequences, achieving impressive results in MTS forecasting through unsupervised pre-training. Although transformer models have broken the monopoly of LSTM-based methods [28] in modeling the complex dynamics of time series data, they can still be vulnerable under certain circumstances [29]. Due to their complex architectures and large number of parameters, transformer models may struggle to learn patterns with limited training data, and their extensive calculations may cause over-fitting in resource-constrained scenarios. Moreover, whether the transformer model can fully capture sequential information with positional encoding remains an open question [32].
In this paper, we propose a novel and efficient Skip-RCNN model to address the aforementioned problems. The model effectively captures both interdependencies among sequences and long- and short-term dependencies with a reduced computational load. The Skip-RCNN model comprises three main modules: the Skip-CNN, Skip-RNN, and Autoregressive modules. The Skip-CNN module splits the MTS into multiple frames with various time intervals and extracts periodic features from these frames using convolutional layers. Simultaneously, the 1-D convolutional kernel extracts interlinked features among different sequences, which are then split according to different intervals and imported into the Skip-RNN module. A highway layer serves as the feature-fusing module, concatenating the different types of extracted features to produce the final forecast. We verified the efficacy of our approach on six offline public evaluation datasets. Extensive experiments show that our model outperforms six baseline models in terms of accuracy under most situations. Specifically, our work mainly contributes in the following ways:
• We introduce a novel and efficient Skip-RCNN model with an innovative architecture. The model consists of three main modules designed to extract periodic features and dimensional features from MTS.
• We leverage the Skip-CNN module, Skip-RNN module, and autoregressive module to incorporate interdependencies among sequences, long- and short-term dependencies, and self-dependencies. This enables us to effectively model complex relationships among different variables in MTS prediction tasks.

II. RELATED WORK
With a vast number of sensors recording data sequences such as temperature, electricity usage, velocity, and price on a daily basis, these sequences can vary significantly over time and be interlinked with one another. Many attempts have been made to capture the changing trends of these dynamic variables [30], [31]. Statistical models such as the vector auto-regressive moving average (VARMA) [38] model and the auto-regressive integrated moving average (ARIMA) [39] model are widely used for their analysis capability and interpretability, and can capture linear dependencies in MTS. Additionally, the Gaussian process (GP) [14] model, a Bayesian model of the distribution of MTS, is another popular option. However, these traditional approaches may perform poorly in MTS forecasting tasks due to their strong stationarity assumptions. Deep learning has provided new ideas for MTS forecasting, as it is free from stationarity assumptions and able to capture non-linear relations.
RNN-based network architectures have gained tremendous popularity for their impressive performance in natural language processing (NLP), a field sharing many similarities with MTS tasks. In 1997, Hochreiter and Schmidhuber proposed LSTM [28] as a strong baseline with multiplicative gate units to capture long- and short-term information. In 2014, Chung et al. evaluated gated recurrent units (GRU) [40] and showed that GRUs are superior to traditional recurrent units and comparable to LSTM in sequence tasks. LSTM-FCN [41] and GRU-FCN [42], introduced by Karim et al., were extended to multivariate variants [43] by converting the respective univariate models, while also extending the squeeze-and-excite block to the case of 1D sequence models to enhance accuracy.
Convolutional networks [44] have been widely used in sequence tasks for several decades. Wang et al. [45] evaluated MLP, FCN, and ResNet on a time series classification task and found that these models provided strong end-to-end baselines and achieved premium performance compared with other state-of-the-art approaches. In 2019, Fawaz et al. [22] introduced the InceptionTime model, which ensembles five deep learning models created by cascading multiple Inception modules. InceptionTime can apply multiple filters simultaneously to an input time series, allowing the network to automatically extract relevant features from both long and short time series, while also being more scalable. Tan et al. proposed MultiRocket [46], which adapts Rocket [26] by adding multiple pooling operators and transformations to improve the diversity of the generated features. This model also utilizes convolutions to extract features from both the raw input series and the first-order-difference transformed series, making it faster and more accurate. In 2022, Fauvel et al. developed XCM [48], an explainable convolutional neural network for MTS that directly extracts information relative to the observed variables and time, and achieves satisfactory generalization on both large and small datasets. In the same year, Shen and Wang proposed TCCT [47], which adds an attention module to the transformer and CNN. By applying the CSPAttention module and dilated causal convolution, the model performs better in both accuracy and speed on the benchmark dataset.
Transformer-based approaches have emerged as promising techniques for representation learning thanks to the self-attention module, achieving remarkable performance. For instance, Zerveas et al. introduced the TST [49] model based on the transformer encoder architecture, featuring an unsupervised pre-training scheme that delivers significant performance benefits in downstream tasks, even without utilizing additional unlabeled data. Similarly, Cholakov and Kolev introduced the GatedTabTransformer [50], derived from TabTransformer [27], which leverages an attention mechanism to capture the relationships among categorical features. These models utilize a standard MLP to output the final logits, achieving state-of-the-art performance on tabular datasets. Zerveas et al. [51] proposed TSiT, another novel transformer-based architecture for MTS tasks, which outperforms the contemporary SOTA and even achieves outstanding accuracy with limited data samples. Moreover, Kumar et al. [52] showed that transformer models that rank high in computer vision tasks, such as ViT [53], are also effective for MTS tasks, marking that a neural network based solely on the transformer architecture can achieve a relatively high score in MTS as well. Wendong and Jun [54] utilized the attention module and LSTM in another way, proposing a novel loss function and optimizer. They emphasized abrupt-change as well as slow-change information in the time series by considering the second-order discrete difference.
Recurrent units such as GRU and LSTM are designed to capture historical information in time series, but they are limited by gradient vanishing, making it difficult to model long-term dependencies. To address this issue, Lai et al. introduced the recurrent-skip component in LSTNet [55], which effectively leverages the periodic patterns in real-world datasets. The temporal skip connections in the recurrent structure extend the temporal span of information flow, facilitating the optimization process. By adjusting the number of hidden cells skipped, the recurrent-skip module can extract different periodic patterns (hourly, daily, or monthly) from the time series. Additionally, an attention mechanism is employed to learn the importance of each hidden representation and combine them. With this mechanism, the recurrent skip performs much better in non-seasonal time series forecasting.
An autoregressive (AR) model is a statistical representation of a type of random process used to describe time-varying processes from the original signals. The AR model predicts future values using a stochastic difference equation, which specifies that the output depends only linearly on its previous values. Together with the moving-average (MA) model, the AR model is a key component of the more general autoregressive moving-average (ARMA) and autoregressive integrated moving-average (ARIMA) models. The AR model is also a special case of the vector autoregressive (VAR) model, which consists of a system of interlocking stochastic difference equations in multiple evolving random variables.
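Concretely, an AR model of order p takes the standard form

ŷ_t = c + Σ_{i=1}^{p} φ_i y_{t−i} + ε_t,

where φ_1, ..., φ_p are the model coefficients, c is a constant, and ε_t is white noise.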
The accuracy of MTS forecasting can be significantly influenced by the trends in the input data sequences and their scale. Depending solely on the features generated by convolutional and recurrent components may lead to unsatisfactory results. To address this limitation, it is essential to consider both linear and non-linear patterns in the input data. The inclusion of a classical autoregressive (AR) model provides the necessary linear component for the original input. As a result, the LSTNet model combines the outputs of the neural network with the AR component to improve forecasting accuracy.

III. METHODOLOGY
Thanks to the skip connections in its recurrent structure and its attention module, LSTNet can efficiently extract different periodic patterns and take the lead in MTS tasks. However, LSTNet still suffers from a large model size and slow training speed due to its stacked convolutional and recurrent structure.
Inspired by LSTNet and taking one step further, we adopt multiple recurrent-skip component structures to capture complex periodicity and seasonality. Additionally, while LSTNet employs a skipping structure solely in the recurrent neural network, we believe that a similar technique in the early convolutional layer could also help capture periodic information. Thus, we propose the Multi-head Skip-RCNN model.
Fig. 1 and Fig. 2 show the overall structure of our model. In subsections B to G, we describe the network structure in detail.

A. PROBLEM DEFINITION
Given time series data Y = {y_1, y_2, ..., y_T} in which y_t ∈ R^n, where n is the feature dimension of the data, we aim to use the historical data to predict the value at a certain point in time, namely to predict y_{i+j+h} from the input {y_i, y_{i+1}, ..., y_{i+j}} ⊂ Y, where h is the interval between timestamp i + j and the predicted timestamp.
For our task, the given input is shaped as N × L_1 × D, where N, L_1, and D denote the batch size, the input window size, and the total feature dimension, respectively. The output is of shape N × L_2 × D, where L_2 is the prediction length, i.e., how many future steps the network should predict. Note that the interval between input and output is h. In our experiments we mainly focus on the case L_2 = 1; for other values of L_2, we can adjust the interval and obtain the result.
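For concreteness, the sliding-window construction described above can be sketched as follows (a minimal illustration in PyTorch; the function name and the L_2 = 1 setting are our own choices):

import torch

def make_windows(series, window=128, horizon=1):
    """series: (T, D) tensor. Returns inputs of shape (K, window, D)
    and targets of shape (K, D), where each target lies `horizon`
    steps after the end of its input window (the L_2 = 1 case)."""
    T = series.shape[0]
    xs, ys = [], []
    for i in range(T - window - horizon + 1):
        xs.append(series[i : i + window])
        ys.append(series[i + window + horizon - 1])
    return torch.stack(xs), torch.stack(ys)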

B. ONE-DIMENSIONAL CONVOLUTIONAL LAYER
The One-dimensional Convolutional Layer, identified as the Conv-1D Layer in Fig. 2, performs the crucial task of merging the features at each time step to enable subsequent feature extraction. This is achieved through a one-dimensional convolutional kernel of the same length as the feature dimension of the data (1 × D). The kernel slides along the time direction and ultimately produces an N × T matrix as output.
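A minimal sketch of this layer in PyTorch (the concrete sizes are placeholder assumptions of our own):

import torch
import torch.nn as nn

N, D, T = 32, 7, 128                 # batch, features, time (assumed)
# One kernel of length 1 in time spanning all D feature channels:
# at each time step it merges the D features into a single value.
conv1d = nn.Conv1d(in_channels=D, out_channels=1, kernel_size=1)

x = torch.randn(N, D, T)             # input as (batch, features, time)
out = conv1d(x).squeeze(1)           # output of shape (N, T)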

C. GRU BLOCK
The Gated Recurrent Unit (GRU) [56] is renowned for its capacity to extract long-term temporal dependencies. In our approach, we use the Rectified Linear Unit (ReLU) function as the hidden-update activation function. At time t, let x_t denote the input vector and h_{t−1} the previous hidden state. The update gate at time t is computed as

z_t = σ(W_z x_t + U_z h_{t−1} + b_z),

where W and U are weight matrices, σ is the sigmoid activation function, and b represents the bias term. Similarly, the reset gate at time t is calculated as

r_t = σ(W_r x_t + U_r h_{t−1} + b_r).

Then, the candidate memory content is calculated as

h̃_t = ReLU(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),

where ⊙ represents the element-wise product. Finally, the hidden state for the current time step t is calculated as

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t.

The GRU blocks are organized in series, and the terminal output of this layer is denoted h_t^{GRU}, the hidden state at time stamp t. Assuming the hidden layer size is H, we have |h_t^{GRU}| = H. The GRU block receives input of shape N × T from the One-dimensional Convolutional Layer and returns an output of shape N × H.
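These updates can be sketched as a single PyTorch cell (a simplification for illustration; applying the reset gate after the matrix product is a common implementation convention we adopt here):

import torch
import torch.nn as nn

class ReLUGRUCell(nn.Module):
    """One GRU step with a ReLU candidate activation."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, 3 * hidden_size)   # W_z, W_r, W_h
        self.U = nn.Linear(hidden_size, 3 * hidden_size)  # U_z, U_r, U_h

    def forward(self, x_t, h_prev):
        wx = self.W(x_t).chunk(3, dim=-1)
        uh = self.U(h_prev).chunk(3, dim=-1)
        z = torch.sigmoid(wx[0] + uh[0])        # update gate z_t
        r = torch.sigmoid(wx[1] + uh[1])        # reset gate r_t
        c = torch.relu(wx[2] + r * uh[2])       # candidate state (ReLU)
        return (1 - z) * h_prev + z * c         # new hidden state h_t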

D. MULTI-HEAD SKIP-RNN BLOCK
The Skip-RNN Block is designed to capture long-term cyclical patterns. As in the original GRUs, we use the ReLU activation function. However, SkipGRUs use a skip-connection method to connect the current hidden cell to the hidden cells at the same phase in adjacent periods. Let s denote the number of hidden cells skipped through. A slight modification of the GRU updating process gives us the SkipGRU updating process:

z_t = σ(W_z x_t + U_z h_{t−s} + b_z),
r_t = σ(W_r x_t + U_r h_{t−s} + b_r),
h̃_t = ReLU(W_h x_t + U_h (r_t ⊙ h_{t−s}) + b_h),
h_t = (1 − z_t) ⊙ h_{t−s} + z_t ⊙ h̃_t.

To obtain the output, we need the s hidden states from time stamp t − s + 1 to t, denoted h_{t−s+1}^{SR}, ..., h_t^{SR}. Assuming the hidden layer size is H^{SR}, we concatenate all these s vectors to form a vector of size s × H^{SR}.
The original LSTNet model incorporated only one SkipGRU block to capture periodic long-term dependencies. However, we discovered that these relationships can sometimes be more complex and challenging to capture at varying granularities. To address this issue, we employ multiple SkipGRU blocks with different s values, denoted P_1^{SR}, P_2^{SR}, ..., P_{N^{SR}}^{SR}. Each SkipGRU block i accepts input of shape N × T from the One-dimensional Convolutional Layer and produces an output of shape N × P_i^{SR} H^{SR}. Since the matrices are of different sizes due to the various s values, we concatenate them along the last dimension to form a final output of shape N × Σ_{i=1}^{N^{SR}} P_i^{SR} H^{SR}, and we denote H'^{SR} = Σ_{i=1}^{N^{SR}} P_i^{SR} H^{SR}. This block has the potential to capture more complex periodic patterns, leading to improved predictions.
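The skipping itself can be implemented by reshaping, so that an off-the-shelf recurrent layer connects hidden states s steps apart. A minimal sketch for one head (we use the stock nn.GRU with its default tanh activation for brevity, where our model uses the ReLU variant above; all names and sizes are our own):

import torch
import torch.nn as nn

class SkipGRUHead(nn.Module):
    """Runs a GRU over every s-th time step by folding the s phases
    into the batch dimension."""
    def __init__(self, input_size, hidden_size, s):
        super().__init__()
        self.s = s
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, x):                    # x: (N, T, C)
        N, T, C = x.shape
        T_trim = (T // self.s) * self.s
        x = x[:, T - T_trim:, :]             # drop the oldest leftovers
        # (N, T_trim//s, s, C) -> (N*s, T_trim//s, C): each of the s
        # phases becomes its own sequence of period-spaced steps.
        x = x.reshape(N, T_trim // self.s, self.s, C)
        x = x.permute(0, 2, 1, 3).reshape(N * self.s, T_trim // self.s, C)
        _, h = self.gru(x)                   # h: (1, N*s, H)
        return h.squeeze(0).reshape(N, self.s * h.shape[-1])   # (N, s*H)

# Multi-head variant: concatenate heads with different skip lengths s.
heads = nn.ModuleList([SkipGRUHead(1, 8, s) for s in (6, 12, 24)])
x = torch.randn(32, 128, 1)
out = torch.cat([head(x) for head in heads], dim=-1)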

E. MULTI-HEAD SKIP-CNN BLOCK
The Multi-head Skip-CNN block is a structure parallel to the One-dimensional Convolutional Layer. It consists of several Skip-CNN blocks with N^{SC} different skipping steps, denoted P_1^{SC}, P_2^{SC}, ..., P_{N^{SC}}^{SC}. Each Skip-CNN block contains two layers: a Skip-CNN layer and a GRU layer.

1) SKIP-CNN LAYER
A Skip-CNN layer extracts features by performing a two-dimensional convolution operation on the original input data. The size of the convolutional kernels is set to C × D. The first dimension C can be set freely, while the second dimension equals the feature dimension of the input data. This allows the layer to integrate information over the full range of features while focusing on both local and short-term time-series dependencies of the multidimensional features.
In a Skip-CNN layer, we first re-patch the data instead of sliding directly over the original time series. We denote the skipping step as S. We divide the data into M = ⌊T/S⌋ segments of equal length S along the time dimension; any excess is rounded off, and the input is reshaped into N × M × S × D. We then exchange the second and third dimensions to obtain N × S × M × D. Finally, a kernel of size C × D operates on the last two dimensions. Denoting the size of the hidden layer (i.e., the number of kernels) as H_1^{SC}, the output has shape N × M × H_1^{SC} × (S − C + 1). We observe that setting different skipping steps S and kernel lengths C results in different final dimensions of the output. To make subsequent processing convenient, we sum over the last dimension, making the final output size of this layer N × M × H_1^{SC}.
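One consistent reading of these shapes can be sketched as follows (re-patching followed by a within-segment convolution; the concrete sizes and the exact permutation are our own assumptions, since only the final output shapes above are fixed):

import torch
import torch.nn as nn

N, T, D = 32, 128, 7        # batch, time, features (assumed values)
S, C, H1 = 24, 3, 16        # skip step, kernel length, number of kernels

x = torch.randn(N, T, D)
M = T // S                              # number of length-S segments
x = x[:, : M * S, :].reshape(N, M, S, D)     # round off the excess

conv = nn.Conv2d(in_channels=1, out_channels=H1, kernel_size=(C, D))
# Fold the segments into the batch so the (C x D) kernel slides along
# each segment: (N*M, 1, S, D) -> (N*M, H1, S-C+1, 1).
feat = conv(x.reshape(N * M, 1, S, D))
# Recover the segment axis and sum over the sliding positions to get
# a fixed-size output of shape (N, M, H1).
feat = feat.reshape(N, M, H1, S - C + 1).sum(dim=-1)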

2) GRU LAYER
The structure of the GRU here is no different from that described above. Denote its hidden size as H_2^{SC}. It accepts the input from the Skip-CNN layer, with batch size N and feature size H_1^{SC}, and returns an output of shape N × H_2^{SC}. In the Multi-head Skip-CNN block, we construct multiple Skip-CNN blocks with different skipping steps and finally, as in the Multi-head Skip-RNN block, concatenate their outputs along the last dimension into the shape N × N^{SC} H_2^{SC}. For convenience, let H'^{SC} = N^{SC} H_2^{SC}.

F. DENSE LAYER
Each output from the GRU block, the Multi-head Skip-CNN block, and the Multi-head Skip-RNN block is a two-dimensional matrix whose first dimension is the batch size N. These outputs are concatenated along the second dimension to form a matrix of shape N × (H + H'^{SR} + H'^{SC}), which is then used as the input to the dense layer. The dense layer outputs a matrix of size N × D, where D equals the feature dimension of the original input data.

G. AUTOREGRESSIVE LAYER
To address the issue of local scaling, we leverage the Autoregressive Layer as a highway. We have observed that some features in certain datasets exhibit rapid fluctuations, which can be challenging for deep networks to capture. To overcome this challenge, we incorporate a simple Autoregressive Layer that takes the original data as input and produces a matrix of shape N × D, matching that of the Dense Layer described above.
The outputs of the Dense Layer and the Autoregressive Layer are summed up to produce the final output of our model.
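Putting the last two layers together, a minimal sketch of the fusion (the per-feature AR head sharing one linear map over the input window is our own simplification):

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Dense layer over the concatenated deep features plus a linear
    autoregressive highway over the raw input window."""
    def __init__(self, H, H_sr, H_sc, D, window):
        super().__init__()
        self.dense = nn.Linear(H + H_sr + H_sc, D)
        self.ar = nn.Linear(window, 1)          # AR weights over time

    def forward(self, h_gru, h_sr, h_sc, x):    # x: (N, window, D)
        deep = self.dense(torch.cat([h_gru, h_sr, h_sc], dim=-1))  # (N, D)
        linear = self.ar(x.transpose(1, 2)).squeeze(-1)            # (N, D)
        return deep + linear                    # final output, (N, D)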

IV. EXPERIMENT
A. DATASET
We conducted experiments on six public datasets to evaluate the effectiveness of our proposed approach in MTS forecasting. Specifically, we selected the ETTh1, ETTh2, ETTm1, ETTm2, and ET datasets, as they are commonly used in research on power systems. Additionally, to assess the generalization ability of our model, we included the AIOps dataset in our testing.
• ETT dataset [56]: The ETT dataset includes ETTh1, ETTh2, ETTm1, and ETTm2, which provide electricity consumption data from transformers. ETTm1 and ETTm2 were collected every minute from transformers in different regions, while ETTh1 and ETTh2 record hourly data from the same transformers.
• ET dataset: The ET dataset contains electricity consumption data in kWh recorded every 15 minutes from 2012 to 2014, for n = 321 clients.
• AIOps dataset: The AIOps dataset includes 101,583 records, each at a 5-minute interval, extracted from a trading platform's backend system in 2019. It contains 20 feature dimensions, including "Disk response time," "Disk throughput," "Network throughput," and "Memory usage" for the four logical partitions A, B, C, and D, as well as the number of tasks per second, the CPU load, and the response time of tasks. We explain how to obtain these datasets in the Data Availability Statement at the end of this paper. For each experiment, we randomly divide the dataset into training, validation, and testing sets at a ratio of 6:2:2.
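A minimal sketch of the 6:2:2 split (random shuffling of sample indices is our reading of "randomly divide"):

import numpy as np

def split_indices(num_samples, seed=0):
    """Return train/validation/test index arrays at a 6:2:2 ratio."""
    idx = np.random.default_rng(seed).permutation(num_samples)
    n_train = int(0.6 * num_samples)
    n_val = int(0.2 * num_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]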

B. EVALUATION METRICS AND BASELINES
To evaluate the performance of our model, we use three metrics: mean absolute error (MAE), mean squared error (MSE), and the Pearson correlation coefficient (CORR). Let n denote the number of samples, and let y = (y_1, y_2, ..., y_n) and ŷ = (ŷ_1, ŷ_2, ..., ŷ_n) be the ground-truth signals and the system's predicted signals, respectively. Additionally, we define ȳ and ȳ′ as the means of the elements of y and ŷ, respectively.

The MAE measures the average absolute difference between the elements of y and ŷ. The MSE, on the other hand, measures the average of the squared differences between y and ŷ. Finally, the CORR measures the linear correlation between y and ŷ, ranging from −1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation. They are calculated as follows:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,
CORR = Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − ȳ′) / √( Σ_{i=1}^{n} (y_i − ȳ)² · Σ_{i=1}^{n} (ŷ_i − ȳ′)² ).

A minimal sketch computing these metrics is given after the baseline list below. We use the states from timestamps t + 1 to t + 128 to predict the state at timestamp t + h + 128 on the test dataset, and compare the performance of our method with six commonly used baseline methods: TPA-LSTM, LSTNet, InceptionTime, TCCT, TSiT, and GRU-FCN.
• TPA-LSTM: TPA stands for Temporal Pattern Attention; the model uses filters to extract time-invariant temporal patterns, similar to transforming the signal into the frequency domain.
• LSTNet: A neural network that combines CNN and RNN, along with AR, featuring a novel recurrent-skip structure.
• InceptionTime: A model based on the Inception architecture, which consists of five different Inception networks and adapts the concept of Receptive Field.
• TCCT: A tightly coupled convolutional transformer with three proposed architectures that apply transformed CNN structures to the transformer: CSPAttention, dilated causal convolution, and a passthrough mechanism.
• TSiT: A framework based on the transformer encoder architecture, which includes an unsupervised pretraining scheme.
• GRU-FCN: A hybrid deep learning model that combines a GRU with a fully convolutional network (FCN).
• SVM: A traditional machine learning method that tries to fit the data such that deviations beyond a certain threshold are minimized.
• XGBoost: A traditional machine learning method that uses gradient boost strategy by sequentially adding predictors to an ensemble, each one correcting its predecessor.
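As promised above, a minimal sketch computing the three evaluation metrics on flattened prediction arrays:

import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def corr(y, y_hat):
    # Pearson correlation coefficient between truth and prediction.
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(yc * pc) / np.sqrt(np.sum(yc ** 2) * np.sum(pc ** 2))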

C. EXPERIMENTAL SETTINGS
All seven deep learning methods were implemented using the PyTorch framework. Stochastic gradient descent (SGD) was used as the optimizer for each method, with an initial learning rate of 10⁻², a weight decay of 5 × 10⁻⁴, a momentum of 0.9, a batch size of 32, and 30 epochs of training. The learning rate was decayed by a factor of 0.5 every 5 epochs. We applied grid search to find the optimal hyperparameters. The mean squared error (MSE) loss was used for training each network, and the best-performing model on the validation set across all epochs was used for testing. Without loss of generality, the input window size is set to 128. Furthermore, we pre-processed the raw data by normalizing it: we subtracted the mean of each dimension and then divided by its respective variance. We uniformly trained, validated, and tested each model on the preprocessed data.
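These settings correspond to the following PyTorch setup (the model constructor is a stand-in; only the hyperparameters are from the text):

import torch
import torch.nn as nn

model = nn.Linear(128, 7)   # stand-in for the actual Skip-RCNN network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by 0.5 every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
criterion = nn.MSELoss()

for epoch in range(30):
    # ... train on mini-batches of size 32, then validate and keep
    # the best-performing checkpoint on the validation set ...
    scheduler.step()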
As for the baseline models, we adapt the network from tsai [57].

D. RESULTS
We conducted multiple experiments by varying the interval between the time stamp of the input data and the prediction result. In this paper, we present the results for intervals of 1 and 16.
Fig. 3 and Fig. 4 show the prediction results of our proposed Skip-RCNN and the baseline models compared with the ground truth of a randomly selected feature (Low Useless Load, LULL) from the ETTm1 dataset for both interval values. The red line represents the prediction result, while the blue line represents the ground truth.

1) COMPARISON OF SKIP-RCNN WITH DIFFERENT SKIP STEP
As the skip step is one of the most important hyperparameters that must be predetermined for a specific task, we conducted experiments comparing the same Skip-RCNN model with different skip steps on the ETTm1 dataset, and we find that choosing a skip step of 36 gives the best performance (see Fig. 5). For simplicity, the Skip-RNN and Skip-CNN share the same skip step. Still, the skip step affects model performance significantly: a small step size traps the model in local information, while a large step size neglects features that appear over short periods.

2) COMPARISON OF SKIP-RCNN AND OTHER BASELINE MODELS
Table 1 shows the MAE, MSE, and CORR values between the ground truth and the prediction results obtained from our proposed Skip-RCNN and the baseline methods on seven public datasets. Here we choose a skip step of 24 due to its outstanding performance, as shown in the previous section. We use normalized results to facilitate comparison. Table 2 shows the mean ± std of the MAE, MSE, and CORR values calculated over all seven datasets for each method, using the results presented in Table 1.
From Table 1, we observe that our method consistently outperforms the baseline methods in terms of MAE, MSE, and CORR on the ETTm1 and ETTm2 datasets. However, our method fails to perform well on the ETTh1 and ETTh2 datasets when the interval is set to 16. Additionally, when the interval is set to 1, the CORR value for ETTh2 is 0.896, which is worse than that of TPA-LSTM. The difference between ETTh1 and ETTm1 is only the granularity of the time stamps, and the same applies to ETTh2 and ETTm2. We conclude that as the granularity of the time stamps coarsens, our model's performance degrades faster than that of the other models. We argue that when the temporal granularity is fine, the same periodic laws span more time steps, and models such as LSTNet tend to get caught in local dependencies and cannot memorize periodic dependencies at multiple granularities. Our model solves this problem effectively with its multiple skipping structures and takes the lead in this scenario. However, when the time granularity is already coarse and the periodicity pattern is simple, LSTNet and the other models are able to capture this information well, and our model no longer has an advantage and may even over-fit.

TABLE 1. Comparison of MAE, MSE, and CORR between the predicted and ground-truth values for the ETTh1, ETTh2, ETTm1, ETTm2, ET, and AIOps datasets. The comparison was conducted using our Skip-RCNN (skip = 24) and the other baseline models with interval values of 1 and 16. The best performance in each metric is in bold font. The networks take an input composed of the previous 128 data points and give one prediction after the corresponding interval.

TABLE 2. The mean ± standard deviation of MAE, MSE, and CORR between the predictions and the ground truth for Skip-RCNN and the other baseline models with intervals of 1 and 16, calculated over the ETTh1, ETTh2, ETTm1, ETTm2, ET, and AIOps datasets. The best performance in each metric is in bold font. The networks take an input composed of the previous 128 data points and give one prediction after the corresponding interval.
For the ET and ER datasets, our model performs best on all metrics when the interval is 1. When the interval is 16, we obtain the best MSE (0.156) and CORR (0.910) for ET and the best CORR (0.937) for ER. Based on these results, we conclude that our method performs better when the predicted timestamp is closer to the input timestamp, and that the prediction advantage decreases significantly as the predicted timestamp moves farther away. Interestingly, our approach fails in all cases on the AIOps dataset; we discuss this dataset further in our ablation experiments. This can be attributed to the fact that this dataset lacks a stable long-period pattern compared with the other datasets, which causes our additional means of feature extraction to backfire.
From Table 2, we find that our method has the lowest average MAE and MSE values and the highest average CORR value across all cases. The variance of our CORR values is the smallest at an interval of 1, while the variances of our MAE and MSE values are the smallest at an interval of 16. Thus, our method exhibits both high accuracy and stability.

3) COMPARISON OF MODEL SIZE
The number of parameters and FLOPs of each method are calculated and listed in Table 3.
As shown in Table 3, given that our model is adapted from LSTNet, which is a relatively large model, its total number of parameters is larger than that of the other lightweight models. As for computation time, our model gains a considerable improvement over the original LSTNet and achieves a FLOPs requirement comparable to the other lightweight models. Therefore, we conclude that despite its relatively large model size, our proposed Skip-RCNN can still find wide application in real-time or resource-constrained scenarios.

E. ABLATION STUDY
To evaluate the contribution of each component of our proposed model, we conducted an ablation study by removing one of the Multi-head Skip-CNN block (SC), the Multi-head Skip-RNN block (SR), and the Autoregressive Layer (AR) from the model. We trained and tested the models in the same way and with the same data as the full model. The performance differences between the ablation models and the full model are compared in Table 4, with the interval set to 16.
Our proposed model achieved the best MAE, MSE, and CORR on the ETTh1, ETTh2, and ETTm2 datasets; on ETTm1 it achieved the best MAE and MSE but a lower CORR than some of the ablation models. On the ET dataset, it achieved the highest CORR but did not perform well in MAE and MSE. Notably, all three metrics dropped significantly in all cases when the AR module was removed. These results suggest that the Skip-CNN and Skip-RNN modules improve model performance across the board, sometimes trading off among MAE, MSE, and CORR, while the Autoregressive module consistently improves the predictive power of the model. We hypothesize that the Autoregressive module is especially important because these datasets exhibit a scaling problem, with data fluctuating drastically within short periods of time. Without the AR module, the model cannot capture this local information well.
Moreover, on the AIOps dataset, all three metrics improved after removing the SC module. Combined with Table 1, this again confirms that this dataset contains no long-period pattern matching our expectation, so the additional feature extraction backfires through over-fitting.

V. CONCLUSION
In this paper, we have addressed the challenging problem of feature capture for long time series with multi-granularity periodic and seasonal data. Our proposed Skip-RCNN model, with multiple step lengths, has demonstrated superior performance over several baseline methods. This is particularly true when dealing with fine-grained time data and complex, diverse periods.
However, one limitation of our approach is the need to manually optimize the step size. To address this issue, we plan to explore automated machine learning techniques to select an appropriate step size for different datasets. With these enhancements, we believe our approach can be further improved and extended to a wide range of applications.

FIGURE 1. Main model structure of our proposed method, containing the Skip-RNN, Skip-CNN, and highway autoregressive modules.

FIGURE 2. Model layouts for the Skip-CNN and Skip-RNN modules, where D denotes the feature dimension, N the batch size, T the total number of time stamps, and H_2^{SC} the feature number after Skip-CNN and Skip-RNN.

FIGURE 3. The prediction results of our proposed Skip-RCNN, LSTNet, TPA-LSTM, and GRU-FCN are separately compared with the ground truth of a randomly selected feature from the ETTm1 dataset, using an interval setting of 1. The prediction results are shown by the red line, while the ground truth is represented by the blue line. The y-axis represents the LULL (Low Useless Load).

FIGURE 4. The prediction results of our proposed Skip-RCNN and the other baseline models are separately compared with the ground truth of a randomly selected feature from the ETTm1 dataset, using an interval setting of 16. The prediction results are shown by the red line, while the ground truth is represented by the blue line. The y-axis represents the LULL (Low Useless Load).

FIGURE 5. Comparison of the performance of Skip-RCNN using different step sizes. The y-axis corresponds to the CORR on the ETTm1 dataset, while the x-axis is the step size used.

TABLE 3. Model efficiency comparison of Skip-RCNN and the other baseline models on two metrics: (1) the number of parameters and (2) the FLOPs needed to compute the inference of one batch of data on the AIOps dataset.

TABLE 4. Performance comparison of Skip-RCNN and modified variants of our model in terms of MAE, MSE, and CORR between the predicted and ground-truth values for the ETTh1, ETTh2, ETTm1, ETTm2, ET, and AIOps datasets. The performance metrics were computed with an interval of 16. The modified variants exclude the SR, SC, and AR modules from our original model, respectively. The best performance in each metric is in bold font. SR is short for Skip-RNN, SC for Skip-CNN, and AR for autoregressive.