Multi-Task Learning of the PatchTCN-TST Model for Short-Term Multi-Load Energy Forecasting Considering Indoor Environments in a Smart Building

Energy consumption in buildings accounts for over a third of global energy consumption and 28% of greenhouse gas emissions. With urbanization and population growth, rising building energy demand can lead to environmental degradation. While significant renewable resources are used to generate electricity to mitigate environmental problems, demand-side management remains crucial for achieving net-zero emissions and enhancing energy efficiency. Accurate building load forecasting is pivotal in devising optimal demand response schemes to shift or reduce the demand on power grids. Recent studies have achieved progressive breakthroughs in building energy forecasting through machine learning algorithms. However, most studies have focused on building-level energy forecasting rather than individual load forecasting, which cannot support controlled demand response programs. In this study, we propose a multi-task learning model incorporating Patch, Temporal Convolutional Network, and Time-Series Transformer (PatchTCN-TST) based on the channel-independent strategy for floor-level multi-load and indoor environmental forecasting. The PatchTCN-TST model is implemented to predict future data from one-step to three-step ahead for a real-world office building in Bangkok, Thailand. The experimental results indicate that our model outperforms prevalent methods, including LSTM, GRU, TCN, Transformer, Informer, and Autoformer. The PatchTCN-TST model demonstrates superior accuracy in all three forecasting scenarios, reducing MAE, MSE, RMSE, and aSMAPE by 34%, 23%, 12%, and 36.4%, respectively, compared to the best baseline model.


I. INTRODUCTION
Over the past two decades, global annual energy consumption has risen by 41%, reaching 178,889 TWh, driven by accelerated industrial, infrastructure, and economic growth [1]. The predominant reliance on non-renewable energy sources, coupled with rising energy demand, has resulted in an exponential increase in greenhouse gas emissions. Such emissions contribute to anthropogenic environmental issues such as global warming, sea-level rise, and ocean acidification [2], [3], [4]. Considering the finite reserves of non-renewable resources and heightened environmental concerns, there has been an augmented integration of Renewable Energy Sources (RESs) into the power grid [5].
While RESs present an eco-friendly, abundant, secure, and scalable solution, their power generation can be fluctuating, intermittent, and non-dispatchable. The high penetration of RESs can cause disparities between power generation and consumption, threatening power system stability [6], especially with the large-scale installation of wind turbines and photovoltaic (PV) panels. To overcome the limitations of RESs, an array of Energy Storage Systems (ESSs) has been employed in microgrids. ESSs enable energy to be stored and discharged when required, which helps meet energy demand, improve power quality, and augment grid flexibility [7]. Additionally, Demand-Side Management (DSM) is also crucial for ensuring power grid stability and improving energy efficiency [8], [9].
The International Energy Agency (IEA) indicates that the building sector accounted for 30% of global energy consumption and 28% of greenhouse gas emissions in 2022 [10]. Consequently, building energy management is a critical component of DSM for global energy saving and carbon neutrality [11]. Generally, buildings can be broadly categorized into residential, commercial, office, and industrial types. It should be noted that energy usage patterns vary with building type and operational schedule. Accurate forecasting of building energy demand is crucial within DSM, facilitating the implementation of optimal control strategies and promoting energy efficiency. Building energy consumption forecasting can be categorized into ultra-short-term, short-term, medium-term, and long-term, with forecasting horizons ranging from hours to years [12].
In recent years, numerous methods based on Machine Learning (ML) models have been used for building energy forecasting [13]. However, most scholars have concentrated on building-level energy consumption forecasting, which predicts the entire building's energy consumption. Although forecasting performance has improved with advanced algorithms, achieving optimal energy efficiency control is still challenging due to the limited knowledge of individual load consumption forecasting. To fill this gap, we propose a PatchTCN-TST model for floor-level multi-load consumption and indoor environmental forecasting in smart buildings. Compared with previous studies, our proposed model forecasts multiple load consumptions rather than total building energy consumption for energy management and optimal scheduling. Unlike previous multivariate forecasting approaches, our proposed model is based on the channel-independent strategy, assuming each load pattern is unrelated. The main contributions of this study can be summarized as follows:
1) We applied a TCN residual block for scalar projection rather than the traditional CNN projection in the input representation. The main difference between TCN and CNN is that TCN uses causal convolutions to ensure the model extracts features only from previous and current information, which is more reasonable for sequential modeling.
2) Our multi-load consumption and indoor environmental forecasting method is based on multivariate data, which involves multi-channel input and multi-channel output. Most models adopt the channel-dependent strategy and extract features from all time-series channels jointly. However, it has been shown that cross-channel fusion performs worse than channel independence in time-series forecasting [14], [15]. Hence, the channel-independent strategy is adopted to extract features from each single-channel series while sharing the same parameters.
3) To capture the relationships between sub-series, patching is applied before the attention mechanisms, splitting the input sequence into several sub-series. Unlike most previous models, patching allows the model to extract locality and comprehensive information from sub-series rather than point-wise dependencies from the entire series, as the Transformer does. Moreover, the patching operation significantly reduces the computational complexity of the attention mechanism.
The remainder of this paper is organized as follows. Section II reviews the recent literature on building energy forecasting. Section III presents the PatchTCN-TST model for multi-task learning in multi-load and indoor environmental forecasting. Section IV presents the case study and hyperparameter configurations. Section V analyzes the forecasting performance of our model and the benchmark models, and Section VI concludes the study.

II. LITERATURE REVIEW
Traditional building load forecasting methods rely on physics-based models, which simulate building energy consumption based on physical properties and environmental parameters without historical energy consumption data. Various simulation software tools, such as EnergyPlus, eQUEST, and ESP-r, have been developed for building energy consumption forecasting [16], [17]. However, constructing physics-based models is time-consuming, requires numerous variables, and yields forecasting performance that relies on domain expertise, which constrains real-time DSM. With the advocacy of smart buildings, advanced sensors and energy monitoring units are deployed to harvest indoor environmental and energy consumption data [18], [19]. In light of the abundant data availability and the rapid advancement of ML and Deep Learning (DL) techniques, data-driven models have recently attracted more attention for building energy forecasting [20]. Data-driven models forecast future energy consumption using historical time-series energy consumption and exogenous variables without any intricate building specifics. Data-driven methods for building energy consumption forecasting can be divided into univariate and multivariate approaches based on the number of input features in the model [21]. Univariate methods predict energy consumption using energy consumption data only, while multivariate methods incorporate additional energy-related features, such as humidity, temperature, wind speed, and ambient light.
Early data-driven models utilized statistical methods for building energy consumption forecasting, primarily including Autoregression (AR), exponential smoothing, Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated Moving Average (SARIMA). Vu et al. introduced an AR model with time-varying components for short-term energy demand forecasting, which achieved the best performance compared with five benchmark models [22]. Sen et al. employed the ARIMA model to forecast the energy consumption of India's iron sector [23]. Fang and Lahdelma applied SARIMA combined with linear regression to predict heat demand based on multivariate data [24].
Although statistical models perform well in stationary and linear time-series forecasting scenarios, they cannot accurately predict building energy consumption with nonlinear and intricate temporal patterns. To discover nonlinear energy consumption patterns, ML-based models have been widely adopted to forecast building energy consumption, such as Random Forest (RF) [25], Artificial Neural Networks (ANNs) [26], and Support Vector Regression (SVR) [27]. Wang et al. used RF to predict hourly building energy consumption in two institutional buildings at the University of Florida [28]; it showed superior forecasting performance compared with the regression tree model and SVR. Lahouar and Ben Hadj Slama forecast the day-ahead load of the Tunisian Power Company using refined inputs and RF [29]. The experimental results showed that the expert feature selection method improved the RF forecasting performance. Ahmad et al. developed an ANN model for hourly HVAC energy consumption forecasting in a hotel using multivariate time-series data, including outdoor air temperature, humidity, and wind speed [30]. Yang et al. combined SVR with k-shape clustering to improve forecasting accuracy across ten institutional buildings of different types [31].
Furthermore, advanced algorithms with more sophisticated structures have been proposed to enhance building energy consumption forecasting accuracy, such as the Convolutional Neural Network (CNN) [32], Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Temporal Convolutional Network (TCN). Aurangzeb et al. grouped energy customers via the DBSCAN clustering algorithm and built a Pyramid-CNN model to forecast the power load [33]; the proposed Pyramid-CNN obtained the best MAPE score compared with other ML algorithms. Tan et al. carried out multi-task learning via an LSTM (MTL-LSTM) model for predicting the total load as well as the electricity, heat, and cooling loads [34]. Kim and Cho proposed a CNN-LSTM model based on multivariate time-series data to predict residential energy consumption [35]; compared with other methods, the evaluation results showed the best performance for CNN-LSTM under different time resolutions. Sajjad et al. proposed a CNN-GRU framework for short-term building energy prediction and evaluated the model on two datasets [36]. Lemos et al. applied the TCN model to monthly energy consumption forecasting in eight different types of buildings [37]. TCN uses dilated causal convolutions to extract features from past information over a broader receptive field, which is more suitable for time-series tasks [38].
Moreover, various Transformer-based models have been proposed for sequential tasks in the past five years [39], [40], [41], [42], [43], performing better than previous algorithms. In the building energy forecasting area, Zhao et al. employed a Transformer model combined with K-Means and LightGBM for day-ahead load forecasting [44]. In [45], the authors proposed a Multiple-Decoder Transformer (MultiDeT) model for day-ahead multi-energy load forecasting, which comprises one encoder and multiple decoders. Jiang et al. proposed a Deep-Autoformer that decomposes the series into seasonal and trend parts and designed an auto-correlation mechanism that enables the model to discover series-wise dependencies for day-ahead residential load forecasting [46]; their proposed method achieved the best results compared to five basic models. Given the superior performance of Transformer-based models in building energy consumption forecasting, this paper proposes a PatchTCN-TST framework, providing State-Of-The-Art (SOTA) results as a baseline for smart building multi-load and indoor environmental forecasting.

III. PROPOSED METHOD
This section proposes a novel PatchTCN-TST model for multi-load and indoor environmental forecasting. Our proposed model comprises three main parts: the patching operation, TCN embedding, and a TST-based stacked encoder structure. The Single-Task Learning (STL) and Multi-Task Learning (MTL) frameworks are depicted in Figs. 1 and 2, respectively. As illustrated in Fig. 1, STL requires a separate model for each task, which is time-consuming and wastes computational resources. In contrast, MTL is an efficient method that uses shared parameters to train a unified model for multiple tasks concurrently.
Focusing on multi-load energy consumption and indoor environmental forecasting, we propose an end-to-end DL-based PatchTCN-TST model without external feature selection and extraction. The framework of the model is shown in Fig. 3; it consists of instance normalization and de-normalization, patching, embedding, and an encoder-only structure. PatchTCN-TST predicts multi-load energy demand and indoor environments based on the channel-independent strategy, which forecasts each univariate series individually. First, the univariate load data is normalized by its mean and standard deviation, then segmented into several sub-series. In the embedding stage, the TCN block extracts temporal features and transforms each sub-series to align with the desired model dimensions. Then, positional information is incorporated into the embedded sequence to help the model discover relative positions. Subsequently, stacked Transformer encoders and a linear layer are employed to predict the normalized single-load energy demand. Finally, the individual univariate predictions are de-normalized and concatenated to generate the multivariate outputs.
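To make the channel-independent pipeline concrete, the following is a minimal PyTorch sketch of the normalize, reshape, shared-backbone, de-normalize flow described above. The wrapper class, the toy linear backbone, and the 1e-5 stabilizer are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ChannelIndependentWrapper(nn.Module):
    """Wraps any univariate backbone so every channel of a multivariate
    series is forecast separately with shared parameters (a sketch)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone                 # maps (B*M, 1, L) -> (B*M, T)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, M, L)
        B, M, L = x.shape
        mu = x.mean(dim=-1, keepdim=True)                 # per-instance statistics
        sigma = x.std(dim=-1, keepdim=True) + 1e-5        # assumed stabilizer
        x = (x - mu) / sigma                              # instance normalization
        y = self.backbone(x.reshape(B * M, 1, L))         # shared weights per channel
        y = y.reshape(B, M, -1)
        return y * sigma + mu                             # de-normalization

# Usage with a toy linear backbone (stand-in for patching + TCN + encoders):
toy = nn.Sequential(nn.Flatten(), nn.Linear(30, 3))       # L=30 history -> T=3 steps
model = ChannelIndependentWrapper(toy)
print(model(torch.randn(4, 8, 30)).shape)                 # torch.Size([4, 8, 3])
```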

A. INSTANCE NORMALIZATION AND DE-NORMALIZATION
In real-world scenarios, most energy load consumption exhibits non-stationary characteristics. Many studies attempt to transform the series into a stationary one to improve forecasting performance. However, recognizing and accounting for the inherent non-stationarity is essential for accurate forecasting, as it enables the model to capture temporal dependencies effectively. To mitigate the impact of non-stationarity, our model incorporates normalization and de-normalization modules: the non-stationary series is first transformed toward stationarity, and the initial non-stationarity is reverted at the end to obtain the final outputs. For multi-load and indoor environmental data $X \in \mathbb{R}^{M \times L}$, the normalization operation can be defined as follows:

$$X' = \frac{X - \mu_X}{\sqrt{\sigma_X + \epsilon}}$$

where $\mu_X, \sigma_X \in \mathbb{R}^{M \times 1}$ represent the mean and standard deviation of each measurement unit, and $\epsilon$ is a small constant that prevents division by zero.

B. PATCHING
Most Transformer-based models commonly employ attention mechanisms to extract point-wise dependencies, which mainly focus on the relationships between individual points and easily disregard series-wise dependencies from the past. Consequently, point-wise attention has limitations in time-series tasks, as the current data is related to historical data.
Considering this constraint of attention mechanisms, we adopt the patching operation to segment the input sequence into several patches regarded as partial features, as shown in Fig. 3. For a univariate input sequence $X^i \in \mathbb{R}^{1 \times L}$, a patch length $P$ and stride $S$ are used to segment the padded univariate data $X^i_{\mathrm{padding}} \in \mathbb{R}^{1 \times (L+S)}$, where the last boundary value is repeated $S$ times at the end of the input sequence $X^i$. The generated patches can be denoted as $X^i_{\mathrm{patch}} \in \mathbb{R}^{P \times N}$, where $N = \lfloor (L - P)/S \rfloor + 2$ is the number of patches. The patching process is visualized in Fig. 4. Moreover, patching accelerates the computation in the attention mechanisms by shortening the input length from $L$ to approximately $L/S$.
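As a concrete illustration of these definitions, the sketch below pads the series by replicating the boundary value $S$ times and then windows it with patch length $P$ and stride $S$. The function name and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def make_patches(x: torch.Tensor, P: int, S: int) -> torch.Tensor:
    """x: (B, 1, L) univariate series -> patches of shape (B, N, P)."""
    x_pad = F.pad(x, (0, S), mode="replicate")             # repeat boundary value S times
    patches = x_pad.unfold(dimension=-1, size=P, step=S)   # (B, 1, N, P)
    return patches.squeeze(1)                              # N = floor((L - P) / S) + 2

x = torch.randn(8, 1, 30)                 # L = 30, as in the tuned configuration
print(make_patches(x, P=16, S=8).shape)   # torch.Size([8, 3, 16])
```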

C. TOKEN EMBEDDING AND POSITIONAL ENCODING
In the token embedding stage, the vanilla Transformer and its variants employ a 1D-CNN to transform the input sequence into the embedding dimension $d_{\mathrm{model}}$. However, it should be noted that the output of a 1D-CNN at time $t$ is a mapping from past, current, and future data, which is inappropriate for forecasting tasks. To exclude future data, we select the TCN block for token embedding, which consists of two causal convolutional layers; the detailed structure is shown in Fig. 5. The TCN block operation can be defined as:

$$\mathrm{TCN}(X) = \mathrm{ReLU}\big(F(X) + \mathrm{Conv}_{1\times 1}(X)\big)$$

where $F(\cdot)$ and $\mathrm{Conv}_{1\times 1}(\cdot)$ represent the causal convolutional part and the residual connection, and $\mathrm{ReLU}(\cdot)$ is the Rectified Linear Unit activation function. The TCN block uses causal convolution, so the output at time $t$ is a mapping only from the elements at time $t$ and earlier. To speed up the training process and improve generalization, weight normalization is applied to reparameterize the weights within a fixed range after each causal convolutional layer [47]. The causal convolution at time $t$ and weight normalization can be defined as follows:

$$F(x_t) = (w * x)_t = \sum_{i=0}^{k-1} w_i \, x_{t-i}, \qquad w_{\mathrm{norm}} = \frac{g}{\|v\|}\, v$$

where $*$ denotes the convolutional operation, $k$ is the kernel size of the causal convolution, $w$ and $w_{\mathrm{norm}}$ denote the original and normalized weights, and $\|v\|$ and $g$ are the Euclidean norm and the scalar magnitude of the original weight.
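A minimal sketch of such a TCN residual block is given below, assuming a kernel size of 3 and the paper's tuned dropout rate of 0.05; the class names and layer ordering are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class CausalConv1d(nn.Module):
    """1D convolution that only sees time steps <= t (left padding only)."""
    def __init__(self, c_in: int, c_out: int, k: int):
        super().__init__()
        self.pad = k - 1
        self.conv = weight_norm(nn.Conv1d(c_in, c_out, k))  # weight normalization [47]
    def forward(self, x):                                    # x: (B, C, L)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class TCNBlock(nn.Module):
    """Residual block: two causal convolutions plus a 1x1 residual connection."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, p_drop: float = 0.05):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(c_in, c_out, k), nn.ReLU(), nn.Dropout(p_drop),
            CausalConv1d(c_out, c_out, k), nn.ReLU(), nn.Dropout(p_drop),
        )
        self.res = nn.Conv1d(c_in, c_out, 1)    # Conv_{1x1} residual projection
    def forward(self, x):
        return torch.relu(self.net(x) + self.res(x))
```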
In Transformer-based models, attention mechanisms capture the relevance between different elements but neglect temporal order. This inherent ignorance of temporal information can degrade the model's performance on sequence tasks. To address this issue, we use sine and cosine functions as positional encoding to add sequential information to the time-series inputs, which can be formulated as follows:

$$PE(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d_{\mathrm{model}}}}\right)$$

where $t$ is the position at time $t$ and $2i, 2i+1 \in \{0, 1, \ldots, d_{\mathrm{model}} - 1\}$ index the embedding dimensions.
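For reference, a standard implementation of this sinusoidal encoding might look as follows; the vectorized exp/log form is a common numerical convenience, not a detail taken from the paper.

```python
import math
import torch

def positional_encoding(n_pos: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = torch.zeros(n_pos, d_model)
    t = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(t * div)
    pe[:, 1::2] = torch.cos(t * div)
    return pe   # added to the embedded patches before the encoder
```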
The above operations can be summarized as:

$$X_{\mathrm{embed}} = \mathrm{TCN}(X_{\mathrm{patch}}) + \mathrm{PE}(l_X)$$

where $l_X$ is the length of input $X$, and $\mathrm{TCN}(\cdot)$ and $\mathrm{PE}(\cdot)$ represent the token embedding and positional encoding, respectively.

D. TRANSFORMER ENCODER
As depicted in Fig. 3, our model consists of $N_{\mathrm{encoder}}$ stacked encoder blocks. Each block comprises two sub-layers, each with a residual connection followed by layer normalization: a multi-head attention mechanism and a feed-forward layer. Hence, the output of the $i$-th sub-layer in the $n$-th encoder can be expressed as:

$$X^n_{i+1} = \mathrm{LayerNorm}\big(X^n_i + \mathrm{Sublayer}_i(X^n_i)\big)$$

where $X^n_i$ and $X^n_{i+1}$ denote the input and output of the $i$-th sub-layer in the $n$-th encoder, and $i \in \{1, 2\}$ indexes the attention mechanism and the feed-forward layer, respectively.
In the first sub-layer, we select the multi-head self-attention mechanism to capture the dependencies between different points; the structure is shown in Fig. 6. The embedded input is linearly projected into queries $Q_h \in \mathbb{R}^{N \times d_q}$, keys $K_h \in \mathbb{R}^{N \times d_k}$, and values $V_h \in \mathbb{R}^{N \times d_v}$ for each head $h \in \{1, 2, \ldots, H\}$, where $d_q = d_k$. For the $h$-th head, the weights on the values are obtained from the dot-product between $Q_h$ and $K_h$. A large value of $d_k$ inflates the dot-products, which causes small gradients after the softmax function. To counteract this effect, the attention scores are rescaled by dividing by $\sqrt{d_k}$. The output of the $h$-th scaled dot-product attention can be formulated as:

$$\mathrm{head}_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h$$

The outputs of all attention heads are concatenated and projected by $W^O$ to obtain the final output of the multi-head attention, which can be calculated as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O$$

where $W^O \in \mathbb{R}^{H d_v \times d_{\mathrm{model}}}$ is a trainable projection parameter matrix.
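The scaled dot-product step can be sketched in a few lines; the batched tensor shapes are an assumption for illustration.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (H, N, d_k), V: (H, N, d_v); rescale by sqrt(d_k) before softmax."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (H, N, N) attention scores
    return torch.softmax(scores, dim=-1) @ V            # (H, N, d_v) weighted values
```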
The second sub-layer is the feed-forward network, whose internal structure is shown in Fig. 7. It contains two $1\times1$ convolutional feed-forward layers that process each position separately with different convolutional filters. The feed-forward network can be expressed as:

$$\mathrm{FFN}(X) = \sigma(X * W_1 + b_1) * W_2 + b_2$$

where $W_1, b_1$ and $W_2, b_2$ represent the parameters of the first and second convolutional layers, and $\sigma(\cdot)$ is the activation function. After the stacked encoders, the outputs are flattened and passed through the linear layer to obtain the normalized single-load demand forecasting results $\hat{y}^{i\prime} \in \mathbb{R}^{1 \times T}$. Afterward, each normalized single-load forecast is concatenated and de-normalized into the final multi-load forecasting results $\hat{Y} \in \mathbb{R}^{M \times T}$. The pseudocode of the PatchTCN-TST forecasting model is summarized in Algorithm 1.
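Putting the two sub-layers together, a hypothetical encoder block with the convolutional feed-forward network might be sketched as follows, using the tuned $d_{\mathrm{model}} = 512$ and $d_{\mathrm{ff}} = 64$ from Section IV; the head count and post-norm ordering are assumptions.

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Position-wise feed-forward built from two 1x1 convolutions (cf. Fig. 7)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.w2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
    def forward(self, x):                    # x: (B, N, d_model)
        y = x.transpose(1, 2)                # Conv1d expects (B, d_model, N)
        y = self.w2(torch.relu(self.w1(y)))
        return y.transpose(1, 2)

class EncoderBlock(nn.Module):
    """Each sub-layer: residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ConvFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    def forward(self, x):                    # x: (B, N, d_model)
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))
```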

IV. CASE STUDY
In this section, we compare the proposed PatchTCN-TST against other data-driven benchmark models, including LSTM, GRU, TCN, Transformer, Informer, and Autoformer, to verify its performance. To demonstrate the effectiveness of the proposed model, an experimental workflow was established as shown in Fig. 8, which mainly includes preprocessing, data splitting, and model training and evaluation.

A. DATA DESCRIPTION AND PREPROCESSING
The experimental data (CU-BEMS) were obtained from a large-scale seven-story office building in Bangkok, Thailand [48]. Electricity consumption and indoor environmental data were collected at one-minute intervals from 33 zones between July 1, 2018 and December 31, 2019. Each zone includes the electricity consumption of individual lighting, plug loads, and air conditioning (AC) units, along with indoor environmental measurements of ambient light (lux), relative humidity (%), and temperature (°C) recorded by multi-sensors. Fig. 9 (a) and (b) present the floor plans of Floors 1-2 and Floors 3-7, respectively. The red dots on the floor plans denote the installed multi-sensors; Floor 1 has no environmental sensors. In total, the office building contains 55 AC units, 33 lighting loads, 32 plug loads, and 72 sensors, and the corresponding distribution is summarized in Table 1.
Data preprocessing plays an essential role in achieving accurate forecasting owing to the presence of outliers and missing values introduced during measurement, transmission, and storage. Extreme outliers are identified and removed using box plot analysis and then replaced through linear imputation; missing values are also filled by linear imputation. However, Floor 6 is excluded from our research due to its significant amount of missing data. Subsequently, the one-minute interval data are resampled to an hourly interval. Fig. 10 (a) and (b) display the weekly electricity consumption patterns for lighting and plug loads on Floor 5, Zones 1-5. For each zone, the electricity consumption trends for lighting and plug loads are similar on weekdays but with varying magnitudes, while the usage patterns differ slightly on weekends. After that, the first 70% of the data is used for the training set, and the remaining 10% and 20% are used for the validation and test sets, respectively. Additionally, to eliminate the differing magnitudes of the measurements, z-score normalization is applied to rescale the original data:

$$x' = \frac{x - \mu}{\sigma}$$

where $x$ represents the original measurement data, $x'$ is the normalized data, and $\mu$ and $\sigma$ are the mean and standard deviation of each measurement. Moreover, the sliding window method is used to obtain the corresponding historical and future data to match the input and output form of the forecasting model, as shown in Fig. 11. At time $T_d$, the sliding window takes the past $d$ hours of multi-load and environmental data as the input and the future $N$ hours of data as the target. The window slides with stride $s$ until all data are collected.
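A minimal sketch of this sliding-window construction, under the notation above (past d hours in, next n hours out, stride s), could be written as follows; the function name is illustrative.

```python
import numpy as np

def sliding_windows(series: np.ndarray, d: int, n: int, s: int = 1):
    """series: (T, M) hourly multivariate data; past d hours -> next n hours."""
    X, Y = [], []
    for t in range(0, len(series) - d - n + 1, s):
        X.append(series[t : t + d])           # historical input window
        Y.append(series[t + d : t + d + n])   # future forecasting target
    return np.stack(X), np.stack(Y)

data = np.random.rand(1000, 12)               # toy stand-in for the CU-BEMS features
X, Y = sliding_windows(data, d=30, n=3)
print(X.shape, Y.shape)                       # (968, 30, 12) (968, 3, 12)
```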

B. EXPERIMENTAL SETUP AND HYPERPARAMETER TUNING
In this study, we select the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and adjusted Symmetric Mean Absolute Percentage Error (aSMAPE) metrics to evaluate the forecasting performance, expressed as follows:

$$\mathrm{MAE} = \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L} \left| y_{ij} - \hat{y}_{ij} \right|$$

$$\mathrm{MSE} = \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L} \left( y_{ij} - \hat{y}_{ij} \right)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L} \left( y_{ij} - \hat{y}_{ij} \right)^2}$$

$$\mathrm{aSMAPE} = \frac{100\%}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L} \frac{\left| y_{ij} - \hat{y}_{ij} \right|}{\left( \left| y_{ij} \right| + \left| \hat{y}_{ij} \right| \right)/2 + \varepsilon}$$

where $\hat{y}_{ij}, y_{ij} \in \mathbb{R}^{M \times 1}$ represent the predicted and actual values at time $j$, $N$ denotes the number of forecasting samples, $L$ is the multi-step forecasting length, and the positive coefficient $\varepsilon = 1$ is added to avoid instability when both the forecasted and actual values are close to zero.
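These metrics can be computed with a few lines of NumPy, as sketched below; the exact placement of ε in the aSMAPE denominator follows our reading of the description above and is an assumption.

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1.0) -> dict:
    """MAE, MSE, RMSE, and aSMAPE averaged over all forecast points.
    eps = 1 guards aSMAPE when actual and predicted values are near zero."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    asmape = 100.0 * np.mean(
        np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2 + eps))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "aSMAPE": asmape}
```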
In the model training phase, we employ the Adam optimizer with an adaptive learning rate to update the model parameters based on the MSE loss. To avoid overfitting and accelerate training, we apply early stopping with a patience of 3 epochs and set the batch size to 32. Furthermore, to achieve the best forecasting performance, we conduct experiments to investigate the impact of several hyperparameters and identify their optimal values. These hyperparameters include the input length $L_{\mathrm{input}}$, encoder number $N_{\mathrm{encoder}}$, model dimension $d_{\mathrm{model}}$, initial learning rate $lr$, feed-forward network dimension $d_{\mathrm{ff}}$, and dropout rate $r$. The experimental results for the various hyperparameter configurations are presented in Fig. 12. Both MAE and MSE exhibit similar trends as the hyperparameter values vary, except for the number of encoders. As the input length increases, MAE and MSE decrease, indicating improved forecasting performance, with the best results at an input length of 30. Additionally, forecasting accuracy improves with larger model dimensions, although the increased model complexity wastes computational resources and lengthens training. Moreover, as shown in Fig. 12 (d) and (f), the forecasting results deteriorate at higher learning rates and dropout values. Consequently, the combination of optimal hyperparameters, $L_{\mathrm{input}} = 30$, $N_{\mathrm{encoder}} = 2$, $d_{\mathrm{model}} = 512$, $lr = 0.0001$, $d_{\mathrm{ff}} = 64$, $r = 0.05$, is utilized in our proposed model.
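The early-stopping rule with a patience of 3 epochs might be implemented as in the minimal sketch below; checkpointing of the best weights is an added convenience, not a detail stated in the paper.

```python
import copy

class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience: int = 3):
        self.patience, self.best, self.counter = patience, float("inf"), 0
        self.best_state = None
    def step(self, val_loss: float, model) -> bool:
        if val_loss < self.best:
            self.best, self.counter = val_loss, 0
            self.best_state = copy.deepcopy(model.state_dict())  # keep best weights
            return False
        self.counter += 1
        return self.counter >= self.patience   # True -> stop training
```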

V. COMPARATIVE ANALYSIS
To verify the forecasting performance of PatchTCN-TST, six deep learning models are evaluated for comparison: LSTM, GRU, TCN, Transformer, Informer, and Autoformer. The hyperparameters of the comparison models are listed in Table 2. Concretely, LSTM and GRU are recurrent models, and TCN is a variant of the convolutional architecture that allows parallel computation. Transformer, Informer, and Autoformer follow the encoder-decoder structure with different attention mechanisms. All of the mentioned models are trained and tested on an Intel Xeon CPU, 64 GB RAM, and an NVIDIA A100 40GB GPU using Python 3.8.18 with the PyTorch 2.0.1 framework.
Fig. 13 (a)-(c) present a comprehensive evaluation of the models' forecasting capabilities across different prediction horizons: one-step-ahead, two-step-ahead, and three-step-ahead predictions, respectively. As evidenced by Fig. 13, the minimal metric error values of PatchTCN-TST indicate that our proposed model surpasses the other benchmark models across all three scenarios and all metrics. This can be attributed to the channel-independent strategy and the patching operation. For two-step-ahead predictions, the forecasting performance of the TCN model is better than that of the other compared models. The recurrent models, namely LSTM and GRU, obtain the highest error rates across all three forecasting horizons. Meanwhile, the encoder-decoder models, Transformer, Informer, and Autoformer, demonstrate similar errors in each scenario, which might be inappropriate for MTL. Quantitatively, compared to the best benchmark model, PatchTCN-TST yields 34%, 23%, 12%, and 36.4% averaged improvements on MAE, MSE, RMSE, and aSMAPE, respectively. Furthermore, the detailed evaluation results for each floor across all forecasting horizons are listed in Tables 3-5, with the best results highlighted in bold. Among the three forecasting scenarios, our model consistently outperforms the others across different floors. In Table 3, the PatchTCN-TST model demonstrates superior performance on Floor 2, which has the most features, compared to other floors, indicating that the forecasting capability of our model is unaffected by the number of features. In contrast, LSTM and GRU obtain similar errors across the floors but generate the worst results, especially on Floor 2, suggesting potential difficulties in handling multi-channel data. The TCN and Transformer-based models improve on this because of their more sophisticated structures and residual connections. Generally, as the prediction horizon increases, the forecasting accuracy diminishes accordingly. From the two-step-ahead to three-step-ahead forecasting results in Tables 4 and 5, the metric errors of our model and the Transformer-based models increase slightly. Moreover, the three-step-ahead forecasting performance on Floor 5 of PatchTCN-TST, Transformer, and Informer is better than their two-step-ahead forecasting.
To present the forecasting results, Figs. 14 and 15 show the one-step-ahead power forecasting for AC1 and the indoor temperature over two weeks under the different forecasting models. Among the colored curves, the bold purple and red curves represent the actual data and the forecasting results of our proposed model, respectively. From the two-week AC1 power utilization in Fig. 14 (a), the electricity consumption follows a daily period with minor variations in magnitude. Among all forecasting results, our proposed PatchTCN-TST model is more accurate than all comparison models. The recurrent models and TCN follow the daily trend but cannot precisely predict the power magnitude, overestimating and underestimating the power demand, especially on the 5th, 6th, 7th, 9th, and 10th days in Fig. 14 (a). For the Transformer-based models, the overestimated predictions disappear in Fig. 14 (b), but underestimates remain on the 9th and 10th days. Moreover, undesired fluctuations occur when AC1 is powered off, with Autoformer exhibiting the most severe ones. The corresponding indoor temperature forecasting results for Floor 2, Zone 1 are shown in Fig. 15. The temperature varies within a certain range and decreases when the air conditioner powers on. The predictions of our model closely follow the actual temperature trend, achieving a 0.067 MSE score over the plotted two weeks. The TCN model also produces admirable temperature forecasts, reaching a 0.0889 MSE score, which is superior to the other comparison models.
Figs. 16-19 depict randomly chosen two-week forecasting results under the two-step-ahead and three-step-ahead forecast horizons. Overall, the six benchmark models demonstrate drastic deviations from the actual data, while the PatchTCN-TST model still follows the real variation. For two-step-ahead forecasting, the recurrent models and TCN show sharp fluctuations during the power-off period in Figs. 16 and 17, which do not occur in one-step-ahead forecasting. In Fig. 16 (b), the forecasting results of Informer and Autoformer indicate that these models cannot capture the weekly energy usage pattern. Over these two weeks of predictions, the MSE values of the Informer and Autoformer models are more than ten times higher than that of PatchTCN-TST. From Figs. 18 and 19, the forecasting drawbacks of the benchmark models are more evident than in two-step-ahead forecasting, especially during the power-off period.

VI. CONCLUSION
This paper proposes an encoder-only PatchTCN-TST model based on the channel-independent strategy for floor-level multi-load and indoor environmental forecasting in an office building in Bangkok, Thailand. The experimental results of hourly forecasting, ranging from 1-h to 3-h ahead, demonstrate that the proposed PatchTCN-TST outperforms six baseline models with smaller evaluation errors in terms of MAE, MSE, RMSE, and aSMAPE. Based on the forecasting performance and evaluation metrics among the different models, the superiority of PatchTCN-TST can be summarized as follows:
1) The channel-independent strategy enables the model to predict each variable separately while sharing parameters across the multivariate forecast.
2) The patching operation segments the input sequence into several sub-series, which enhances the extraction of local dependencies and reduces the computational complexity of the attention mechanism.
3) The TCN block is used to extract temporal features in the embedding stage rather than the original CNN.
The poor performance of the baseline models indicates that the channel-independent strategy is the main contributor to our methodology. Most models, especially previously proposed Transformer-based models, primarily focus on multi-step-ahead forecasting while neglecting the impact of mixed channel information, which significantly hinders performance in multivariate forecasting.
As our method achieves favorable results in multivariate forecasting, one of its most pivotal aspects is the utilization of the channel-independent strategy. On the other hand, the channel-independent strategy disregards inter-channel relationships, such as the strong correlation between AC power and temperature. In future work, we plan to enhance our models by considering highly correlated features rather than relying solely on a channel-independent approach.

FIGURE 1. The framework of single-task learning.

FIGURE 2. The framework of multi-task learning.

FIGURE 8. The flowchart of our experiments.


Algorithm 1 Pseudocode for the PatchTCN-TST Model
Input: mini-batch B of historical data of M multi-load and indoor environmental features over the previous L time steps, X ∈ R^{B×M×L}
Output: mini-batch B of predicted data for the future T time steps of the corresponding M features, Ŷ ∈ R^{B×M×T}
1: Instance normalization: calculate the mean µ_x and standard deviation σ_x, µ_x, σ_x ∈ R^{B×M×1}; normalize X' = (X − µ_x)/√(σ_x + ε), X' ∈ R^{B×M×L}
2: Patching: pad the boundary value on the right side with size S, X'_padding ∈ R^{B×M×(L+S)}; segment X'_padding with patch length P and stride S to obtain the N-patch sequence
3: Embedding: apply TCN token embedding and positional encoding to obtain the embedded patches in R^{(B·M)×d_model×N}
4: Transformer encoders: for each encoder i ∈ [E] and each head h ∈ [H] (hyperparameters: encoder number E, head number H, query/key/value dimensions d_q = d_k, d_v), apply multi-head attention and the feed-forward layer
5: Reshape and flatten the encoder outputs, then apply the linear layer to obtain Ŷ'
6: De-normalization: revert Ŷ' using the mean µ_x and standard deviation σ_x from Step 1
7: return forecasting results Ŷ ∈ R^{B×M×T}


FIGURE 9. The floor plans of the smart office building.

TABLE 1. The distribution of AC units, lighting, plug loads, and sensors on each floor.

FIGURE 10. The electricity consumption of lighting and plug load in one week.

FIGURE 11. Illustration of the sliding window method.

FIGURE 12. The results of hyperparameter selection: (a) input length, (b) encoder number, (c) model dimension, (d) initial learning rate, (e) feed-forward network dimension, and (f) dropout rate.

FIGURE 13. Evaluation results of forecasting in different models.

FIGURE 14. Comparative results of one-step-ahead AC1 power forecasting in two weeks on Floor 2, Zone 1.

FIGURE 15. Comparative results of one-step-ahead temperature forecasting in two weeks on Floor 2, Zone 1.

FIGURE 16. Comparative results of two-step-ahead light power forecasting in two weeks on Floor 4, Zone 1.

FIGURE 17. Comparative results of two-step-ahead AC1 power forecasting in two weeks on Floor 4, Zone 4.

FIGURE 18. Comparative results of three-step-ahead light power forecasting in two weeks on Floor 5, Zone 1.

FIGURE 19. Comparative results of three-step-ahead temperature forecasting in two weeks on Floor 5, Zone 4.

TABLE 2. The hyperparameters of comparison models.

TABLE 3. Performance evaluation of one-step-ahead multi-load and indoor environmental forecasting in different models.

TABLE 4. Performance evaluation of two-step-ahead multi-load and indoor environmental forecasting in different models.

TABLE 5. Performance evaluation of three-step-ahead multi-load and indoor environmental forecasting in different models.