Short-Term Energy Consumption Forecasting at the Edge: A Federated Learning Approach

Residential short-term energy consumption forecasting plays an essential role in modern decentralized power systems. The rise of innovative prediction methods able to handle the high volatility of users' electrical load has laid the basis for accomplishing this task. However, these methods, which mostly rely on Artificial Neural Networks, require that a huge amount of users' fine-grained, sensitive consumption data be centrally collected to train a generalized forecasting model, with implications for privacy and scalability. This paper proposes an innovative architecture specifically designed to overcome this need. By exploiting Federated Learning and Edge Computing capabilities, many Long Short-Term Memory (LSTM) models are locally trained by different users based on their own historical energy consumption samples. Such models are then aggregated by a specific-purpose node to build a generalized model that is re-distributed for improved forecasting at the edge. For better forecasting, our proposed local training procedure takes as input relevant features related to calendar (i.e., hour, weekday and average consumption of previous days) and weather conditions (i.e., clustered apparent temperature), and the architecture can group users according to consumption similarities (using K-means) or socioeconomic affinities. We thoroughly evaluate the approach through simulations, showing that it can achieve forecasting performance similar to that of a state-of-the-art centralized solution in terms of Root Mean Square Error (RMSE), but with up to an order of magnitude lower training time and up to 50 times less exchanged data when samples are recorded at finer granularity than one hour. Moreover, it keeps sensitive data local and therefore guarantees users' privacy.


I. INTRODUCTION
Energy consumption (or load) forecasting is a fundamental task for sustainable planning, production and transport of electricity in modern power systems [1]. While long-term load forecasting is an important means to support the long-term planning and evolution of the power system, short-term forecasting is essential to facilitate its operations [2]. In particular, short-term load forecasting at the residential customer level is becoming more and more attractive with the ever increasing decentralization of renewable electricity generation (e.g., by photovoltaic systems) and of the electricity market, being a powerful tool that households can exploit to maximize their self-sufficiency. However, residential demand is very volatile and quickly fluctuates over time; for this reason, this task is much more challenging than long-term forecasting or aggregated short-term forecasting [3].
In recent years, many works have addressed this challenge by proposing Artificial Intelligence (AI) methods with the explicit goal of enhancing short-term residential load forecasting as much as possible [4]. However, most of the proposed AI methods naturally require significant amounts of fine-grained historical data [5] that have to be collected and stored by the power system, in centralized locations, for accurate model training. This operation is nowadays eased by the Advanced Metering Infrastructure (AMI), currently being deployed in many countries all over the world [6]: the AMI relies on devices called smart meters, which are deployed at customers' premises and enable the recording of fine-grained energy consumption measurements with sampling intervals of 15 minutes or more.
Even though such technological advancement unlocks new services for both users and energy suppliers [7], many privacy concerns have been raised [8] by both regulation authorities and users. In fact, the collected information can be used to disclose customers' habits [9] and, in some countries, users can refuse the installation of a smart meter. Various privacy-preserving solutions relying on data aggregation and/or obfuscation have been proposed to ensure privacy while guaranteeing the collection of some valuable information by the energy suppliers or third parties [8], but unfortunately they are incompatible with the proposed solutions for residential short-term load forecasting, which all need households' fine-grained measurements as input data. No less important, existing AI-based solutions are computationally intensive in the model training phase, and their adoption is far from scalable when data coming from millions of smart meters are expected to be used in that phase. A possible alternative is training the AI model on a data subset, but this unavoidably impacts the model's generalization capabilities.
Edge Computing [10] is a distributed paradigm that has been introduced to offload computation closer to the end users. It extends Cloud Computing towards the edge of the network, with tremendous advantages in terms of speed, efficiency, reliability, privacy, security and scalability. Edge Computing has driven the emergence of countless innovative applications, especially in the Internet of Things (IoT) domain [11], and appears to be the most natural choice to overcome the privacy and scalability issues that occur when an AMI is used to collect and process sampled energy consumption measurements. However, some critical aspects must be taken into account to exploit this paradigm for short-term distributed load forecasting, mainly concerning the inherent nature of existing AI methods, as they assure model generalization only if centrally trained on data collected from many different users. This is in contrast with the most basic Edge Computing best practice, according to which sensitive data should be kept distributed and locally processed.
Recently, Federated Learning (FL) [12] has been proposed to bridge the gap between AI and Edge Computing. Federated Learning is a distributed machine learning approach where a shared global model is trained, under the coordination of a central entity, by a federation of participating devices. The peculiarity of the approach is that each device trains a local model using its own data, which never leave the edge: only model parameters are sent to the central entity for updating the shared global model. Federated Learning unlocks the full potential of Edge Computing for machine learning purposes, and has already been successfully adopted in some application domains [13], such as human-computer interaction [14], language modeling [15], healthcare [16] [17] [18], transportation [19] [20] and Industry 4.0 [21], where privacy and/or scalability aspects are fundamental.
In this paper we propose a short-term distributed load forecasting architecture based on Edge Computing and Federated Learning. According to our architecture, data collected by the smart meters are kept at the edge and used to collaboratively train a global model that, once re-distributed to the customers, enhances their forecasting capabilities, since it is able to identify locally unseen patterns that have nevertheless been experienced by other users. The architecture relies on a Long Short-Term Memory (LSTM) [22] neural network trained following the Federated Learning schema. The neural network takes as input, in addition to energy consumption samples, other features related to calendar and to local weather conditions, namely weekday, hour, average consumption in the previous days and clustered apparent temperature, which are proven to enhance the overall training procedure. The architecture also envisions the possibility of clustering similar users (e.g. according to socioeconomic aspects or to consumption similarities) with the goal of further reducing the prediction errors.
We thoroughly evaluate our approach and show that it achieves forecasting performance on par with an analogous AI-based strategy requiring data centralization, but outperforms the latter in terms of model training time, offering a far more scalable and inherently privacy-aware alternative. In addition, our strategy leads to reduced communication overhead when energy consumption measurements are recorded at fine granularity (i.e., finer than one hour). We also show that users' clustering is effective, and that our proposal is more effective than other simple privacy-preserving schemes, as it is more adaptable to rapidly-changing consumption patterns.

A. KEY CONTRIBUTIONS
The key contributions of this paper are the following:
• We propose a novel LSTM-based architecture for short-term load forecasting based on Federated Learning and Edge Computing.
• We define two approaches for users' clustering, based on socioeconomic and consumption similarities, to improve forecasting. The former is strictly related to the dataset adopted in this study, while the latter exploits the K-Means algorithm.
• We analyze an existing dataset to extract relevant calendar- and weather-related attributes to be used during local training to disclose seasonality patterns.
• We evaluate our proposed strategy in terms of forecasting performance, scalability and communication overhead against a centralized state-of-the-art solution.
• We preliminarily investigate the benefits of users' clustering and of calendar- and weather-related features on local training.
• We evaluate the impact of some tuning parameters (i.e., traces used for pre-training, number of selected nodes per round, amount of training data) on our strategy's performance.
• We compare our strategy with two other simple privacy-preserving schemes.

B. ORGANIZATION OF THE PAPER
The remainder of the paper is structured as follows. Section II recalls the related work, while Section III introduces some background notions. Section IV reports the system architecture, Section V focuses on data pre-processing and on feature extraction, while Section VI provides details on LSTM model configuration. The performance of our proposal is evaluated in Section VII, while Section VIII concludes the paper and outlines future work. Figure 1 reports a schematic overview of the paper: the arrows indicate inter-dependencies among (sub)sections.

II. RELATED WORK
Short-term energy consumption forecasting has been a hot topic for many years and different techniques have been proposed to enhance prediction performance, ranging from leveraging spatial correlation [23] to exploiting enriched temporal data [4]. Another fundamental aspect is the right choice of the most suitable methodologies and models. Surveys [24] [25] [26] review the most relevant papers on energy consumption forecasting, with focus on data-driven models.
From an analysis of the reported works, it emerges that most of them focus on non-residential scenarios and that the most widely adopted models (or methodologies) are Artificial Neural Networks (ANNs) [38]. Paper [39] compares seven different techniques for energy consumption forecasting, including ANN and Least Squares Support Vector Machine (LS-SVM). The models are trained with energy consumption measurements collected every 15 minutes and energy consumption is predicted for the next hour. The reported results show that ANN generally leads to better accuracy than LS-SVM in the case of load forecasting for commercial buildings, but not for residential ones. The main reason, as shown in [28], is that ANN forecasting performance is highly dependent on how parameters and hyperparameters are set, with the risk of falling into local minima during the training process. However, exploiting energy-consumption-related information such as activity [26] or behavioral [40] patterns can help mitigate the aforementioned issue.
Deep Learning has also been widely adopted for load forecasting. One of the most relevant works in this context is [31], which proposes an approach based on Conditional Restricted Boltzmann Machine (CRBM). The authors show that their method outperforms state-of-the-art ANN and SVM techniques. Similarly, in [32] a polling-based Deep Recurrent Neural Network is proposed, with the goal of preventing (or strongly limiting) overfitting. The approach outperforms traditional solutions such as ARIMA, Support Vector Regression and Recurrent Neural Network (RNN) approaches. Both these works show very promising results, but Deep Learning methods are usually too computationally intensive to be adopted at the edge by resource-constrained devices. Additionally, in [41] the authors propose an online adaptive method, able to continuously learn from newly arriving data and adapt to new patterns.
Recently, Long Short-Term Memory [22], an advanced and more flexible RNN architecture, has gained momentum. Being very well-suited for time-series prediction, it is the most natural candidate method to be adopted in the context of short-term load forecasting. One of the first works that adopts and assesses LSTM in this domain is [42]. The paper reveals that LSTM outperforms conventional backpropagation ANNs, since it is better able to capture long-term temporal correlations. Another recent approach based on LSTM is [43]. The proposed solution is very flexible: it is able to forecast energy consumption with good accuracy, even when the LSTM model is used for load prediction of residential houses whose historical samples have not been included in the training set. However, such flexibility can be guaranteed only if a huge amount of training data are used, which makes the approach computationally demanding. Some other papers propose advanced LSTM-based solutions: paper [44] applies wavelet decomposition to input data to remove unnecessary details, paper [45] proposes a hybrid LSTM and Convolutional Neural Network (CNN) model, while paper [46] focuses on joint load and price forecasting.
All the works recalled above share the main goal of improving existing prediction models to get the best possible forecasting performance, mainly considering the smart meters' energy consumption measurements as the sole input feature in the training phase. However, they do not investigate how to further reduce prediction errors by exploiting (i) clustering and/or (ii) additional relevant features, which is instead the goal of the following works.
Clustering as a means to improve load forecasting has been investigated in [47] [48] [49]. These works show that grouping residential customers according to similarity in their energy consumption patterns is beneficial. In particular, [47] evaluates clustering when adopted in combination with different prediction models. It is shown that RNNs (and in particular LSTMs) guarantee significantly improved performance if they are trained by only using data from customers that belong to the same cluster. Instead, paper [49] groups energy customers into two clusters (low energy customers and high energy customers) and shows that this positively impacts forecasting performance for both clusters. Some additional relevant features can also be used as input in the training phase. In the literature, the most relevant are weather- and calendar-related features [53]. Features such as temperature, humidity (weather-related), day of the week or month of the year (calendar-related), if given as input in the training phase, can drive it to output more generalized models. Although the machine-learning-based solutions recalled so far are different from many points of view, they all have a common architectural aspect: they require a centralized entity that collects energy consumption measurements from the customers to centrally train a global model. In other words, all the works implicitly or explicitly adopt an architecture like the one depicted in Fig. 2 that, from now on, we will call centralized architecture and consider as benchmark.
Only one very recent work can be found in the literature that, similarly to this paper, adopts Federated Learning for load forecasting at the edge [54]. Like that work, we also rely on LSTM and envision the possibility of distributing the training burden among multiple edge nodes. However, paper [54] considers energy consumption as the sole input feature of the training process, while we include additional features related to weather and calendar, and we propose a strategy that relies on customers' clustering. These two advancements are shown to enhance forecasting accuracy. Additionally, for the first time, we carry out an extensive performance comparison with the centralized approach shown in Fig. 2, adopted in all the other previous works.

III. BACKGROUND
In this Section we briefly recall the most important concepts related to LSTM and Federated Learning.

A. LONG SHORT-TERM MEMORY
A Long Short-Term Memory is a machine learning model belonging to the family of ANNs or, more specifically, of RNNs. It is widely adopted to solve sequence classification problems and is very suitable for time-series prediction. Before describing LSTM in more detail, we briefly recall the structure of ANNs and RNNs, highlighting their differences.
An ANN includes different neurons organized in sequential layers. A neuron is the atomic unit of an ANN: it applies a function to the input data (usually a weighted sum) and then passes the obtained value through an activation function (that is, a threshold function). The result is then forwarded to other neurons. Each neuron is associated with multiple weights w_j (where j is the index of each weight) that are adopted in the weighted sums and related to neurons' interconnections. The goal of the training phase of an ANN is to find the most appropriate model weights w_j, so as to maximize the performance of the network on its specific task. In an ANN, the output of a neuron in a layer is always used as input to one or more neurons in the next layer. This feeding mechanism is called feed-forward. An ANN includes three types of layers: an input layer, which receives the input data and performs a first elaboration step; one or more hidden layers, whose neurons elaborate input from previous layers and forward the result to the next layers; an output layer, whose duty is to produce the final result.
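The neuron and feed-forward mechanics described above can be sketched in a few lines of Python; this is a minimal illustration, where the function names and the choice of a sigmoid activation are ours, not taken from a specific library:

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs, then an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation (one possible choice)

def feed_forward(layer_inputs, layers):
    """Feed-forward pass: each layer's outputs feed the next layer's neurons.

    `layers` is a list of layers; each layer is a list of (weights, bias)
    pairs, one pair per neuron.
    """
    values = layer_inputs
    for layer in layers:
        values = [neuron(values, w, b) for w, b in layer]
    return values
```

Training an ANN amounts to adjusting the `weights`/`bias` values (the w_j of the text) so that `feed_forward` minimizes a loss on the training data.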
An RNN is a generalized ANN where neurons in a layer can also be interconnected with neurons of previous layers, with neurons of the same layer, or with themselves in a loop. An important feature of RNNs is that they maintain and use an internal state to capture dynamics over time t. However, typical RNNs have only short-term memory capabilities.
An LSTM [22] is a specific and more complex type of RNN, whose elementary unit is called an LSTM cell (instead of neuron) and which has long-term memory capabilities. In an LSTM cell the internal state is modelled by two vectors: h(t) is the short-term state, always equal to the cell output y(t) at any instant t, while c(t) is the long-term cell state, which is maintained and updated over time. Each LSTM cell includes three gates, whose duty is to add/remove information to/from the cell state c(t) and to compute the cell output y(t): the input gate decides what information should be kept from the current input x(t) to compute the current state c(t); the forget gate decides what information should be kept and what information should be thrown away from the previous state c(t − 1) to compute c(t); the output gate finally computes the output y(t) (i.e., the short-term state h(t)); both h(t) and c(t) are given as input to the cell at the next time instant t + 1.
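The gate logic above can be sketched, for scalar states, as follows; the weight layout and names are illustrative simplifications (real LSTM implementations operate on vectors and weight matrices):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_cell_step(x, h_prev, c_prev, p):
    """One time step of an LSTM cell with scalar states.

    `p` holds one (w_x, w_h, b) weight triple per gate, keyed by
    'f' (forget), 'i' (input), 'g' (candidate state) and 'o' (output).
    """
    f = sigmoid(p['f'][0] * x + p['f'][1] * h_prev + p['f'][2])    # forget gate
    i = sigmoid(p['i'][0] * x + p['i'][1] * h_prev + p['i'][2])    # input gate
    g = math.tanh(p['g'][0] * x + p['g'][1] * h_prev + p['g'][2])  # candidate info
    o = sigmoid(p['o'][0] * x + p['o'][1] * h_prev + p['o'][2])    # output gate
    c = f * c_prev + i * g   # long-term state c(t): forget old, add new
    h = o * math.tanh(c)     # short-term state h(t), equal to the output y(t)
    return h, c
```

Both returned values, h(t) and c(t), are fed back into the cell at the next time instant t + 1, which is what gives the LSTM its long-term memory.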

B. FEDERATED LEARNING
Federated Learning is used to train a machine learning model in a distributed way: different remote devices use their own collected data to carry out a local training procedure, and a centralized server is then in charge of aggregating the trained local models into a global model, which is in turn re-distributed to the remote devices for a further training round. Following this iterative process, an arbitrarily high number of devices can contribute to model training without the need to transfer the collected data to a centralized location, since only locally-trained models need to be sent. Federated Learning has been proven to work well even when the remote devices train the model using non independent and identically distributed (non-iid) data. Moreover, it has been demonstrated to bring benefits in terms of amount of exchanged data with respect to a solution requiring data transmission to a single location for centralized training [12]. In the following, we provide some information on the training workflow and on the local model aggregation procedure.

1) Training workflow
While a detailed description of a typical FL training workflow is reported in [55], here we just recall the main aspects. Time is slotted into rounds and two entities are involved in the workflow: the end devices and the centralized server. Specifically, the following steps are carried out (see Fig. 3):
a) The server selects a subset of end devices to participate in the current round.
b) The server sends the current global model to the selected devices.
c) The edge devices train the received model using their local data.
d) The edge devices send the trained local model back to the server, in the form of updated weights.
e) The server aggregates all the received local models (i.e., it computes the new weights from the updated weights received from the end devices) and generates an updated global model.
How end devices are selected in each round depends on the specific application, as thoroughly investigated in [56].
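One round of this workflow can be sketched as below; the function names, the pluggable `local_train` callback and the plain (unweighted) averaging in step (e) are illustrative simplifications of the general scheme:

```python
import random

def fl_round(global_weights, clients, local_train, fraction=0.5, seed=None):
    """One Federated Learning round:
    (a) select a subset of devices, (b) send them the global model,
    (c) each device trains locally, (d) devices return updated weights,
    (e) the server averages the updates into a new global model.
    """
    rng = random.Random(seed)
    k = max(1, int(len(clients) * fraction))
    selected = rng.sample(clients, k)                      # step (a)
    updates = [local_train(list(global_weights), data)     # steps (b)-(d)
               for data in selected]
    n = len(updates)                                       # step (e): average
    return [sum(u[j] for u in updates) / n
            for j in range(len(global_weights))]
```

In a real deployment, steps (b)-(d) involve network exchanges with remote devices; here `local_train` stands in for the on-device training procedure.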
Another key aspect is how aggregation at the centralized server occurs. In the following, we will briefly recall the most widely-adopted aggregation procedure, called Federated Averaging (FedAVG).

2) Federated Averaging
FedAVG, introduced in [12], aims at minimizing the global loss function. FedAVG is a generalization of another aggregation procedure, called FedSGD [12]. In FedSGD, in each round, the selected end devices perform a single step of Stochastic Gradient Descent (SGD) and send the obtained model weights to the server. The server then averages the received weights proportionally to the number of locally-used training samples: this operation can be seen as a gradient descent step on the global model. One of the drawbacks of such an approach is that, by executing only a single SGD step per round, global model training converges slowly.
FedAVG overcomes this limitation by introducing the concepts of local epoch and of batch. In each round, multiple local epochs are executed on batches (i.e., subsets of the local data): for each local epoch and batch an SGD step is done and, in this way, less server-device interaction is needed. Then, the server averages the received weights proportionally to the number of locally-used training samples, as done by FedSGD. Three parameters thus need to be carefully tuned to achieve the best possible FedAVG performance:
• C: fraction of end devices selected in each round;
• E: number of local epochs executed in each round;
• B_size: amount of local samples that are used for training in each round/epoch (batch size).
The reader should refer to Tab. 3 for a recap of all the parameters and symbols adopted in this paper.
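The server-side averaging step shared by FedSGD and FedAVG can be sketched as follows; this simplification treats each local model as a flat list of weights:

```python
def fedavg_aggregate(local_weights, sample_counts):
    """Average local model weights proportionally to the number of
    training samples each device used, as recalled above.

    `local_weights` is a list of weight lists (one per device);
    `sample_counts` gives the number of local samples used by each device.
    """
    total = sum(sample_counts)
    size = len(local_weights[0])
    return [sum(w[j] * n for w, n in zip(local_weights, sample_counts)) / total
            for j in range(size)]
```

A device that trained on twice as many samples thus pulls the global model twice as strongly towards its local solution.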

IV. SYSTEM ARCHITECTURE
In this section we introduce our federated short-term energy consumption forecasting architecture (in brief, federated architecture, Fig. 4). The proposed architecture makes it possible to accurately forecast households' energy consumption by collaboratively training an LSTM network. Local LSTM networks are trained by a multitude of edge devices through locally-generated energy consumption measurements, while centralized model aggregation and redistribution procedures, as envisioned by Federated Learning, are adopted to obtain a trained global LSTM network. Two entities are considered in our architecture:
• Energy company: it supplies electricity to residential customers. It may rely on a Smart Grid [57] for enhanced energy supply.
• Residential customers: they consume electricity in their houses and are billed by the energy company.
Three functional nodes, defined by taking inspiration from already-existing functional nodes of Federated Learning (see Section III-B), are included and interact:
• Smart Meter: it is owned by the energy company and installed in every user's premises. It records energy consumption measurements (i.e., samples) at very fine granularity (up to one per minute) [58].
• Edge Computing Node: it is either a specific-purpose (e.g. a GPU board) or a general-purpose (e.g. a PC) computing node that can be owned by the energy company (and installed at the customer's premises) or by the customer herself. It is connected to the Smart Meter through the most appropriate wireless (e.g. Zigbee) or wired (e.g. Ethernet) network connection and is responsible for storing the energy consumption measurements captured by the Smart Meter. Additionally, it is either connected to outdoor sensors able to capture weather variables (e.g. relative humidity, temperature, wind speed, etc.) or able to retrieve less accurate data from external weather data repositories. The Edge Computing Node is in charge of local model training, using the collected data recalled above and, additionally, calendar information (more details on the considered input features will be given in Section V). It is also connected to the Aggregator (see below) for model exchange, and can interact with the customer's end devices (e.g. a smartphone) by means of specific-purpose applications used for data visualization and as an end point for the alarms generated by the Edge Computing Node (see Section IV-B).
• Aggregator: it is a centralized server owned by the energy company. Its main responsibility is to gather the local models at the end of each round, as trained by the multiple Edge Computing Nodes, and to aggregate them. The output of this operation is a global model, which is then re-distributed to the Edge Computing Nodes. The Aggregator communicates with the Edge Computing Nodes through encrypted end-to-end connections.
Note that the proposed architecture could also be adopted if other entities (e.g. a statistical institute) instead of the energy company are involved. In this case, the customer should be equipped with an energy meter sensor connected to her magnetothermic switch, as she could not directly access the data collected by the Smart Meter.

A. MODEL TRAINING PROCEDURE
The proposed training procedure adapts the workflow shown in Section III-B1: Fig. 4 reports the steps as numbered and described in that Section, while Algorithm 1 details the training operations. FedAVG is used for model update and aggregation (see [12]). With respect to the workflow reported in Section III-B1, we envision the possibility, once the LSTM model is initialized and before starting the FL-based training procedure, of centrally pre-training the LSTM with some available samples collected by the energy company, e.g. from users that, in exchange for discounted energy bills, are keen to share them. Pre-training will be evaluated in Section VII.
The training workflow could potentially be executed continuously and indefinitely, so that the obtained model can be constantly refined (i.e., re-trained) with new data collected from the Smart Meters. A sliding window (e.g. of 365 days) can be implemented, so that overly old historical data are not considered when re-training. Given the potentially high number of Edge Computing Nodes involved in the process (i.e., millions), if a low fraction C of them is randomly chosen in each round (e.g. C = 10^-4), the computational burden on each Edge Computing Node can be kept low. To further reduce the computational effort, the training procedure can be executed periodically (e.g. every few weeks) for a limited number of rounds N_round. In Section VII we will show that this is good enough to have an always-updated model for forecasting.
It is also important to specify that the energy company, or any other entity that ensures weights aggregation and training coordination through its Aggregator, has at its disposal an always-updated model, which could be offered as-a-service to third parties or used to finely adapt energy transport and/or production strategies.

B. REAL-TIME ENERGY CONSUMPTION FORECASTING
Once the LSTM network has been trained following the procedure shown in Alg. 1, it can be used for real-time and short-term energy consumption forecasting at the edge. The ability to forecast the load in the near future (e.g. in the next hour) also makes it possible to locally notify customers about any possible consumption anomaly [59]. If Ŷ_t is the forecast value at time t and Y_t the real recorded value, an alert is generated to the customer's application if Y_t − Ŷ_t > T_h, where T_h is a given threshold. Such an alert, occurring when an excessively high consumption has been experienced, warns the customer about this abnormal behavior. Conversely, if Ŷ_t − Y_t > T_l, a notification of good behavior is generated: this kind of notification makes the customer aware of a positive trend in her actual consumption, which is significantly lower than expected.
Such a local notification system can be used as a tool to increase customers' awareness of their good and bad habits so that they can adapt their behavior accordingly. Ideally, a customer should try to minimize the number of received alerts, while maximizing the number of notifications of good behavior. Clearly, setting proper values for T_h and T_l is of paramount importance to avoid the generation of unneeded notifications or to spot exceptional anomalous behaviors. For instance, the values of T_h and T_l may be changed through the customer's application during vacation times: T_l can be raised to avoid false notifications of good behavior, while T_h can be lowered to spot anomalous activities in the house that would not be considered anomalous outside vacation times. How to adjust the thresholds is however out of the scope of this paper.
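The threshold check described above can be sketched as follows (the function and parameter names are ours):

```python
def consumption_notification(y_actual, y_forecast, t_high, t_low):
    """Compare the recorded consumption Y_t against the forecast value,
    as in Section IV-B: an alert when the actual value exceeds the
    forecast by more than T_h, a good-behavior notification when it
    falls below the forecast by more than T_l, nothing otherwise."""
    if y_actual - y_forecast > t_high:
        return "alert"            # abnormally high consumption
    if y_forecast - y_actual > t_low:
        return "good_behavior"    # consumption significantly lower than expected
    return None                   # within the expected band
```

Raising `t_low` (e.g. during vacations) suppresses spurious good-behavior notifications, while lowering `t_high` makes the anomaly detection more sensitive.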

C. CUSTOMERS' CLUSTERING
Forecasting can be improved by exploiting similarities between time series through appropriate clustering. In our architecture we foresee this possibility: for instance, by means of aggregate energy consumption data (e.g. those used for billing purposes) or by means of socioeconomic aspects (if available), the energy company can group customers in k different clusters. If this is done, the proposed architecture just needs to be replicated k times, with k different LSTM networks that have to be concurrently trained and with the Aggregator that is in charge of customers' clustering and of initializing and correctly updating the k LSTM networks.

V. DATA ANALYSIS AND PRE-PROCESSING
In this Section we describe and analyze how data are pre-processed before they are given as input to the training process. We also show how energy consumption measurements can be enriched with additional attributes, with the final goal of improving forecasting performance. To make the analysis and pre-processing tasks concrete, in this work we consider a set of fine-grained energy consumption measurements taken from a large dataset collected and published by the energy company UK Power Networks [60]. The dataset includes data collected by 5,567 Smart Meters in London between November 2012 and February 2014, expressed in kWhh (kWh per half-hour) and with a granularity of 30 minutes.
Two peculiarities of the dataset mainly motivate its adoption in this work. First, each trace (i.e., time series collected by a Smart Meter) is associated with a category as specified by the "A Classification Of Residential Neighbourhoods" (ACORN) standard [61]. ACORN is a consumer classification that segments the UK population into different demographic types according to social factors and population behaviors; demographic types are then grouped into 18 different macro-groups. Second, the dataset has already been enriched with different meteorological variables recorded in London [62] and collected through the DarkSky API [63] for the whole covered period. This greatly simplifies the extraction of the weather-related features required by our architecture.
Even though the remainder of the Section is tailored on the UK Power Networks dataset, the proposed procedures can be straightforwardly applied to other available and more recent datasets, such as Dataport [64].

A. DATA SELECTION
We focused on data collected from January 1st, 2013 to December 31st, 2013. Among all the 5,567 Smart Meters, only 4,968 were active in the reference period. We then removed from the set some outliers with abnormal energy consumption patterns: we considered as outliers all those traces with an average consumption that is either lower than 0.6 kWh/hh or higher than 3.6 kWh/hh. These outliers indicate potential errors in the metering activity or are related to empty houses. This leaves 4,329 Smart Meters.
We decided to reduce even further the number of selected Smart Meters to ensure tractability, given the limited amount of computational resources available for our tests. We randomly selected 35% of the traces for each one of the 18 ACORN macro-groups, so that the final dataset preserves original ACORN proportions. The number of considered traces is then reduced to 1,507, distributed among ACORN macro-groups as indicated in the left-hand side of Tab. 1.
Finally, to further reduce the dataset size while preserving seasonality patterns, we aggregated data samples to obtain a trace with coarser granularity of 1 hour: measurements are then expressed in kWh. An example of processed trace is shown in Fig. 5.
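As a concrete illustration, the aggregation from 30-minute kWh/hh readings to hourly kWh can be sketched with pandas (column and variable names are our own, not the dataset's exact schema):

```python
import pandas as pd

# Aggregate 30-minute kWh/hh readings into hourly kWh by summing the two
# half-hour energy values that fall within each hour.
half_hourly = pd.DataFrame(
    {"energy_kwh_hh": [0.10, 0.15, 0.20, 0.25]},
    index=pd.date_range("2013-01-01 00:00", periods=4, freq="30min"),
)
hourly = half_hourly["energy_kwh_hh"].resample("1h").sum()
print([round(v, 2) for v in hourly.tolist()])  # [0.25, 0.45]
```

Since kWh/hh values are energy quantities (not powers), summing the two half-hour samples of each hour directly yields the hourly energy in kWh.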

B. CUSTOMERS' CLUSTERING
As shown in the previous subsection, the UK Power Networks dataset inherently provides a clustering of traces according to customers' behavioral and socioeconomic aspects (ACORN). As an alternative strategy, we decided to cluster the traces only according to consumption similarities: to do so, we adopted the K-Means [65] method. For each of the selected 1,507 traces we computed the following features over all its energy consumption samples (8,760, as we focus on hourly granularity for a whole year), which were provided as input to K-Means:
• Energy consumption median (kWh);
• Energy consumption average (kWh);
• Energy consumption sum (kWh);
• Weekday with highest average energy consumption, encoded by integers from 0 (Monday) to 6 (Sunday);
• Weekday with lowest average energy consumption, encoded by integers from 0 (Monday) to 6 (Sunday);
• Highest recorded energy consumption (kWh);
• Lowest recorded energy consumption (kWh).
These features disclose less information on customers' behavior than the fine-grained traces, and may therefore be shared with third parties such as the energy company. We chose k = 18, so that 18 different clusters are obtained, as in the ACORN-based clustering. In this way, it is possible to fairly compare the two clustering methods. How traces are distributed among clusters by K-Means is reported in the right-hand side of Tab. 1: it is easy to see that K-Means distributes traces more evenly among clusters than ACORN.

C. ADDITIONAL ATTRIBUTES
In this subsection we describe how we enriched the dataset with additional attributes beyond the default ones already available from [62], which are the acquisition time of each sample (timestamp) and the hourly energy consumption value (see the first two columns of Tab. 2).
Two effective calendar-related attributes that help the LSTM model learn daily and weekly patterns are the weekday and the hour of the day. As done in the previous subsection, we encoded the weekday as an integer ranging from 0 (Monday) to 6 (Sunday). The hour can be extracted from the timestamp and encoded as an integer ranging from 0 to 23. An example of these attributes is shown in the third and fourth columns of Tab. 2.
In the following, we detail how we obtained two other important attributes, namely AVG4D and TempCluster (see the last two columns of Tab. 2), which are related to calendar aspects and weather conditions respectively.

1) AVG4D attribute
This attribute is computed and added to the dataset so that the LSTM model can learn energy consumption differences between weekdays and weekends, as well as existing similarities among consecutive days at the same hour. To compute AVG4D we took inspiration from [66], which proposes a linear regression technique taking as input the energy consumption at hour t of the previous N days to estimate the energy consumption at hour t of the current day. The authors show that the best prediction performance is obtained when N = 4. Even though we adopt a different forecasting method, adding to each sample the average energy consumption of the four previous days at the same hour helps the model learn recent energy consumption trends at different hours of the day, and flattens out any abnormal consumption at the same hour in recent days.
The definition of "four previous days" is however not univocal in the computation of AVG4D. By analyzing the dataset, we realized that the average daily energy consumption is slightly lower on weekdays (Monday to Friday) than on weekends (Saturday and Sunday), and that the consumption is similar to that of close previous days, at the same hour, that fall within the same category. We thus decided to consider as the four previous days the ones that fall in the same category as the current day (i.e., weekdays or weekends). Figure 6 better explains this point: if, for instance, the current day is a Sunday (weekend), the four previous days considered to compute AVG4D are the day before (Saturday), Sunday and Saturday of the previous week, and Sunday of two weeks before. Instead, if the current day is a Friday (weekday), the days considered to compute AVG4D are Thursday, Wednesday, Tuesday and Monday of the same week.
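The same-category look-back can be sketched as follows (a simplified implementation under our own naming; NaN marks samples for which fewer than four matching previous days exist yet):

```python
import numpy as np

def is_weekend(weekday):  # weekday: 0 (Monday) .. 6 (Sunday)
    return weekday >= 5

def avg4d(consumption, weekdays, hours):
    """AVG4D for each sample: average consumption at the same hour over the
    four most recent previous days in the same category (weekday/weekend).
    Arrays are aligned, one entry per hourly sample, in chronological order."""
    out = np.full(len(consumption), np.nan)
    # history[category][hour] holds past consumption values at that hour,
    # one per already-seen day of that category.
    history = {False: [[] for _ in range(24)], True: [[] for _ in range(24)]}
    for i, (c, wd, h) in enumerate(zip(consumption, weekdays, hours)):
        past = history[is_weekend(wd)][h]
        if len(past) >= 4:
            out[i] = np.mean(past[-4:])
        past.append(c)
    return out
```

For example, with daily samples at a fixed hour, a Friday's AVG4D averages the preceding Thursday, Wednesday, Tuesday and Monday, exactly as described above.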
AVG4D thus has generally higher values during weekends than weekdays and, besides explicitly providing information on the average consumption of close days at the same hour, implicitly discloses weekend-weekday patterns that could otherwise be extracted through an additional attribute exploiting a one-hot encoding of weekdays and weekend days.

2) TempCluster attribute
Seasonality patterns are a key aspect that needs to be considered. In the previous subsections, we introduced some attributes (i.e., Weekday, Hour and AVG4D) that help capture daily or weekly seasonal patterns in the model training phase. The TempCluster attribute is instead introduced to capture yearly seasonal patterns. By analyzing the available dataset at a first glance, we realized that energy consumption is generally higher in winter than in summer. This is most probably due to greater presence at home, less available solar light and usage of electrical heating systems. Additionally, since the dataset relates to customers in London, where summers are mild, it is reasonable to assume that more energy is needed to heat homes in winter than to cool them in summer. More generally, it is clear that energy consumption is strongly correlated with weather conditions, as already investigated in previous works (see Section II).
We more thoroughly analyzed the dataset which, as already mentioned, is accompanied by meteorological variables recorded in London using the DarkSky API. The included variables are: maximum external temperature, relative humidity, visibility, wind speed, dew point and ultraviolet (UV) index. For each of the traces selected in Section V-A, we computed the Pearson correlation coefficient (i) between energy consumption and all the meteorological variables and (ii) among meteorological variables. The resulting correlation matrices (one per trace) were then averaged, and the average correlation matrix is reported in Fig. 7. Not surprisingly, energy consumption is highly negatively correlated with the maximum external temperature. It also has a high negative correlation with dew point and UV index but, since they are highly positively correlated with the maximum external temperature, we neglected them to avoid multicollinearity. The relative humidity has a moderate correlation with energy consumption, while the wind speed has a low correlation. Moreover, visibility has a high negative correlation with relative humidity and we neglected it to avoid multicollinearity.
Given the above analysis, the meteorological variables that are to some extent correlated with energy consumption and worth considering are maximum external temperature, relative humidity and wind speed. Including these three variables as stand-alone attributes would add unnecessary complexity to the input data, for two reasons. First, all three variables contribute to the apparent temperature (AT) in shade, which can be computed as [67]:

AT = T_ext + 0.33 · e − 0.70 · ws − 4.00

where T_ext is the maximum external temperature (°C), ws is the wind speed (m/s) and e is the water vapour pressure (hPa), computed as:

e = (rh / 100) · 6.105 · exp(17.27 · T_ext / (237.7 + T_ext))

where rh is the relative humidity (%). To simplify without losing much information, we therefore consider only the apparent temperature as weather-related attribute. Second, the apparent temperature (like its component variables) is very volatile but, clearly, minimal changes between hours are not expected to have any noticeable effect on energy consumption. Conversely, including the hourly apparent temperature would add avoidable noise during the training phase. This pushed us to reduce the granularity of the attribute by clustering it through the K-Means algorithm. We gave as input feature the sole apparent temperature for the whole set of samples in the reference period (i.e., 8,760 samples). To decide the optimal number of clusters k we used the elbow method [68], which indicated that the best number of clusters is k = 2. This means that an attribute disclosing whether the apparent temperature, in a specific hour of the day, is "cold" or "warm" is a good-enough weather-related attribute to be given as input to the LSTM model in the training phase.
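The computation above can be sketched as follows (we assume rh is expressed in percent, and the two-cluster step is shown with a minimal 1-D K-Means instead of a library call):

```python
import numpy as np

def apparent_temperature(t_ext, rh, ws):
    """Apparent temperature in shade from the formulas above: t_ext in deg C,
    rh as relative humidity in percent, ws as wind speed in m/s."""
    e = (rh / 100.0) * 6.105 * np.exp(17.27 * t_ext / (237.7 + t_ext))  # hPa
    return t_ext + 0.33 * e - 0.70 * ws - 4.00

def two_means_labels(values, iters=50):
    """Minimal 1-D K-Means with k = 2 (a stand-in for a library call):
    returns 1 for hours in the colder cluster and 0 for the warmer one."""
    v = np.asarray(values, dtype=float)
    centroids = np.array([v.min(), v.max()])  # initialize at the extremes
    for _ in range(iters):
        labels = np.abs(v[:, None] - centroids[None, :]).argmin(axis=1)
        centroids = np.array([v[labels == k].mean() for k in (0, 1)])
    return (labels == 0).astype(int)  # cluster 0 holds the colder centroid
```

Applied to the 8,760 hourly apparent temperatures of the reference year, the returned 0/1 labels are exactly the TempCluster values described next.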
The output of the aforementioned clustering operation is the attribute that we called TempCluster. For each hour of each day of year 2013, we encoded all the apparent temperatures belonging to the "cold" (resp. "warm") cluster with 1 (resp. 0): the result is shown in Fig. 8. The figure shows that TempCluster has value 1 for most of the winter hours, has value 0 for all the summer hours, and frequently switches between 0 and 1 in the mid-seasons (i.e., spring and autumn).

VI. LSTM MODEL CONFIGURATION
This section aims at describing how input data are modelled and how LSTM hyperparameters are chosen in the federated architecture, while also highlighting the differences with the model adopted in the centralized architecture for comparison.

A. INPUT DATA AND DESIGN
At time instant t, a number X = |χ| of historical samples (including the value recorded at t) is given as input to the LSTM model to forecast the energy consumption at t + 1, where χ = {0, . . . , X − 1}. In our experiments, we set X = 24: this means that at time t the samples at times t, t − 1, . . . , t − 23 are used for forecasting. In other words, our system implements a sliding window with a look-back of 24 samples (i.e., one day) and a look-ahead of 1 sample. The model adopted in this work consists of four layers (two hidden) and is similar to the one adopted in [43]:
1) The first layer (input layer) takes as input the X = 24 historical samples.
2) The second layer (hidden layer) includes 32 LSTM cells. The hyperbolic tangent (tanh) is chosen as activation function.
3) The third layer (hidden layer) takes as input the output of the 32 LSTM cells of the second layer and includes 16 LSTM cells, also adopting tanh as activation function.
4) The fourth layer (output layer) takes as input the output of the 16 LSTM cells of the third layer and consists of a dense layer with only one output unit: the output unit value represents the energy consumption forecast for the next hour.
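The windowing described above can be sketched as follows (univariate for brevity; in the actual model each timestep additionally carries the F = 5 features of Section VI-B):

```python
import numpy as np

def make_windows(series, lookback=24, lookahead=1):
    """Sliding-window pairs: each input is the last `lookback` samples,
    each target the value `lookahead` steps after the window."""
    X, y = [], []
    for t in range(lookback - 1, len(series) - lookahead):
        X.append(series[t - lookback + 1 : t + 1])  # samples t-23 .. t
        y.append(series[t + lookahead])             # target at t+1
    return np.array(X), np.array(y)

series = np.arange(100, dtype=float)  # stand-in for one hourly trace
X, y = make_windows(series)
print(X.shape, y.shape)  # (76, 24) (76,)
```

Each row of `X` then feeds the input layer, and the dense output unit is trained against the corresponding entry of `y`.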

B. LOCAL MODEL TRAINING
We consider a different number of energy consumption samples S, ranging from S = 2,880 (four months) to S = 7,920 (11 months), for each involved Edge Computing Node. As additional features, we include a subset of the attributes described in the previous Section and reported in Tab. 2. Specifically, we include weekday, hour, AVG4D and TempCluster, for a total of F = 5 features. We do not include the timestamp, since it is a simple text string and meaningful information has already been extracted from it (i.e., the hour attribute). We chose the hyperparameters for model training after careful manual tuning using partial data from the constructed dataset. As seen in Section III, the three most important parameters that have to be tuned to ensure good FedAVG performance are the number of local epochs N_epoch, the batch size B_size and the number of rounds N_round. Considering that Edge Computing Nodes are expected to have relatively limited processing capabilities, we set N_epoch = 5, B_size = 100 and N_round = 50. This design choice ensures a limited burden on the devices when they are selected for local training but, at the same time, good overall forecasting performance, as we will evaluate later. We chose the Mean Absolute Error (MAE) as loss function, and we adopt the Adaptive Moment Estimation (Adam) variant of Stochastic Gradient Descent as optimization algorithm, with a learning rate LR_edge = 0.0001.
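The FedAVG aggregation that these hyperparameters feed into can be sketched as a sample-weighted average of the local model weights (a minimal illustration under our own naming, not the TensorFlow Federated implementation):

```python
import numpy as np

def fedavg(local_weights, n_samples):
    """FedAVG aggregation at the Aggregator: a weighted average of the local
    model weights, weighted by each node's number of training samples.
    Each entry of local_weights is a list of per-layer weight arrays."""
    total = sum(n_samples)
    aggregated = [np.zeros_like(layer) for layer in local_weights[0]]
    for weights, n in zip(local_weights, n_samples):
        for k, layer in enumerate(weights):
            aggregated[k] += (n / total) * layer
    return aggregated
```

In each round, the Aggregator applies this to the models returned by the selected Edge Computing Nodes and redistributes the result as the new global model.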
We employ two well-known regularization methods to avoid or limit overfitting:
• Dropout: it consists in randomly neglecting some LSTM cells during each training step. We adopt this technique during local training at the Edge Computing Nodes for both LSTM layers, with a dropout rate DR_rate = 10%, meaning that 10% of the LSTM cells are neglected at any training step. This helps the process escape from local minima.
• Early Stop: it consists in stopping the training operation after the execution of a certain number of rounds, i.e., when the loss function no longer improves on validation data; we adopt this technique at the Aggregator on the global model. Early Stop relies on a counter that is incremented when the loss function worsens with respect to the previous round. The training procedure is stopped when the counter reaches a value ES_thresh = 3 since it is likely that, when this condition occurs, no further loss improvement can be obtained.
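The counter-based Early Stop above can be sketched as follows (a simplified version; note that, matching the description, the counter is not reset when the loss improves):

```python
class EarlyStop:
    """Stop when the validation loss has worsened with respect to the
    previous round ES_thresh times (default threshold of 3, as above)."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.prev_loss = float("inf")
        self.counter = 0

    def should_stop(self, loss):
        if loss > self.prev_loss:  # loss worsened w.r.t. previous round
            self.counter += 1
        self.prev_loss = loss
        return self.counter >= self.threshold
```

The Aggregator evaluates the global model after each round and calls `should_stop` with the new validation loss, ending the federated training when it returns True.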

C. CENTRALIZED MODEL FOR COMPARISON
For comparison purposes, we designed and implemented an LSTM model for the state-of-the-art centralized architecture depicted in Fig. 2. The model structure is analogous to the one described in Section VI-A and includes four layers, as done in the state-of-the-art work [43], which embraces a centralized architecture and is thus considered as benchmark: an input layer, two LSTM hidden layers including 32 and 16 LSTM cells respectively, and an output dense layer with one output unit. All the hyperparameters were chosen as specified in Section VI-B, with only two differences:
• Given the centralized nature of the learning approach, time is not slotted in rounds, so the N_round parameter is not considered. Instead, the maximum number of training epochs at the centralized server is set to N_epoch = 50, and Early Stop is applied among epochs.
• The learning rate LR_edge is not considered. Instead, a global learning rate is specified, i.e., LR = 0.0002. This value has been chosen after careful manual tuning.
We designed the LSTM model adopted by the centralized architecture to be as similar as possible to the one adopted by the federated architecture, and we set the hyperparameters for both approaches with the goal of ensuring similar forecasting performance, so that the solutions can be fairly compared on other aspects, as we will show in the next Section.

VII. PERFORMANCE EVALUATION
In this Section we thoroughly evaluate the proposed federated architecture. Unless otherwise specified, we set the parameters as recalled in Tab. 3. We adopted the TensorFlow Federated framework and the Python-based Keras API to implement the LSTM model and test the solution. We used the Google Colab Cloud platform to gather results.

A. TRAINING AND TEST SET CONSTRUCTION
We group the dataset traces according to three criteria:
• ACORN: traces are grouped in the 18 ACORN macro-groups, as reported in the left-hand side of Tab. 1; training and testing are performed per-cluster.
• K-Means: traces are grouped in the 18 clusters obtained as described in Section V-B and as reported in the right-hand side of Tab. 1; training and testing are performed per-cluster.
• Random: a random set of traces, among the 1,507 traces selected as described in Section V-A, is chosen. Unless otherwise specified, such a set includes n_traces = n_nodes = 80 traces, which is the average cluster size. The definition of such a set allows us to test the proposed solution when customers' clustering is not performed.
For any cluster/random set defined as specified above and including n_traces traces, we follow the methodology depicted in Fig. 9(a) to define the training and test sets. We randomly include the samples of 70% of the traces in the training set and the samples of the remaining 30% of the traces in the test set. Training is performed on data collected from January to November, while testing is done on data collected in December, which is thus considered as the testing month unless otherwise specified. As introduced in Section VI, not all the samples of the traces selected for training are included in the training set, as shown in the figure: for each trace, only the samples related to a random number of months between 4 and 11 are selected. We decided to follow this approach to limit computational needs, instead of further reducing the number of traces in the training set, because we expect that, in this way, more heterogeneous patterns can be captured in the training phase. In the figure, n_tr (resp. n_te) indicates the number of traces used for training (resp. testing), and none of the traces used for testing is included in the training set. It is clear that n_tr = 0.7 · n_traces and n_te = 0.3 · n_traces.

We also consider a slightly different methodology, shown in Fig. 9(b). In this case, all the traces used for testing in December have their samples from January to November included in the training set. This simulates a scenario where an Edge Computing Node contributes to model training and also benefits from the constructed global model to forecast energy consumption. 70% of the traces are randomly selected for training and, among them, 45% are also used for testing purposes. In this way, roughly 30% of the traces of any cluster/random set is used for testing (since 0.7 · 0.45 ≈ 0.3), making the methodology comparable to the one shown in Fig. 9(a).
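The per-set construction of Fig. 9(a) can be sketched as follows (identifiers and helper names are ours):

```python
import random

def split_traces(trace_ids, seed=0):
    """Fig. 9(a) methodology: 70% of traces for training (Jan-Nov data),
    30% for testing (December); for each training trace, a random number
    of months between 4 and 11 is kept."""
    rng = random.Random(seed)
    ids = list(trace_ids)
    rng.shuffle(ids)
    n_tr = int(0.7 * len(ids))
    train = {tid: rng.randint(4, 11) for tid in ids[:n_tr]}  # months kept
    test = ids[n_tr:]
    return train, test

train, test = split_traces(range(80))
print(len(train), len(test))  # 56 24
```

Repeating this with 10 different seeds yields the 10 training/test set instances used in the evaluations below.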
For each one of the evaluations reported in this Section we randomly created 10 training/test sets following the methodologies described above. Any evaluated performance metric is the average value from each of the 10 resulting instances.

B. EVALUATED METRICS
The following metrics are evaluated:
• Root Mean Square Error (RMSE): it is used to measure the forecasting performance and is computed as:

RMSE = sqrt( (1/N) · Σ_{t=1}^{N} (Y_t − Ŷ_t)² )

where Y_t is the recorded energy consumption measurement at time t, Ŷ_t the forecast value and N the overall number of testing samples.
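Equivalently, in code (a minimal numpy sketch of the definition above):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error over the N testing samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))  # 1.1547
```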

C. FEDERATED VS. CENTRALIZED ARCHITECTURE
This subsection compares the performance of the state-of-the-art centralized architecture and our proposed federated architecture, as depicted in Figs. 2 and 4. As specified in Section VI-C, the state-of-the-art centralized architecture considered for comparison is the one proposed in [43]. Table 4 reports a performance comparison between the federated and centralized approaches in terms of RMSE and training time, considering a Random set including n_traces = 80 traces. As already specified in the previous Section, we were able to set the hyperparameters for the federated architecture so that no performance loss is experienced with respect to the centralized one. Additionally, a 10%-15% performance gain is experienced when test traces are included in the training set (Fig. 9(b)) with respect to when they are not included (Fig. 9(a)), meaning that an Edge Computing Node should participate in the training process to maximize its forecasting ability, but that a global model trained by other Edge Computing Nodes can still be used if this is not possible (e.g. when the node joins the system and not enough samples have been collected yet). To be as general as possible and pose ourselves in the worst-case scenario, in all the following evaluations we will consider the case where testing traces are not included in the training set.

Table 4 also shows a comparison in terms of training time. For the centralized approach, such time has been directly measured on the Google Colab Cloud platform. In fact, it is reasonable to assume that the centralized architecture relies on public (or private) Cloud servers for model training. Conversely, the training time for the federated approach can only be estimated, since we are using a centralized Cloud platform to simulate a system where computation should instead be distributed. Specifically, the training time can be estimated in the following way:

1) Training time and forecasting performance
T_train^fed = T_round(n_tr = 1) · N_round^ES(n_tr)

where T_round(n_tr = 1) is the average training time per round when only one trace is included in the training set, as happens at Edge Computing Nodes when the federated architecture is adopted, and N_round^ES(n_tr) is the number of rounds executed before Early Stop intervenes. This training time estimation implicitly assumes that the Round Trip Time (RTT) between Edge Computing Nodes and Aggregator is negligible with respect to the training time per round. In our settings, the latter is around 10 s, so the assumption is reasonable since the RTT is expected to be in the order of hundreds of ms at most [69]. Also the FedAVG execution time at the Aggregator can be considered negligible, being an inexpensive sequence of weighted sums.

Both Tab. 4 and Fig. 10 show that the LSTM network training time is one order of magnitude lower when federated learning is adopted. The reason is that the federated architecture distributes training computation among Edge Computing Nodes and, as shown in Fig. 10, this makes the training time almost invariant to the training set size, since each Edge Computing Node is in charge of model training based only on locally-collected samples. Instead, in the centralized architecture, the training time consistently increases as the training set size increases, confirming that the centralized architecture provides a much less scalable solution. It is however important to point out that the shown results assume the same processing capability for both architectures (i.e., that offered by Google Cloud servers), while it is possible that, in a real deployment, Edge Computing Nodes would be more resource-constrained than Cloud servers, thus reducing the training time gain of our proposed solution. This evaluation is left for future work. Figure 11 reports the forecasting performance comparison when the training set size varies.
It is shown that the RMSE decreases as the training set size increases, since the trained model is able to recognize more and more recurring patterns. The figure also shows that the federated and centralized approaches always lead to similar performance.

2) Impact of clustering
We evaluate the impact of ACORN and K-Means clustering, comparing the proposed clustering strategies with Random. Figures 12 and 13 show the forecasting performance for ACORN and K-Means; they also include the average RMSE value among all clusters and the results obtained for the Random training set. The federated and centralized architectures lead to very similar forecasting performance for all the clusters. Additionally, it can be seen that the cluster ACORN-U leads to much worse performance than the clusters' average and than Random. By looking at the ACORN macro-groups [61], it can be seen that ACORN-U, recently also called ACORN-R, refers to non-private households. It is thus a quite different group than the others, since it includes communal accommodations like military bases, hostels, care homes etc., which generate much more heterogeneous energy consumption patterns than the other ACORN macro-groups, causing bad performance. Table 5 summarizes the average performance in the case of clustering and compares it with Random, both for the federated and centralized approaches; it also reports the standard deviation among different clusters. We also include in the table the performance of ACORN clustering when ACORN-U is excluded. K-Means clustering always outperforms all the other strategies (with a performance gain of around 10%-15%), while ACORN clustering has slightly worse performance than Random if ACORN-U is included in the evaluation, and better performance if it is excluded (the more realistic case). We can conclude that a clustering strategy helps improve performance and should be considered by-design in the federated architecture, and that clustering based on trace similarities (K-Means) leads to slightly better performance (around 5%) than clustering based on demographics (ACORN). However, the latter leads to more than acceptable performance if homogeneous clusters are considered (in this evaluation, if ACORN-U is excluded).
The case of ACORN-U also indicates that appropriate clustering is needed to ensure some performance gain.

3) Communication overhead
We now compare the federated and centralized architectures in terms of communication overhead, by analytically computing the minimum amount of transmitted data between the edge (i.e., Edge Computing Nodes or Smart Meters) and the energy company (i.e., Aggregator or centralized server) to train the LSTM model. In the federated architecture, the adopted formula is:

D_trans^fed = 2 · N_round^ES · C · n_nodes · S_model   (5)

where D_trans^fed is the overall transmitted data, N_round^ES the estimated number of rounds before Early Stop intervenes, C the fraction of Edge Computing Nodes randomly selected in each round, n_nodes the number of Edge Computing Nodes in the system and S_model the size of the model (i.e., the overall size of the transmitted LSTM cell weights). In the federated architecture the local models need to be sent, for all the N_round^ES rounds, by all the selected Edge Computing Nodes (C · n_nodes) to the Aggregator, and then the Aggregator needs to re-distribute the computed global model to the selected Edge Computing Nodes (which is why a factor 2 is included in the formula). In our calculations we set C = 10%, N_round^ES = 30 (experimentally chosen according to Fig. 17) and S_model = 0.032 Mb (real value observed for the adopted LSTM model). Note that we do not include the overhead due to model initialization in Eq. 5 (see Step 1 in Section III-B1) since it is paid only once for each Edge Computing Node, i.e., when it joins the federated architecture for the first time, and can thus be considered negligible.
For the centralized architecture, the adopted formula is instead the following:

D_trans^cen = n_nodes · S_data   (6)

where n_nodes is the overall number of Smart Meters in the system and S_data is the size of the local dataset that needs to be transmitted to the centralized server. Unless otherwise specified, we set S_data = 0.36 Mb, which is the size of a local dataset including historical data and features for 12 months with hourly resolution (i.e., 8,760 energy consumption samples and related attributes, as shown in Tab. 2). Figure 14 compares the communication overhead between the federated and centralized architectures as the number of nodes n_nodes in the system varies. The transmitted data increase linearly as the number of nodes increases, and are very similar for the two architectures. Considering the values for S_model and S_data as extrapolated from our considered dataset, no gain in communication overhead is experienced when the federated architecture is adopted.
However, any possible gain strictly depends on the size of the local dataset S_data. Figure 15 compares the communication overhead as the size of the local dataset S_data varies, considering 1 million users in the system. Since the amount of transmitted data is invariant to the local dataset size S_data in the federated architecture (as shown in Eq. 5), while it is not in the centralized one, the federated architecture leads to much less transmitted data as the resolution (or granularity) of the dataset increases. With a resolution of 1 minute (i.e., 525,600 samples per year), around 50 times less data need to be transmitted if the federated approach is chosen, since only the model, whose size is always fixed to S_model, needs to be sent to/from the Aggregator.
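The two overhead formulas can be compared numerically as follows (an illustrative sketch with the values used in the text; the 1-minute S_data is a rough linear scaling of the hourly value, so the resulting ratio is only indicative of the trend, not of the paper's exact figures):

```python
def d_fed(n_nodes, c=0.10, n_rounds=30, s_model=0.032):
    """Eq. 5: overall transmitted data (Mb) in the federated architecture."""
    return 2 * n_rounds * c * n_nodes * s_model

def d_cen(n_nodes, s_data):
    """Eq. 6: overall transmitted data (Mb) in the centralized architecture."""
    return n_nodes * s_data

n = 1_000_000
s_data_hourly = 0.36       # Mb, 8,760 samples/year at 1-hour resolution
s_data_minute = 0.36 * 60  # rough scaling to 1-minute resolution (assumption)
print(d_cen(n, s_data_hourly) / d_fed(n))  # comparable at hourly resolution
print(d_cen(n, s_data_minute) / d_fed(n))  # large federated gain at 1-minute resolution
```

Note that `d_fed` does not depend on the dataset size at all, which is exactly why the federated gain grows with the sampling resolution.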

D. SENSITIVITY TO INPUT FEATURES
We now focus our evaluation on the federated architecture. As a first step, we evaluate the impact of the input features on forecasting performance. The results show that considering only energy consumption as feature leads to better forecasting performance than considering it jointly with AVG4D or hour, meaning that these latter features, in our evaluation scenario, do not add insightful information during the training process if taken alone. Instead, adding weekday and especially TempCluster as features improves forecasting performance. This means that, in general, the traces experience strong weekly patterns and that exploiting clustered apparent temperature data, discriminating between hot and cold hours, is an effective way to improve forecasting performance. Moreover, when all the features are jointly considered (last row), some performance gain is experienced with respect to considering only weekday or TempCluster, meaning that AVG4D and hour carry some exploitable information when put alongside the other features.

E. IMPACT OF TUNING PARAMETERS
We evaluate the impact of some tuning parameters on forecasting performance in the federated architecture. As described in Section IV, the Aggregator can, in the initialization phase, centrally train the LSTM model using a number n_pretrain of pre-training traces obtained from customers willing to share them. Figure 16 shows the forecasting performance and the number of rounds before Early Stop intervenes as the percentage of pre-training traces increases. The figure shows that, as such a percentage increases, (i) a smaller number of rounds is needed to ensure model convergence and (ii) the forecasting performance is not affected. This means that using pre-training data helps speed up the FL-based training procedure without any impact on forecasting performance. In our settings, if only 15% of the traces are used in the pre-training phase, the number of rounds N_round^ES can be reduced by 25%, and the federated training time is thus significantly reduced.
The parameter C also plays an important role in the federated architecture. A higher value of C implies that more Edge Computing Nodes are involved in the training process in each round, with negative implications for communication overhead (see Eq. 5) and for the processing effort of the nodes, since they are likely to be selected more frequently. However, as shown in Fig. 17, a higher C is beneficial to speed up the training procedure. In other words, the figure shows that a variation of C does not have an impact on forecasting performance, but considering more nodes per round helps reduce N_round^ES. In our small-scale settings, the number of rounds N_round^ES can be reduced by around 30% if C = 40%.

Figure 18 shows instead the impact of the amount of local training data on forecasting performance and on training time. The same procedure as the one described in Fig. 9 is adopted for the definition of training and test sets, with only one difference: in this case, the Random traces selected for training are all between 1 and 11 months long, depending on the experiment. The x-axis (# Months) indicates the number of months considered in each experiment. For example, # Months equal to 6 means that, for all the Random traces, six months are randomly selected for training. Figure 18 reports the obtained results: the more data are used as input by any Edge Computing Node, the better the ensured forecasting performance. The best performance is obtained when the historical data of the previous 11 months (# Months equal to 11) are used for local model training. However, not surprisingly, the higher the amount of local data used, the higher the training time per round T_round, indicating that the amount of historical samples must be carefully chosen to avoid excessive training time overheads.

F. SENSITIVITY TO TESTING MONTH
We evaluate the forecasting performance of a model trained over the whole year 2013 on different testing months. Unlike the other subsections, here we consider a model trained as specified in Section VII-A, with the difference that December is also part of the training set. We focus on a Random set and vary the testing month from January to December. The goal is to understand whether the trained model is able to capture patterns occurring in different seasons of the year. The results are reported in Tab. 7. It can be seen that an acceptable forecasting performance is achieved for all the testing months, confirming that the trained model is able to capture changing patterns. However, better results are obtained in summertime, when people are away from home more frequently and consumption patterns are therefore more predictable.

G. COMPARISON WITH AN FL-BASED STRATEGY
In this subsection we compare our proposal with an existing FL-based solution [54]. We focus our comparison on a Random set (with n_nodes = 80 and n_tr = 0.7 · n_nodes = 56), which is the most general case also evaluated in [54]. We do not consider any further forecasting performance enhancement due to clustering (as done in this work) or to personalization (as done in [54]). We consider the four scenarios defined and evaluated in the reference paper. For the state-of-the-art approach, we consider N_round = 20, as specified in the paper, while in our strategy the training phase is stopped when Early Stop intervenes. Results are shown in Tab. 8: our proposal outperforms the state of the art in all the evaluated scenarios. This happens because the approach defined in [54] only considers energy consumption as an input feature in the training phase, while our strategy includes four additional features (i.e., weekday, hour, AVG4D and TempCluster), proven here to enhance the forecasting performance with respect to the state of the art both when C and N_epoch vary.
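The feature set that differentiates our strategy from [54] can be assembled as in the following sketch, assuming hourly consumption in a pandas Series. Our reading of AVG4D as the mean consumption over the previous four days is an assumption based on the feature descriptions in this paper, and `build_features` is an illustrative helper name, not the paper's code.

```python
import numpy as np
import pandas as pd

def build_features(series, temp_cluster):
    # `series`: hourly consumption as a pd.Series with a DatetimeIndex
    # `temp_cluster`: mapping from timestamp to apparent-temperature cluster id
    df = pd.DataFrame({"consumption": series})
    df["weekday"] = df.index.weekday
    df["hour"] = df.index.hour
    # AVG4D: mean consumption over the previous four days (96 hourly samples),
    # shifted by one step so the current sample is excluded from its own window
    df["AVG4D"] = series.rolling(window=24 * 4).mean().shift(1)
    df["TempCluster"] = [temp_cluster[t] for t in df.index]
    return df.dropna()  # drop warm-up rows that lack a full AVG4D window
```

Each row then provides the five inputs (consumption, weekday, hour, AVG4D, TempCluster) that a local model consumes during training.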

H. COMPARISON WITH OTHER PRIVACY-PRESERVING TRAINING STRATEGIES
The adoption of a federated architecture makes it possible to globally train a model while meeting customers' privacy requirements. However, from a customer perspective, two other trivial privacy-preserving strategies could be chosen:
• Training a model using one's own collected data, without any interaction between the Edge Computing Node and the federated architecture. We call this training strategy single trace (own).
• Training a model using a trace gathered somewhere else (e.g. from a public dataset), without any interaction between the Edge Computing Node and the federated architecture. We assume that only one trace is chosen for training to keep the processing burden on the Edge Computing Node low and manageable (as occurs in the federated and single trace (own) cases). We call this training strategy single trace (other).
For both single trace strategies described above, we consider 30% of the traces of a Random set (that is, 24 traces) to train 24 models using the historical data, from January to November, of 24 Smart Meters. Each model is then tested using the same trace that trained it in the own case, while it is tested with a randomly-chosen different trace from the Random set in the other case. December is always selected as the testing month. Table 9 (Standard column) reports the comparison between the single trace (own and other) and federated strategies, where results obtained with the 24 models are averaged. For federated we consider the case where the test traces are included in the training set, simulating a scenario where customers, to maximize their forecasting performance, join the federated architecture and participate in the training process (see also Tab. 4). Table 9 shows that the own strategy has the best performance, since the model is perfectly tailored to the customers' patterns, while other leads to the worst results, showing that using a different trace to train the model is not an effective strategy.
Federated leads instead to only slightly worse performance than own. However, federated is always a much more effective strategy than single trace if, for some reason, the energy consumption patterns of a customer suddenly change (e.g. because a new appliance is connected to the electrical system, or in the case of short-term housing). To evaluate this scenario we modified the December testing month: we created 24 fake composed testing traces, where the consumption pattern changes four times (i.e., every week) as shown in Fig. 19. In each week, the energy consumption of another randomly-chosen trace from the Random set is taken. These artificial composed traces place us in the worst-case scenario where consumption patterns vary substantially and very frequently. For single trace we evaluate the forecasting performance on the composed testing traces while using the 24 models as trained above (there is no difference between own and other in this case), while for federated we use the model trained by the federated architecture. Table 9 (Composed column) shows the obtained results. As can be seen, federated leads to much better forecasting performance than single trace (around 25% better), since the model trained by the federated architecture is more generalized and able to better forecast different patterns, even when they change frequently.
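The composed testing traces can be built as in this short sketch (the function name and the hourly, 168-sample week length are our assumptions for illustration):

```python
import numpy as np

def compose_test_trace(traces, weeks=4, rng=None):
    # Build a testing trace whose weekly segments come from different
    # randomly-chosen customers, emulating abrupt pattern changes
    rng = rng or np.random.default_rng(0)
    week_len = 7 * 24                      # hourly samples in one week
    picks = rng.choice(len(traces), size=weeks, replace=False)
    return np.concatenate([traces[i][:week_len] for i in picks])
```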

I. ASSESSMENT ON RECENT DATA
Finally, we used an already-deployed IoT system to collect energy consumption measurements in a house located in Lombardy (Italy) during September 2020. We adopted a Shelly Energy Meter (https://shelly.cloud/products/shelly-em-smart-home-automation-device/) connected to the magnetothermic switch of the house to gather energy consumption measurements at hourly granularity. We then considered ten models trained using ten different Random sets to forecast the house's load in any specific hour, using the data collected in the previous 24 hours as input. Table 10 shows the resulting forecasting performance (averaged over the ten instances) and its comparison with the results obtained so far, while Fig. 20 compares predicted and real consumption values for one of the instances. Results, although preliminary, are very promising: despite the model having been trained on a seven-year-old dataset with samples collected in a city of a different country, the performance is not severely affected.

J. BENEFITS OF OUR SOLUTION: A RECAP
To summarize, in contrast to the centralized architecture recalled in Fig. 2, our proposed federated architecture is based on a decentralized approach that brings many benefits, thoroughly investigated throughout the paper and especially in this Section. First, our solution embraces the privacy-by-design principle, since it guarantees that fine-grained energy consumption measurements are kept at the edge (i.e., where they are generated), mitigating the privacy breaches that could occur if such data were transferred to a centralized location. However, in some specific cases of unbalanced data, privacy could still be breached by simply analyzing the shared model weights [70]. Techniques such as differential privacy [71], [72], [73] can be straightforwardly adopted to mitigate this issue, but this is left for future work.
Another important benefit consists in higher training dynamicity at a lower computational cost. In the centralized architecture, a model that has to keep track of novel patterns (possibly occurring in the time-series data) needs to be periodically re-trained with massive amounts of new samples, (i) requiring a significant processing effort at the centralized server and (ii) leading to high training times. Moreover, if the model is not frequently re-trained, it may be unable to forecast novel patterns for a long time (low adaptability). Our proposal is instead able to continuously re-train the model and dynamically adapt it to new patterns, while distributing the training effort among many Edge Computing Nodes that only require moderate processing capabilities. Moreover, since local training involves a much lower amount of data, per-round training times can also be kept low.
Finally, the federated architecture generally reduces communication overhead with respect to the centralized one, especially when samples are collected at fine granularity. In fact, only the LSTM cell weights need to be exchanged between the Aggregator and the Edge Computing Nodes, while the fine-grained energy consumption measurements, which are generally larger in size, are kept local.
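This trade-off can be made concrete with back-of-the-envelope arithmetic (illustrative byte counts only; the sample and model sizes below are our assumptions, not the paper's measured values):

```python
def centralized_bytes(n_nodes, samples_per_day, days, bytes_per_sample=4):
    # Every node uploads all of its raw measurements to the central server
    return n_nodes * samples_per_day * days * bytes_per_sample

def federated_bytes(n_rounds, participants_per_round, model_bytes):
    # Only the LSTM weights travel, down and up, for each selected node
    # in each round; raw measurements never leave the edge
    return n_rounds * participants_per_round * 2 * model_bytes
```

For instance, a year of per-minute samples from 80 households already amounts to roughly 168 MB of raw data, while exchanging a small weight file for a few dozen rounds stays well below that, which is why the savings grow as sampling gets finer than one hour.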

VIII. CONCLUSION
In this paper we proposed an architecture that adopts Federated Learning to unlock effective short-term load forecasting at the edge of the network. The architecture allows multiple participants to collaboratively train an LSTM neural network while keeping sensitive fine-grained energy consumption measurements local, under the coordination of a centralized aggregation node. It also includes the possibility to use weather- and calendar-related information as input in the model training phase, and envisions the possibility to appropriately cluster participants with similar consumption trends or similar socioeconomic conditions, with the proven benefit of enhanced forecasting performance.
We thoroughly evaluated the proposed solution and compared it with a state-of-the-art architecture, where the LSTM network is trained in a centralized location using massive amounts of data collected from the borders of the network. The results show that our proposal can achieve forecasting performance as good as the state of the art, but outperforms it in terms of model training time (by up to one order of magnitude) and privacy awareness. With respect to communication overhead, our method outperforms the existing approach when the consumption measurements are collected at a finer resolution than one hour. We also showed that our architecture comes with a set of parameters (that is, the amount of pre-training data, the number of selected participants per FL round, and the amount of training data) that need to be properly tuned to strike the best balance between forecasting performance, training time and communication overhead, and that our approach outperforms other privacy-preserving training strategies, especially when energy consumption patterns rapidly change over time.
As future work we plan to extend the architecture to embed differential privacy principles, so that model inversion attacks can be mitigated. In this way, the participants' privacy can be preserved even when their locally-trained model discloses some information about the data used to train it. We also plan to implement a small testbed where the Federated Learning logic is executed on resource-constrained devices, so that we can thoroughly evaluate our architecture in a real (rather than simulated) deployment, and we plan to evaluate the forecasting performance when other clustering algorithms (e.g. DBSCAN) are adopted and/or other features are considered, possibly extracted through appropriate machine learning methods.

FABRIZIO OLIVADESE is an IT Digital Innovation Specialist at Artsana Group, the leading Italian company in the parenting, supplements, cosmetics and health space. In 2020 he received a Master's Degree in Computer Science (with honors) from the University of Milano-Bicocca, working on a thesis on the application of federated learning to residential energy consumption forecasting.