Electricity Price Forecasting in European Day Ahead Markets: A Greedy Consideration of Market Integration

In this manuscript we explore European feature importance in Day Ahead Market (DAM) price forecasting models, and show that model performance can deteriorate when too many features are included due to over-fitting. We propose a greedy algorithm to search over candidate countries for European features to be used in a DAM price forecasting model, that can be applied to several regression and machine learning modelling methodologies. We apply the algorithm to build price forecasting models for the Dutch market, using candidate countries selected through an integrated analysis based on open-source European electricity market data. Feature importance is visualised using an Auto Regressive and Random Forest model. We explain these results using cross-border flow and DAM price data. Two types of models (LEAR and the Deep Neural Network) are considered for the DAM price forecasting with and without European features. We show that in the Dutch case, taking European market integration into account improves the Mean Absolute Error (MAE) of the best performing DAM price forecasting model by 3.1%, the relative MAE (rMAE) by 3.85%, and the Symmetrical Mean Absolute Percentage Error (sMAPE) by 0.31 p.p., compared to the best forecasting model without European features. Through statistical testing we show that European features improve the accuracy of the forecasts with statistical significance.


DM-test
Diebold-Mariano test, a statistical test to compare accuracy between forecast timeseries.

I. INTRODUCTION
Demand response (DR) is an energy management technique where energy consumers are incentivised to shift their energy use in time [5], [6]. In market based DR, the incentive is economic through variable pricing. The electricity price reflects the availability of (renewable) energy through scarcity of a product. Low prices correspond with an energy surplus and low marginal cost of energy, where high prices correspond with energy scarcity and high marginal cost of energy. The introduction of spot markets allows for a different electricity price in every hour, 30-minutes and even 15-minutes. In Europe, the Day Ahead Market (DAM) is the main spot market for trading electricity. On the DAM, energy is traded in hourly blocks with corresponding prices. The DAM is well studied in the context of Electricity Price Forecasting (EPF) [7]. The European grid is well connected and the interconnectivity of countries is expected to increase in the future. In 2002, members of the EU vowed to have 10% of their generation capacity in cross-border transmission capacity [8]. A future ''European Supergrid'' could even connect the European grid to North Africa [9]- [11]. Inter-connectivity of the European grid will help reach greater efficiencies, improve resilience to climate change and will enhance energy flexibility [12]. The development of European infrastructure will improve internal market efficiencies, enhance security of supply and enable Renewable Energy Sources (RES) market penetration [8]. Inter-connectivity enables cross-border electricity trading, opening up national electricity markets to foreign demand. These facts already make a strong case for the consideration of European market integration in a DAM price forecasting model. French market features (e.g. load, prices, generation), for example, have been shown to be more important than Belgian features when forecasting the Belgian DAM prices [13]. However, the analysis was limited to one neighbouring country and only a Deep Neural Network (DNN) was applied. Similarly, a study [14] has shown that including Energy Exchange Austria's (EXAA) prices as features improves forecast accuracy in all other EU DAM markets, especially in the Dutch and French DAM forecasting models. However, the connection between the Dutch DAM and the EXAA is also shown to be getting weaker over time. Possibly due to the further integration of the Dutch market with the APX UK market. In another study [15], price forecasts of other European markets are used as exogenous input for an Italian DAM forecasting algorithm, significantly improving performance. The effect of European market integration is not only seen in the DAM, but also in the Intraday Market (IDM) [16]. In general, the effect of European market integration is understudied in the context of EPF.
Machine Learning (ML) techniques have been shown to be effective at forecasting electricity prices, both in the DAM [7] and the IDM [17], [18]. Specifically, the MultiLayer Perceptron (MLP) has been successfully applied to forecast DAM prices in Spain and Pennsylvania-New Jersey-Maryland [19], [20]. The more complex structured Deep Neural Network (DNN) was successfully applied in Belgium [21], Nord-Pool markets, Germany, France and Pennsylvania-New Jersey-Maryland [22].
Many other methods have been applied to forecast DAM prices [7]. According to an EPF benchmark study on the Belgian market, a two-layer DNN generally outperforms both statistical and other ML models, given the same features [21]. The Lasso Estimated Auto Regressive (LEAR) [22], or fARX-Lasso [3], is the best performing non-ML method and should be considered as a benchmark [22], especially since ML methods have been shown to not always outperform statistical models [21].
The performance of the different modelling approaches varies over different markets [22], but a recent case-study on the Dutch market is missing in literature. Many statistical methods rely on the calibration of linear relationships. While they can still be powerful modelling approaches, they might not perform well with high-resolution data like hourly prices with high volatility [21]. Price volatility in electricity markets can be pronounced due to the continuous need for balancing supply and demand [23]. It can differ significantly per month, and is subject to external price drivers like energy demand, demand elasticity, congestion, (renewable) energy generation, fuel prices, currency exchange rates and inter-connected electricity markets [13], [23]- [25]. With an increasing renewable energy penetration in electricity markets, price volatility is expected to increase due to the intermittent nature of renewables, in the absence of sufficient energy storage [25]. However, effective policy, cooperation between TSO's and intraday trading can presumably prevent a significant increase in price volatility on the DAM [26]. It is possible that as renewable energy penetration increases, price volatility would increase and ML methods would increasingly outperform statistical methods.
To summarize, market integration can be expected to play a large role in price settlement of European Day Ahead Markets. The increasing inter-connectivity may even increase external market influences on national markets. The current state of the art in EPF generally limits market integration features to a single external market, while the inter-connectivity of the grid and different local conditions would make it likely for national markets to be affected by multiple external markets. Also, little clarification is given on the actual market mechanisms that make the markets affect each other. In recent literature, a case study is missing on the Dutch market. And while it is expected that ML methods perform better than statistical methods in times of high price-volatility, this has not been confirmed with statistical significance.
In this manuscript we perform an EU wide, data-based analysis of European market feature (e.g. price and load) importance in DAM price forecasting models of European bidding zones. Open-source data from the ENTSO-E transparency platform [27] was used exclusively. Yearly cross-border flows are analysed, after which we apply a Least Absolute Shrinkage and Selection Operator (LASSO) [28], [29] estimated Auto Regressive model, and a Random Forest [1] model to identify European feature importance (Section IV). The results are used to select candidate European countries whose features are to be considered in a Dutch DAM forecasting model. We perform a benchmark using only Dutch features (Section V-B). Then, we propose a greedy approach to searches through candidate countries for the best performing combination of features for a DAM forecasting model (Section V-A). We apply two different types of models in the search for market integration features: the LEAR [3], [22] and the DNN. The LEAR is a linear regression model estimated with the LASSO. The DNN will be applied in two different configurations: the single-market DNN [21] (SM-DNN) and the multi-market DNN [13] (MM-DNN). The SM-DNN [21] is a neural network that can forecast all 24 hourly DAM prices of a bidding zone simultaneously. The MM-DNN is a neural network that forecasts the 24 hourly prices of multiple bidding zones simultaneously. Temporal variations in relative model forecasting performance will be analysed (Section V-D) using univariate Diebold-Marianotests (DM-tests) on the hourly forecasts, in combination with Kernel Density Estimates of the daily DAM prices per hour of the day.

II. ELECTRICITY PRICE FORECASTING METHODS
In order to test whether our proposed methodology for including market integration features in a Dutch price forecasting model leads to a performance increase, several benchmarking models are proposed. Using a naive forecast as general benchmark, we compare its performance with 4 other Dutch price forecasting models without European features. The full AutoRegressive (fAR) and the fAR with Exogenous variables (fARX) [2], [3] are compared with the Lasso Estimated AutoRegressive (LEAR) [3], [22], consisting of the LASSOor EN-estimated fARX. In this study we considered both the LASSO-and EN-estimated LEAR, and selected the best performing model based on a preliminary analysis. Since the Deep Neural Network (DNN) has recently been shown to be the best performing model in an EPF benchmark for the Belgian market [21], we include it in the benchmark for a Dutch price forecasting model.
The best performing models without European features will be used in the greedy algorithm in order to quantify their performance increase due to the inclusion of European features. The DNN will be included since it has already successfully been applied to improve the accuracy of the Belgian price forecast using French market features [13]. However, we consider two configurations of the DNN. First, we apply the SM-DNN with European features. The SM-DNN is used to forecast the 24 hourly Dutch DAM prices only, while European features are included in the input. Second, we apply and the MM-DNN, where the 24 hourly DAM prices of multiple bidding zones are forecast in the same model [13]. The MM-DNN is depicted in Figure 1, where the model is applied to forecast the Dutch and N other bidding zones' hourly DAM prices. We evaluate all models on their forecast accuracy of the Dutch DAM prices only.

A. LASSO ESTIMATED AUTO REGRESSIVE MODEL
The Lasso Estimated Auto Regressive (LEAR) is a fARX-model estimated using the LASSO (Appendix A1) [3], [22]. For DAM forecasting considered here, the fARX is given by the following formula: where X d,h,i are the n = 106 regressors or features in Equation (1). The model is trained by solving the least-squares estimate of the model, with either an l 1 -norm for the LASSO-estimation or an l 1 l 2 -norm for the EN-estimation. With sufficiently large penalty term some weights will become (near) zero [30], effectively performing variable selection. More information on the LASSO-or EN-estimation can be found in Appendix A1.

B. DEEP NEURAL NETWORK
A DNN is a feed forward Artificial Neural Network (ANN) that is trained using back propagation. It contains a minimum of 3 layers: an input layer, an output layer and at least 1 hidden layers. Following the standard notation in EPFs, we will use the name MLP for ANNs with 1 hidden layer and DNN for ANNs with 1 or more. Although the definition of what is ''deep'' might be debatable, it is out of the scope of this paper and we simply follow the current standard terminology used in literature [13], [21], [22].
The DNN is trained with the Adaptive Moment Estimation (ADAM) [31]. Figure 1 shows an MM-DNN with 2 hidden layers (z 1 , z 2 ), where n f is the number of input features in the input layer x, n 1 and n 2 are the number of nodes in hidden layer 1 and 2, respectively. The output layer (P) consists , with x ∈ (1, . . . , N) and h ∈ (1, . . . , 24)).
If N = 0, the DNN forecasts a single market's prices, the Dutch prices, only. If N = 1, the dual-market DNN as proposed in [13] results.
of 24 nodes for each country included in the forecast, one for each hour of the day (h). In the case of Figure 1, the DNN is used to forecast Dutch DAM prices together with a variable number (N ) of other bidding zones. In the case of a DNN with one hidden layer, the output layer is fully connected to the first hidden layer.
Features are optimised as hyperparameters similarly to [13], where a single binary choice-option for country load features is added to the search space. In our case these are the historic load, and load forecast of the countries considered in the forecast, which are added or removed through a single binary choice-option per country. Historic prices are always included as model features.
To optimise hyperparamerters and features, the Treestructured Parzen Estimator [4] (TPE) algorithm is used. The TPE is a Bayesian Sequential Model Based Optimisation (SMBO) methods, which can efficiently be applied to optimise hyperparameters [32], [33]. The method consists of a surrogate model that is built using Bayes rule, describing the probability of model performance as a function of the hyperparameters where y is model performance and x is a hyperparameter instantiation.
The surrogate model is used to estimate model performance as a function of features and hyperparameters. In the TPE algorithm, p(x|y) is defined such that where l(x) is the density formed by observations x (i) such that the corresponding loss, f (x (i) ), is less than a threshold (y * ). h(x) is the density formed by the remaining observations. The TPE algorithm samples candidates from l(x), after which l(x) is evaluated for each sample in order to suggest the candidate with the highest expected improvement.

C. DIEBOLD-MARIANO TEST
The DM-test is a statistical measure to compare accuracy of two forecast timeseries. Given a target timeseries (y t ) and two forecasts (ŷ 1t ,ŷ 2t ), the error timeseries (e i,t ) are calculated (Equation 5a). In this manuscript we apply the Mean Absolute Error (MAE) (Equation (5b)) as loss function (L). Based on this, a loss differential timeseries (δ t ) is calculated (Equation 5c), which is then used to perform the one-sided DM-test.
The one-sided DM-test tests the null hypothesis (H 0) that forecast timeserie 1 is more accurate than or equally accurate as forecast timeserie 2. The alternative hypothesis (H 1) is that forecast 2 is more accurate than forecast 1. In the context of DAM forecasting, the DM-test can be applied to either the full DAM timeseries (e.g. multivariate DM-test) or to evaluate forecasting performance for each hour separately (e.g. univariate DM-test).
In order to include EU market integration in a bidding zone's DAM forecasting model, we propose a greedy algorithm as defined in Algorithm 1. We apply the algorithm to prevent over-fitting on the large amount of available data by selecting only the relevant features. Starting with a candidate set of features (e.g. 1 contains only Dutch features), a model (M i ) is trained and its hyperparameters are optimised. The model is evaluated to test its predictive performance (p i ) on the bidding zone of interest's prices, in our case the Dutch bidding zone's, and added to a set (P i ) together with model performance (p i ) and features ( i ) of the current iteration (i). After the first iteration, all candidate countries ( ) are consecutively added to the latest feature set ( i ), to be used in the price forecasting model. The model is trained, its hyperparameters are optimised, and its forecasting performance on the bidding zone of interest (P j ) is evaluated and added to set (P). After  19 return P all candidate countries are evaluated, the best performing model's features are selected from the set (P). Those features are added to the feature set ( i ), while they are removed from the candidate feature set ( ). The procedure will then be repeated until either performance stops increasing, or all candidate features are added to the model.
The candidate set of features ( ) was selected using the data analysis described in Section IV, where all countries that result from at least one of the analyses (cross-border flow, ARX or RF) are selected. The candidate set of country features used for the Dutch market are summarised in Table 1.
Our proposed greedy algorithm can be applied to any model that allows for a large amount of features, and assists with preventing over-fitting. In this manuscript we independently apply the algorithm to three different modelling approaches (Section V-C). We perform both hyperparameter optimisation and feature selection on the validation set of the Dutch DAM prices. Finally, all models are evaluated on the test set to evaluate the increase in model performance. No model selection is performed on the test set.

III. DATA AND TOOLS
The data used for this study is taken from the ENTSO-E transparency platform [27]. Historic DAM price-series, historic loads and historic day ahead load forecasts are used. The data was split in train-, validation-and test-sets, as depicted in Figure 2. Data from 2015-2019 was used, with the year 2019 being used as test data. In order to be able to include the most recent information in training, 2017 was taken as validation data instead of 2018. A single split limits the data leakage that occurs in the first week of the 2018, where data from the validation set is used in the lagged features. There still is no data leakage from the test set. The LEAR models are build and trained, and the features were Z-score standardised using the scikit-learn python package [34]. The DNN was made using Tensorflow [35] and trained with the ADAM optimiser [31]. The TPE SMBO algorithm is implemented using the Python Hyperopt package [4]. The DM-tests are performed using the python EPF-toolbox [22].

IV. EUROPEAN FEATURE IMPORTANCE ANALYSIS
In order to identify possible European features that influence the Dutch market, a EU-wide analysis is performed on cross-border flow data and DAM prices.
European cross-border flow data [27] was analysed to identify trading patterns between countries. Figure 3 shows the cross-border flows between countries whose data is available through the ENTSO-E transparency platform. The data shows that France is the largest net exporter of the EU, while Italy is the largest net importer of the EU. The Nordic markets (FI, SE, DK and NO) seem to be a relatively independent block, mostly trading amongst each other. For each bidding zone, a LASSO-estimated ARX-model is trained and its weights are analysed. The model is given by the following formula: where N is the number of bidding zones, M the number of countries, p i,d,h the price of bidding zone i and z j,d,h the load forecast of country j, on day d and hour h, β are the weights or coefficients, and ε d,h the model error. The analysis allows us to gain insight in the relative importance of European features with respect to domestic features. The analysis fits the general EPF modelling approach, where historic prices and the load (forecast) are used to forecast next day's price. Figure 4a shows the normalised weights resulting from the ARX-models. The incoming arrow width indicates the relative weight of transmitting country features for the receiving country's price forecasting model. For countries containing multiple bidding zones, the weights of the bidding zones are summed. Weights are masked at a minimum of 5% normalised importance.
Since the LASSO estimated ARX is a linear model, it is possible that non-linear relationships are not well reflected in the analysis. To overcome this limitation, a RF-model is applied in similar fashion. The RF is trained to forecast the 24 hourly DAM prices of a bidding zone simultaneously, using the same features as the ARX-model. The RF is able to capture non-linear relationships, and can similarly show relative features importance. More information on the RF model can be found in Appendix A2. Figure 4b shows the normalised feature importance resulting from the RF-models. The RF seems to give a higher importance to domestic market features compared to the ARX.
Both the ARX-and RF-analyses show that Norwegian market features seem to perform well in many other European price forecasting models, while Norwegian models perform well using only Norwegian features. Compared to Sweden, which exports more in an absolute sense, Norwegian features perform well in a large number of forecasting models of other European markets. This could be do to the amount of storage present in the Norwegian system, possibly allowing them to export more dynamically. It would enable them to export at times the merit-order curve in other bidding zones is relatively high and steep, expressing more influence on price settlement. Another explanation could be the relative large transmission losses that come with the oversea cables that connect Norway to many other European markets, making high foreign prices necessary to make up for the transmission losses. Figure 5 shows the correlation matrix of Norwegian export and the DAM Prices of the importing markets. The figure shows that Norwegian exports correlates positively with the importing market's DAM prices. The correlation is the strongest in Denmark and Sweden, but is positive for all neighbouring markets. The positive correlation between Norwegian export and Swedish prices is as positive as to other connected market prices. This indicates that the difference between overland-and oversea-connected markets, and as such the possible effect of transmission losses on export behaviour, can't be observed to have an effect on the correlation between Norwegian export and the neighbouring price.
The Spanish and Portuguese markets seem to function as an independent block from the other EU markets, with French features having low importance in their price models. This could partially be explained by the Pyrenees separating them, which makes overland transmission cables relatively expensive [36]. This would result in the relatively low degree of interconnection of Spain with other EU member states [37].
Even though France is the largest net exporter of electricity, its feature importance in neighbouring market models is relatively low. One explanation could be the large share of nuclear power in France, making its export possibilities relatively inflexible. This makes it possible that the information in French features is captured in the historic prices of those markets, and less information from France is needed to quantify its effect on price settlement. Figure 6 shows the correlation matrix of French electricity export and the importing market's DAM prices. The figure shows that the correlation between electricity export and DAM prices is weak, and in  some cases negative. This can similarly explain the relative low importance of Italian features in other markets' models, regardless of its high net import. Figure 7 shows the correlation matrix of Italian import and the transmitting market's DAM prices. Again, only weak correlations are observed. A constant Italian demand in foreign markets (e.g. Slovenian, Swiss and French) could be captured in the historic prices of those markets, and less Italian information is needed to quantify this effect. Although cumulatively, Italian feature importance is relatively high in both the ARX-and RFmodels.
Germany and Switzerland trade a significant amount of electricity, but their features are not selected each other's price forecasting model. Germany has a large renewable energy share, making a constant trading behaviour less likely. However, it could be that a part of the export is from intraday trading. If that is the case, cross-border training would have a smaller effect on DAM prices.
Both the ARX- (Figure 4a) and the RF-analyses (Figure 4b) show there is no straightforward relationship between cross-border flows and feature importance in a foreign price forecasting model. Norwegian features have high importance in many markets, while French feature importance is limited. France exports a lot more electricity than Norway, but their inter-connectivity, means of generation and as such their inherent flexibility are vastly different. Which can also be seen in the correlation matrix of their export and the importing markets' DAM prices. Italian features show up in many ARX-models (Figure 4a), but with low importance compared to other markets regardless of its large net import. A possible explanation is the ability of a bidding zone to time its import and export with advantageous prices. However, more research is necessary to conclude this definitely.
For the Dutch price model, Table 1 summarises the results of the analysis perform in this section. It stands out that directly connected markets (BE, FR, NO and GB) arise from the ARX-and RF-analyses. The ARX-and RF-models identify second-order market effects through Danish (DK), French (FR) and Italian (IT) features. German features are considered due to the relatively large amount of trading between the two markets. For the Belgian, British and German price forecasting models, Dutch features seems to contain information on their prices as well. However, for the French, Italian, Danish and Norwegian models, Dutch features are not of high importance.

V. DUTCH DAM PRICE FORECASTING WITH EU MARKET INTEGRATION
To search for the best combination of European features for several Dutch price forecasting models, we apply the greedy algorithm (Algorithm 1) to three modelling approaches independently. The algorithm will be applied to search over the previously defined candidate countries ( ). Table 1 shows the summarised results for the European feature candidate selection, applied to the Dutch market. For countries containing multiple bidding zones (NO, DK, IT), a single bidding zone (NO-2, DK-1, IT-NORD) is selected for the remaining analysis.

A. MODEL TRAINING
For model training of the LEAR and DNN, two slightly different approaches are taken. In the LEAR models, all fARX-variables are considered possible features for the candidate countries. The penalty-factors (λ, ρ) are optimised on the validation set. For the DNN, a feature-and hyperparameter search is performed using the TPE algorithm (Section II-B). The search space ( ) that we have defined for the DNN's is shown in Table 2. The best performing models, evaluated on the validation set, are selected for all modelling approaches.

B. DUTCH BENCHMARK
In order to be able to say whether the inclusion of European features improves the performance of the Dutch forecast, a benchmark is performed using the Naive, fAR, fARX [2], [3], LEAR [3], [22] and DNN [21] models without European features. For the fAR, fARX and LEAR, a single model is trained for each hour of the day. While the DNN is applied to forecast the 24 prices simultaneously.
The results in Table 3 show that the LEAR sets a tough benchmark to beat, outperforming the DNN. Figure 8a shows the results of the multivariate DM-test. The test confirms that the LEAR model's forecast is significantly more accurate than all other forecasts using only Dutch features, while the DNN's forecast is significantly more accurate than all others' but the LEAR's.

C. GREEDY SEARCH FOR EUROPEAN FEATURES
Algorithm 1 is applied to search for the best feature combination in a Dutch DAM forecasting algorithm. The LEAR and DNN models are applied in the search for European features, where the DNN is both applied to forecast the Dutch price only (SM-DNN), and as a multi-market forecasting model (MM-DNN). All models are evaluated for their ability to forecast Dutch DAM prices only. Table 4 shows the forecast metrics of the LEAR models, showing an increase in accuracy as features from more countries are included. Model performance stopped increasing after 5 added countries (GB, DK, NO, FR and DE). The largest improvement can be found after the second and fourth iteration, when Danish and French features are included in the model. The uneven increase in VOLUME 9, 2021  performance could mean that there are system effects in price settlement, which are hard to quantify using only two markets. The LEAR S all model using all European features shows similar performance to the LEAR with features from set S L,1 . It still has improved performance over the LEAR model with only Dutch features. This can be explained by the strong regularisation ability of the LASSO, effectively enforcing sparsity in the solution. However, there is still some degree of over-fitting resulting in a lower loss than most models with less features. Figure 8b shows the results of the multivariate DM-test of the forecasts. It shows that as features from more countries are included in the model, the performance increases significantly. This leaves the LEAR with features from set S L,5 as statistically significant best performing LEAR-model. Table 5 shows the results from the European feature search applied to the SM-DNN. No significant increase can be seen after the first iteration, where British features are added to the model. After three iterations, test performance increased with respect to the Dutch-only model. However, the fourth iteration resulted in the model with the lowest test loss. Interestingly, the same countries except Germany are selected as in the LEAR model in different order. Model performance decreased after the second iteration, while the validation loss decreased. This could indicate that relevant market dynamics can change depending on the year. The inclusion of Norway lead to the largest loss decrease, which can possibly be explained by Norway's high feature importance found in Section IV. Figure 8c shows the results of the multivariate DM-test between the forecasts of the SM-DNN's. It shows little statistical significance in improved forecast accuracy. The model with features from set S D,4 does show some significant performance increase. The SM-DNN with features from set S D,4 has the most convincing p-values for improved forecast accuracy compared to models with features from sets NL, S D,1 and S D, 3 . In general, the forecast error decreased by including more European features. Even though not all improvements are statistically significant, the final model including features from set S D,4 is arguably better than the DNN with only Dutch features. The SM-DNN S all model with all European features shows a large performance decrease, even compared with the SM-DNN using only Dutch features. This indicates that even though the degree of l 2 regularisation is taken as a hyperparameter, the model is hard to train with this amount of features. Table 6 shows the performance of the models generated in the European feature search for the MM-DNN. The model is trained by minimizing the MAE of the price forecasts over all countries included in the model. Naturally, tradeoffs are made between, for example, Dutch and French price forecasting accuracy. This could both lead to a performance decrease due to trade-offs, or to a performance increase by preventing over-fitting. The greedy algorithm select the same markets in the same order as the single market DNN, but stops one iteration sooner.    The MM-DNN with features from S D,2 and S D,3 outperform the single market DNN with the same features. Indicating that forecasting prices of multiple markets could indeed lead to better single-market performance. Forecast accuracy generally improved with the inclusion of more European features. The MM-DNN with countries from set S D,3 is significantly better than all other MM-DNN models and the DNN using only Dutch features. Although the significance of the improvement from S D,2 to S D,3 is debatable. The MM-DNN S a ll model with all European features shows a large performance decrease. The rMAE larger than 1 shows the forecast performance is lower than the naive model. The larger performance decrease in the MM-DNN could be explained by a trade-off between target accuracy's. The loss function while training the model combines the losses for all bidding zones, which could result in a trade-off between price forecasting performance in the Dutch and other bidding zones. Table 7 shows the countries whose features are added in every iteration and for all applied modelling approaches. Table 8 shows the improvements of the applied modelling approaches when European market integration features are included in the model. All approaches show improvements with an increasing number of market integration features. For the Netherlands, the first countries whose features are included by the greedy algorithm are Great Britain and Denmark for all modelling approaches. French features are included in all modelling approaches, and in some cases Norwegian and German features are included as well. The third country added by the greedy algorithm differs between the LEAR and the DNN's. In the LEAR, Norway is chosen as third external market, while in the DNN French features are included. This could indicate that the relationship between the Norwegian market and Dutch prices are more linearly approachable than the relationship between the Dutch price and French market. A possible explanation is that the effect of the dynamics between the French and British market on the Dutch price, are better captured in the DNN's.
The largest improvements are found in the MM-DNN, but it doesn't perform better than the LEAR. The large relative improvement of the MM-DNN could mean that market integration dynamics are better captured by non-linear relationships, resulting in a larger possible improvement in non-linear models in comparison to linear models.
The SM-DNN doesn't perform as well as the MM-DNN, which could be explained by two factors: during training, it is possible that the MM-DNN's were able to overcome local optima when different features are included. This would result in improvements in accuracy that are not due to the information in the data, but due to the training process. Another factor could be the prevention of over-fitting by generalizing tasks [13]. However, extensive optimisation of the DNN-type model performance was not within the scope of the research. Meaning that it is possible that other DNN configurations or other training methods could result in DNN models that outperform the LEAR.

D. TEMPORAL VARIATION IN PERFORMANCE
While the multivariate DM-tests give a good relative representation of forecast accuracy, electricity prices are known to be more volatile in some hours than others. Figure 9 shows the Kernel Density Estimates (see Appendix A3) of the Dutch DAM prices in August and October 2019. It can be seen that the DAM price behaves differently throughout the day, with low price volatility in the early morning and late evening, and more volatile prices during midday. Figure 10 shows the univariate DM-test of LEAR model with features from S L,5 against the benchmark LEAR and DNN, DNN S D,4 and MM-DNN S D, 3 . It can be seen that the LEAR forecasts are more accurate than the DNN models' forecasts mostly at hours with low price volatility. Figure 11 shows the same DM-test  but applied to model MM-DNN S D, 3 . It can be seen that the MM-DNN forecasts are more accurate than the LEAR type model forecasts at times with some degree of volatility, but not when it is too high. These months have been chosen for the sake of simplicity, the results can also be seen in other months. Renewable energy generation forecasts are not included as features, but could assist with forecasting highly volatile prices. The analysis indicates that as prices are becoming more volatile, stochastically trained non-linear models start outperforming mathematically optimised and regularised linear models. This makes the DNN modelling approach worth considering for the future Dutch DAM forecast, as it is possible that electricity prices will become more volatile in the future [25].

VI. CONCLUSION
In this manuscript, we propose a method for searching optimal combinations of European features in Day Ahead Market (DAM) price forecasting models. By using the Dutch market as a case study, we show that taking European market integration into account in regression and machine learning models can improve the forecasting performance of the best performing DAM models. We identify and visualize European feature importance, showing that flexibility in the energy system might affect a country's feature importance. This is confirmed by the correlation between electricity export and DAM prices. We have proposed a greedy algorithm to search for European features in a DAM price forecasting model, using candidate countries resulting from an analysis on before-mentioned feature importances. The proposed greedy algorithm improved forecasting performance of all proposed modelling approaches with statistical significance.
The algorithm was applied to the Dutch market, using three different modelling approaches: the LEAR [3], [22], the Single-Market Deep Neural Network (SM-DNN) [21] and the Multi-Market DNN (MM-DNN) [13]. A benchmark was performed using the LEAR and DNN including only Dutch features. The algorithm was able to identify features while preventing over-fitting on the data, which did happen in the models when including all European features. The greedy algorithm first chose British and Danish market features, after which either Norwegian or French were included in the model, differing between the LEAR and the DNN's. The LEAR model, which is known to have good regularisation properties, performed best when features from 5 other countries are included in the model. The SM-DNN model performed best with features from 4 other countries, and the MM-DNN performed best with features from 3 other countries. The largest improvement was found in the MM-DNN, which can be explained by either the training process (as discussed in Section V-C) or the non-linear nature of market dynamics. However, in the Dutch case the LEAR model with market integration features is the overall best performing model. Temporal variations in model performance are shown, and a relationship between price volatility and relative model performance was found. The LEAR generally performs better than the DNN during hours with low volatility, while the DNN-type models show some improved performance over the LEAR in hours with high volatility. This could indicate that as prices are expected to become more volatile in the future [25], non-linear models could outperform linear models. The LEAR model performed relatively well using all European features, which can be explained by its strong regularisation properties. The DNN type models that used all European features showed a decrease in performance, also with respect to using only Dutch features. The MM-DNN showed the largest performance decrease, which could be due to the multi-market loss function. This could result in trade-offs between price forecasting performance on the Dutch market and other bidding zones. In future work, other modelling approaches could be used in the proposed greedy algorithm. Also, different lengths of the training data could be used, other hyperparameter configurations of the DNN could be set, and model ensembles could be made, in order to optimise forecasting performance. More features could be used, like renewable energy generation forecasts. Weights could be added to the MM-DNN loss function to optimize Dutch price forecasting performance. Other years of data can be used to train and test the models, allowing the investigation of temporal changes in market dynamics. Also, the analysis could be applied to different European markets to test the sensitivity of a certain market's model accuracy to market integration features. This would also allow for an extended analysis on the relative performance of linear and non-linear models under different scenarios of price volatility.

A. ADDITIONAL METHODOLOGY
In this Appendix we present an extended description of some of the methodologies used.

1) LASSO AND ELASTIC-NET
The Least Absolute Shrinkage and Selection Operator (LASSO) [28] is a least squares linear regression model, where an additional l 1 -norm penalty is applied to the weights in order to regularise the model and enforce sparsity of the solution. The LASSO solves the optimisation problem where z ∈ R N ×1 is a vector of outputs, ∈ R N ×p is a matrix of features and θ ∈ R p×1 is a vector of weights. The regularisation parameter λ ∈ R can be used to control the trade-off between sparsity of the solution and the approximation error. The Elastic-Net (EN) is a linear regression model fit with both an l 1 and l 2 penalty norm on the weights of the model. During model selection, both the penalty gradient on the weights (λ) and the l 1 l 2 ratio (ρ) are optimised. A ρ of 0 would result in a Ridge Regression problem, while a ρ of 1 results in a LASSO. The objective function when fitting the EN is: where variables have the same representation as in the LASSO objective (8), and ρ represents the l 1 l 2 ratio.

2) RANDOM FOREST
The Random Forest (RF) algorithm was developed by Breiman [1]. It is an ensemble classification or regression approach using random regression trees, hence the name Random Forest. The RF is a collection of tree-predictors whose output -in the case of regression -is an unweighted average of the outcomes of the separate trees: where h is the averaged output of regression trees h(x; θ k ) with input x, feature space θ, subspace θ k ⊂ θ and K regression trees. The RF estimates a features's importance based on the amount of splits in the trees in comparison to the amount of samples the feature splits. It is also referred to as the 'Gini importance' or the 'mean decrease impurity' [34].

3) KERNEL DENSITY ESTIMATION
A kernel density is a non-parametric method that can be used to estimate the probability density function (PDF) of a random variable. If (x 1 , x 2 , . . . , x n ) is a sample from the univariate distribution f , its kernel density estimator is then described byf where K is the kernel (in our case we have applied the Gaussian kernel function) and h is the bandwidth (in our case this has been selected using Scott's method [38]). The KDE has been applied using the Scipy python package [39]. In 2019, he started working as a Water and Energy Consultant with HKV Consultants, Delft, while being active as a Researcher at Delft University of Technology. His research interests include the application and economic potential of demand response in open canal systems. His research interests also include forecasting, electricity markets, European market integration, demand response, and water system control.
JESUS LAGO received the B.Sc. degree from the University of Vigo, Spain, in 2013, the M.Sc. degree from the University of Freiburg, Germany, in 2016, and the Ph.D. degree in machine learning applied to the energy transition from Delft University of Technology, The Netherlands, in 2020. He is currently an Applied Scientist with Amazon working on forecasting algorithms and recommender systems. His research interests include forecasting, time series algorithms, recommender systems, and ranking algorithms.
PETER PALENSKY (Senior Member, IEEE) received the M.Sc. degree in electrical engineering and the Ph.D. and Habilitation degrees from Vienna University of Technology, Austria, in 1997, 2001, and 2015, respectively. He cofounded an Envidatec, a German startup on energy management and analytics, and joined the Lawrence Berkeley National Laboratory, Berkeley, CA, USA, as a Researcher, and the University of Pretoria, South Africa, in 2008. In 2009, he became appointed as the Head of Business Unit on sustainable building technologies at the Austrian Institute of Technology (AIT), and later the first Principle Scientist of complex energy systems at AIT. In 2014, he was appointed as a Full Professor of intelligent electric power grids with TU Delft. He is active in international committees, such as ISO or CEN. His research interests include energy automation networks, smart grids, and modeling intelligent energy systems. He also serves as an IEEE IES AdCom Member-at-Large in various functions for IEEE. He is also the Editor-in-Chief of IEEE Industrial Electronics Magazine, and an associate editor of several other IEEE publications, and regularly organizes IEEE conferences.
EDO ABRAHAM received the M.Eng. degree in electrical and electronic engineering and the Ph.D. degree in control engineering (aeronautics) from Imperial College London, U.K., in 2008 and 2013, respectively.
From 2014 to 2016, he was a Research Associate with the Environmental and Water Resources Engineering Group, Imperial College London. Since 2016, he has been an Assistant Professor of water and control with the TU Delft's Faculty of Civil Engineering and Geosciences. His research interests include the application of mathematical optimization, control and systems theory to advance water management and environmental engineering, and their nexuses with energy and agricultural systems. His application areas span from open canal water systems, irrigation to space heating. He also serves as an Associate Editor for the Energy Systems journal (Springer).