Bad and good errors: value-weighted skill scores in deep ensemble learning

In this paper we propose a novel approach to forecast verification. Specifically, we introduce a strategy for assessing the severity of forecast errors based on the evidence that, on the one hand, a false alarm just anticipating an occurring event is better than one in the middle of consecutive non-occurring events, and that, on the other hand, a miss of an isolated event has a worse impact than a miss of a single event that is part of several consecutive occurrences. Relying on this idea, we introduce a novel definition of confusion matrix and skill scores giving greater importance to the value of the prediction rather than to its quality. Then, we introduce a deep ensemble learning procedure for binary classification, in which the probabilistic outcomes of a neural network are clustered via optimization of these value-weighted skill scores. We finally show the performance of this approach in the case of three applications concerned with pollution, space weather and stock price forecasting.

Sabrina Guastavino, Michele Piana, and Federico Benvenuto
Index Terms: Forecast verification; ensemble learning; deep learning

I. INTRODUCTION
Predicting events over time applies to a number of fields, ranging from weather [1] and space weather forecasting [2], through environment [3] to stock market forecasting [4]. In all these frameworks, the goodness of a prediction is usually defined in terms of the correspondence between forecasts and observations and is known in the literature as the forecast quality [5]. For binary predictions, forecast quality is typically measured by using skill scores based on a confusion matrix whose entries count the number of false and true negatives and of false and true positives. Specifically, these skill scores rely on simple arithmetic formulas that compute in different ways the imbalance of the diagonal entries (representing the correct predictions) with respect to the off-diagonal entries (representing the incorrect predictions). Typical skill scores for forecast quality are the True Skill Statistic (TSS) [6], the Heidke Skill Score (HSS) [7] and the Critical Success Index (CSI) [8]. However, there is a different perspective from which to evaluate a forecast, i.e., in terms of its usefulness in supporting the user while making a decision. This type of goodness is known in the literature as the forecast value [9], and two examples of scores for its quantitative evaluation are the Cost Value Score and the Relative Value Score introduced in [10]. Moreover, the field of cost-sensitive learning is devoted to taking the misclassification costs (and possibly other types of cost) into consideration [11]. These skill scores give an extrinsic value of the prediction, as they are based on a cost-benefit ratio that belongs to the sphere of decision-making processes external to the prediction itself. Computing such skill scores requires quantifying the cost, the loss and the benefits of taking (or not taking) an action.

Sabrina Guastavino, Michele Piana and Federico Benvenuto are with the Dipartimento di Matematica, Università di Genova (e-mail: guastavino@dima.unige.it; piana@dima.unige.it; benvenuto@dima.unige.it). Manuscript received ..; revised ... arXiv:2103.02881v1 [cs.LG] 4 Mar 2021.
In this paper we focus on binary predictions performed over time and we propose a novel approach to evaluate the severity of prediction errors. The approach considers that a false alarm predicting that an event will occur just before its actual occurrence still eases the right decision from the user, while a delayed alarm is of little use. In this way, we classify errors on the basis of their importance and we get a novel notion of confusion matrix and related skill scores. In this new framework, the severity of errors depends on their impact on the decision making process and therefore we refer to these novel confusion matrix and skill scores as value-weighted. Operationally, we propose a general strategy for selecting optimal predictions among the ones provided by the training of a deep neural network. This strategy is inspired by ensemble learning, relies on the novel skill scores and implements the following steps:
• A deep neural network is trained on a historical data set, thus providing a set of probabilistic forecasts (indexed over the epochs of the training process) almost equivalent in terms of forecast quality but different from each other with respect to forecast value;
• The clustering of the probabilistic outcomes of the neural network is realized by maximizing a given skill score on the training set, so that the probabilistic forecasts indexed over the epochs are transformed into binary predictions;
• These binary predictions are exploited within an ensemble learning framework and the approach is applied on the validation set.

This ensemble network forecasting, when optimized with the value-weighted skill scores, yields better results in terms of both forecast quality and value than the ones provided by the optimization of classical quality-based skill scores. We show this improved effectiveness in the case of three applications. We considered the problem of forecasting the air quality in highly polluted urban environments, with particular reference to the prediction of concentration of fine particles with diameters of 2.5 µm and smaller (PM2.5) [12], [13], [14]. The second problem was concerned with solar flare forecasting, i.e. the prediction of those explosive solar events that trigger most space weather phenomena [15], [16], [17]. Finally, the third application focuses on stock price forecasting and, in particular, on the prediction of "down" movements in the market [18], [19], [20].
The plan of the paper is as follows. In Section II we define the classical skill scores and we show that binary predictions of the same quality can have completely different forecast values. In Section III we introduce a value-weighted confusion matrix and the corresponding value-weighted skill scores to quantify the forecast value of a binary prediction. In Section IV we show how this value-weighted approach works to assess the performance of standard machine learning methods. Section V introduces the ensemble learning process and describes its application to pollution, space weather and stock price forecasting. Our conclusions are offered in Section VI.

II. CONFUSION MATRIX AND SKILL SCORES
The results of a binary classifier are usually evaluated by computing the confusion matrix, also known as contingency table. Let $M_{2,2}(\mathbb{N})$ be the set of $2 \times 2$ matrices with natural-number entries. Let $y \in \{0,1\}^n$ be a binary vector representing the true label vector and let $p \in \{0,1\}^n$ be a binary prediction. Then the confusion matrix $C \in M_{2,2}(\mathbb{N})$ is defined as:

$$C := \begin{pmatrix} \sum_{i=1}^{n} y_i\, p_i & \sum_{i=1}^{n} (1-y_i)\, p_i \\ \sum_{i=1}^{n} y_i\, (1-p_i) & \sum_{i=1}^{n} (1-y_i)(1-p_i) \end{pmatrix}.$$

This definition implies that $C_{1,1}$ counts the True Positives (TPs), $C_{2,2}$ counts the True Negatives (TNs), $C_{1,2}$ counts the False Positives (FPs) and $C_{2,1}$ counts the False Negatives (FNs). From this confusion matrix several skill scores can be computed in order to evaluate the performance of the binary classifier. Given a confusion matrix $C$, we denote with $S : M_{2,2}(\mathbb{N}) \to \mathbb{R}$ a skill score defined on the matrix $C$. Four frequently used skill scores are:
• Accuracy (ACC) [21]:
$$ACC(C) = \frac{TP + TN}{TP + TN + FP + FN},$$
i.e., the ratio of the number of correct predictions to the total number of predictions.
• True Skill Statistic (TSS) [22]:
$$TSS(C) = \frac{TP}{TP + FN} - \frac{FP}{FP + TN},$$
i.e., the balance between the true positive rate (or probability of detection) and the false alarm rate. $TSS(C) \in [-1, 1]$ and it is optimal when it is equal to 1. A negative value means that the forecast behaves in a wrong way, i.e. it mixes the role of the positive events with the role of the negative ones.
• Heidke Skill Score (HSS) [7]:
$$HSS(C) = \frac{2\,(TP \cdot TN - FP \cdot FN)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)}.$$
The optimal value is equal to 1, a negative value meaning that the forecast is worse than a random forecast and the 0 value meaning that the forecast has the same skill as a random forecast.
• Critical Success Index (CSI) [8]:
$$CSI(C) = \frac{TP}{TP + FP + FN},$$
i.e., the ratio of the number of correct event forecasts to the number of events which occurred plus the number of false alarms. $CSI(C) \in [0, 1]$ and the optimal value is equal to 1.
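The four scores listed above can be computed directly from the four counts. The following is a minimal Python sketch that assumes the standard definitions of ACC, TSS, HSS and CSI; the function name `skill_scores` and the counts used in the call are illustrative and are not taken from the paper.

```python
def skill_scores(tp, fp, fn, tn):
    """Return ACC, TSS, HSS and CSI for a 2x2 confusion matrix.

    Layout follows the text: C[1,1]=TP, C[1,2]=FP, C[2,1]=FN, C[2,2]=TN.
    No guard is included for empty classes (zero denominators).
    """
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    # True positive rate minus false alarm rate.
    tss = tp / (tp + fn) - fp / (fp + tn)
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    hss = 2.0 * (tp * tn - fp * fn) / hss_den
    csi = tp / (tp + fp + fn)
    return {"ACC": acc, "TSS": tss, "HSS": hss, "CSI": csi}


# Illustrative counts only (not taken from any table in the paper).
scores = skill_scores(tp=40, fp=10, fn=20, tn=130)
```

All four scores depend on the prediction only through the four counts, which is the property examined in the remainder of this section.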
Given a binary vector $y \in \{0,1\}^n$ encoding the binary outcome of an empirical observation, we can define a function

$$F_y : \{0,1\}^n \to M_{2,2}(\mathbb{N}), \qquad (5)$$

which maps a binary prediction $p \in \{0,1\}^n$ onto the confusion matrix obtained by comparing $y$ and $p$. The function $F_y$ is clearly not injective. Figure 1 illustrates an example in which, given a binary observation $y$, four different binary predictions lead to the same confusion matrix. In the example, $y$ has 64 components, 14 of which are equal to 1 and 50 of which are equal to 0. The four predictions $p^{(1)}$, $p^{(2)}$, $p^{(3)}$ and $p^{(4)}$ in the four panels of the figure provide the same confusion matrix with entries TP=11, FN=3, FP=7 and TN=43. From a forecasting quality viewpoint, the four predictions are the same, since they lead to the same confusion matrix and, accordingly, to the same skill scores. However, from a forecasting value viewpoint, the four predictions are different. More specifically, prediction $p^{(4)}$, for which the corresponding FPs closely anticipate the observed outcomes equal to 1, should be preferred, in value terms, to the other predictions, which sound alarms after the occurrence of the events.
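The lack of injectivity can be checked on a toy case. The sketch below is not the 64-sample example of Figure 1: the vectors `y`, `p_early` and `p_late` are hypothetical, chosen so that the only difference between the two predictions is whether the false alarm comes just before or well after the event.

```python
import numpy as np


def confusion_matrix(y, p):
    """Classical confusion matrix [[TP, FP], [FN, TN]] for binary vectors."""
    y = np.asarray(y)
    p = np.asarray(p)
    tp = int(np.sum((y == 1) & (p == 1)))
    fp = int(np.sum((y == 0) & (p == 1)))
    fn = int(np.sum((y == 1) & (p == 0)))
    tn = int(np.sum((y == 0) & (p == 0)))
    return np.array([[tp, fp], [fn, tn]])


# Hypothetical observation with a single event at index 5.
y       = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
p_early = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]  # false alarm one step before the event
p_late  = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]  # false alarm three steps after the event

c_early = confusion_matrix(y, p_early)
c_late = confusion_matrix(y, p_late)
same_matrix = bool((c_early == c_late).all())
```

Both predictions give TP=1, FP=1, FN=0 and TN=8, so any score computed from the classical matrix alone cannot tell them apart.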
We now introduce a novel definition of confusion matrix, which is able to distinguish between these ambiguous configurations and therefore to locally restore the injectivity of $F_y$.

III. VALUE-WEIGHTED CONFUSION MATRIX AND SKILL SCORES
In order to define the value-weighted confusion matrix, we first introduce the error functions $\varepsilon_{1,2}$ and $\varepsilon_{2,1}$ such that, given the component $y_i$ of a binary observed outcome $y$ and the component $p_i$ of a binary prediction $p$, $\varepsilon_{1,2}(y_i, p_i)$ and $\varepsilon_{2,1}(y_i, p_i)$ measure the error of the incorrect prediction when either an FP or an FN occurs.
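These error functions allow the generalization of the confusion matrix concept as follows. We denote with $1_{\{a\}}$ a number equal to 1 when condition $a$ is satisfied and 0 otherwise. Then the weighted confusion matrix is

$$\tilde{C} := \begin{pmatrix} \sum_{i=1}^{n} 1_{\{y_i = 1,\, p_i = 1\}} & \sum_{i=1}^{n} \varepsilon_{1,2}(y_i, p_i) \\ \sum_{i=1}^{n} \varepsilon_{2,1}(y_i, p_i) & \sum_{i=1}^{n} 1_{\{y_i = 0,\, p_i = 0\}} \end{pmatrix}.$$

On the one hand, quality-based forecasting assumes

$$\varepsilon_{1,2}(y_i, p_i) = 1_{\{y_i = 0,\, p_i = 1\}}, \qquad \varepsilon_{2,1}(y_i, p_i) = 1_{\{y_i = 1,\, p_i = 0\}}.$$

On the other hand, in the case of value-weighted forecasting we choose $\varepsilon_{1,2}$ with weighting function $\psi$ [equation (10)] and $\varepsilon_{2,1}$ with weighting function $\phi$ [equation (12)]. In equations (10) and (12), $K$ is a fixed positive integer. Analogously to the case of the standard confusion matrix in Section II, $\tilde{C}_{1,1}$, $\tilde{C}_{2,2}$, $\tilde{C}_{1,2}$ and $\tilde{C}_{2,1}$ compute, respectively, the numbers of value-weighted true positives (wTPs), value-weighted true negatives (wTNs), value-weighted false positives (wFPs), and value-weighted false negatives (wFNs).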
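The generalization can be sketched in Python with the two error functions passed in as arguments. With indicator error functions the sketch reduces to the classical confusion matrix. The function `lookahead_fp` is a HYPOTHETICAL value-weighted choice written only for illustration: it is not the paper's weighting function, and its values (1 for an isolated false alarm, smaller when an event follows within `K` steps) are assumptions.

```python
import numpy as np


def weighted_confusion_matrix(y, p, err_fp, err_fn):
    """Generalized confusion matrix with pluggable off-diagonal errors.

    err_fp(i, y, p) and err_fn(i, y, p) return the error charged to sample i.
    """
    y = np.asarray(y)
    p = np.asarray(p)
    n = len(y)
    c11 = float(np.sum((y == 1) & (p == 1)))
    c22 = float(np.sum((y == 0) & (p == 0)))
    c12 = float(sum(err_fp(i, y, p) for i in range(n)))
    c21 = float(sum(err_fn(i, y, p) for i in range(n)))
    return np.array([[c11, c12], [c21, c22]])


def indicator_fp(i, y, p):
    """Quality-based choice: every false positive costs 1."""
    return 1.0 if (y[i] == 0 and p[i] == 1) else 0.0


def indicator_fn(i, y, p):
    """Quality-based choice: every false negative costs 1."""
    return 1.0 if (y[i] == 1 and p[i] == 0) else 0.0


def lookahead_fp(i, y, p, K=3):
    """HYPOTHETICAL weighted false-positive error (an assumption, not the
    paper's formula): cheaper when an event follows within K steps."""
    if not (y[i] == 0 and p[i] == 1):
        return 0.0
    for k in range(1, K + 1):
        if i + k < len(y) and y[i + k] == 1:
            return 1.0 - 1.0 / (k + 1)
    return 1.0


# Hypothetical data: one event at index 5, early versus late false alarm.
y       = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
p_early = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
p_late  = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

quality = weighted_confusion_matrix(y, p_early, indicator_fp, indicator_fn)
value_early = weighted_confusion_matrix(y, p_early, lookahead_fp, indicator_fn)
value_late = weighted_confusion_matrix(y, p_late, lookahead_fp, indicator_fn)
```

Under this assumed weighting the early false alarm is charged 0.5 and the late one 1.0, so the two predictions no longer share the same matrix, while the diagonal entries are unchanged.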
We remark that $\tilde{C}_{1,1}, \tilde{C}_{2,2} \in \mathbb{N}$ and that they have the same definition as the corresponding entries of the classical confusion matrix, whereas $\tilde{C}_{1,2}, \tilde{C}_{2,1} \in \mathbb{R}^+$ can be seen as weighted versions of $C_{1,2}$ and $C_{2,1}$ with weighting functions $\psi(y_i, p_i)$ and $\phi(y_i, p_i)$, respectively. In order to illustrate how these weighting functions work while computing the corresponding prediction error, we introduce the window $I_{i,K}$ centered in the index $i$, with size $K$. Then, for the $i$-th sample we have two possible cases:
• Case of a false positive ($y_i = 0$, $p_i = 1$): $\psi(y_i, p_i)$ depends on the sequence $\{y_j\}_{j \in I_{i,K}}$. In fact:
- If no event occurs in the window, i.e. $y_j = 0$ for each $j \in I_{i,K}$, then [equation].
- If at least one event occurs in the window, i.e. $1 \in \{y_j\}_{j \in I_{i,K}}$, then [equation (15)].
Therefore, from equation (15) we can distinguish two situations:
1. If the event occurs before time $i$ and there is no event in the next times, i.e. $y_j = 0$ for $j \in I_{i,K}$ and $j \geq i+1$, then [equation], since in equation (15) $\max_{1 \leq k \leq K} \left( \frac{1}{k+1}\, y_{i+k} \right) = 0$;
2. If at least one event occurs after time $i$, then [equation] and the weighting function decreases according to the inverse of the distance of the next event occurrence, e.g. if $y_{i+1} = 1$ then $\psi(y_i, p_i) = \frac{1}{2}$.
• Case of a false negative ($y_i = 1$, $p_i = 0$): $\phi(y_i, p_i)$ depends on the sequence $\{p_j\}_{j \in I_{i,K}}$. In fact:
- If no predicted alarm is in the window, i.e. $p_j = 0$ for each $j \in I_{i,K}$, then [equation].
- If there is at least one predicted alarm in the window, i.e. $1 \in \{p_j\}_{j \in I_{i,K}}$, then [equation (19)].
In particular, from equation (19) we can distinguish two situations:
1. If the predicted alarm is after time $i$ and there are no predicted alarms in the previous times, i.e. $p_j = 0$ for $j \in I_{i,K}$ and $j \leq i-1$, then [equation], since $\max_{1 \leq k \leq K} \left( \frac{1}{k+1}\, p_{i-k} \right) = 0$.
2. If there is at least one predicted alarm before time $i$, then [equation] and the weighting function decreases according to the inverse of the distance of the previous predicted alarm, e.g. if $p_{i-1} = 1$ then $\phi(y_i, p_i) = \frac{1}{2}$.

For the motivating example in Figure 1 (see Table I), the new confusion matrix gives a clearer idea of how the incorrect predictions are distributed along the sample index. On the one hand, in the value-weighted approach, wFPs associated to $p^{(1)}$ notably increase with respect to the quality-based FPs, coherently with the fact that this prediction sounds alarms far from the actual event occurrence.
On the other hand, wFPs associated to prediction $p^{(4)}$ significantly decrease, coherently with the fact that, in this case, many incorrectly predicted alarms anticipate the event occurrence. Similar considerations can be repeated for what concerns the three original samples incorrectly predicted as 0: we notice that, in prediction $p^{(4)}$, the number of wFNs is small, which means that the three missed events have been predicted in advance.

[...] Beijing [24]. In detail, this archive includes:
• Hourly weather information (dew point, temperature, pressure, wind direction and speed, cumulative number of snowing and raining hours).
The forecasting problem we considered is that of predicting whether the PM2.5 concentration at time T + 1 will exceed a fixed threshold associated to a condition of severely polluted air, given weather conditions and PM2.5 concentration at time T. The machine learning approaches used to address this forecasting problem were three standard supervised algorithms: Logistic Regression (LR) [25], Support Vector Machine (SVM) [26] and a standard Neural Network (NN) [27]. We trained and validated the three approaches using the dataset in the time range between 01/01/2010 and 12/28/2011, so that the training and validation sets had 17424 samples with just 122 samples labelled with 1 (corresponding to over-threshold pollution). Figure 2 shows the forecasting provided by the three algorithms in the case of a test set in the archive corresponding to the time range between 1/7/2013 at 00:00 UT and 1/15/2013 at 07:00 UT. Table II allows a quantitative assessment of these performances by means of both the standard quality-based confusion matrix and corresponding TSS, and the confusion matrix and TSS provided by the value-weighted approach. We focused on TSS because we considered a significantly imbalanced training set and in this case this score is more appropriate [28]. A comparison between Figure 2 and Table II shows how our value-weighted approach works.
Indeed, looking at the table:
• NN has significantly more FPs than the other two methods and slightly fewer FNs. As a consequence, its TSS is smaller than the one associated to LR (which produces a smaller number of FPs).
• NN has the highest wTSS, as a consequence of the smallest number of wFNs among the three methods.
Coherently with these values, Figure 2 shows that NN sounds either timely alarms or alarms in advance with respect to the actual event occurrence (this is particularly true in the case of the window highlighted by the grey box). Further, it provides more FPs, but these are either sounded in a close neighborhood of the actual event occurrence or correspond to a high PM2.5 concentration level.
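The supervised setting used in this section (inputs at time T, binary label for threshold exceedance at time T + 1) can be sketched as follows. This is only an illustration: the helper `make_supervised_pairs`, the toy series and the threshold value are hypothetical, since the text does not state the threshold that was used.

```python
import numpy as np


def make_supervised_pairs(features, pm25, threshold):
    """Build (input, label) pairs for one-step-ahead exceedance forecasting.

    features: (n, d) array of weather conditions at each time T.
    pm25:     (n,) array of PM2.5 concentrations at each time T.
    Input at time T  = weather conditions and PM2.5 concentration at T.
    Label at time T  = 1 if the PM2.5 concentration at T + 1 exceeds threshold.
    """
    features = np.asarray(features, dtype=float)
    pm25 = np.asarray(pm25, dtype=float)
    x = np.column_stack([features[:-1], pm25[:-1]])
    y = (pm25[1:] > threshold).astype(int)
    return x, y


# Hypothetical toy series; the threshold is a placeholder, not the paper's.
weather = np.array([[1.0], [2.0], [3.0], [4.0]])
pm25 = np.array([50.0, 120.0, 400.0, 80.0])
x, y = make_supervised_pairs(weather, pm25, threshold=300.0)
```

With labels built this way, over-threshold samples are rare, which is consistent with the class imbalance reported above and with the choice of TSS as the reference score.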

V. DEEP ENSEMBLE CLASSIFIERS AND APPLICATIONS
In deep learning, automatic classifiers can be constructed by applying thresholding procedures to neural networks (NNs) with probability outcomes. In this approach, a NN can be formally interpreted as a map $\theta(w, \cdot)$ that returns a probability outcome in $[0, 1]$ for each input sample, where $w \in V$ and $V$ represents the space of weights, and the ensemble learning process implements the following steps:
1) Train $\theta(V, \cdot)$ on the training set $\{(x_i, y_i)\}_{i=1}^{n}$ using an iterative optimization scheme that stops after $N$ epochs.
2) For each epoch $j$, with weights $w_j$, and given $X = (x_1, \ldots, x_n)$, choose the classification threshold as the real number that maximizes a specific skill score $S$. Therefore, if

$$\hat{y}^{w_j}_{X} := (\theta(w_j, x_1), \ldots, \theta(w_j, x_n))^T,$$

then the optimum threshold is the solution of the optimization problem

$$\tau^*_j := \arg\max_{\tau \in [a, b]} S\left(F_y\left(I_\tau\left(\hat{y}^{w_j}_{X}\right)\right)\right), \qquad (24)$$

where $[a, b]$ is a suitable interval with $0 \leq a < b \leq 1$, $I_\tau : \mathbb{R}^n \to \{0, 1\}^n$ is the indicator function $I_\tau(\hat{y}) = (1_{\{\hat{y}_1 > \tau\}}, \ldots, 1_{\{\hat{y}_n > \tau\}})^T$, $F_y$, defined as in (5), maps the binary prediction to the associated confusion matrix computed with respect to the true label vector $y$, and $S$ is the skill score computed on the confusion matrix, as defined in Section II.
3) Consider a validation set $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{m}$, and the matrix $\tilde{X}$ such that $\tilde{X} = (\tilde{x}_1, \ldots, \tilde{x}_m)$. For each epoch $j$, compute

$$\hat{y}^{w_j}_{\tilde{X}} := (\theta(w_j, \tilde{x}_1), \ldots, \theta(w_j, \tilde{x}_m))^T. \qquad (25)$$

In the final ensemble vote, in the case where the number of zeros is equal to the number of ones, we assume $p = 1$.
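The procedure can be sketched in Python starting from per-epoch probability outputs. This is a sketch under stated assumptions, not the authors' implementation: the threshold is found by a plain grid search over an assumed interval $[a, b] = [0.05, 0.95]$, TSS stands in for the skill score $S$ (a value-weighted score could be passed through the `score` argument), and the epoch selection with quality level `alpha` and the median vote follow the selection and voting rule of the ensemble process, with ties resolved to 1. All function names and the toy arrays are hypothetical.

```python
import numpy as np


def tss(y, p):
    """True Skill Statistic of a binary prediction p against labels y."""
    y = np.asarray(y)
    p = np.asarray(p)
    tp = np.sum((y == 1) & (p == 1))
    fn = np.sum((y == 1) & (p == 0))
    fp = np.sum((y == 0) & (p == 1))
    tn = np.sum((y == 0) & (p == 0))
    return tp / (tp + fn) - fp / (fp + tn)


def best_threshold(y, probs, score=tss, grid=np.linspace(0.05, 0.95, 19)):
    """Threshold in the grid maximizing the skill score on the training set.

    Grid search is an assumption; the first maximizer is returned.
    """
    values = [score(y, (np.asarray(probs) > tau).astype(int)) for tau in grid]
    return float(grid[int(np.argmax(values))])


def ensemble_predict(y_train, train_probs, y_val, val_probs, new_probs,
                     alpha, score=tss):
    """Threshold each epoch, keep epochs scoring above alpha on the
    validation set, and take the median of the retained binary votes.

    train_probs, val_probs, new_probs: arrays of shape (epochs, samples).
    """
    votes = []
    for pt, pv, pn in zip(train_probs, val_probs, new_probs):
        tau = best_threshold(y_train, pt, score)
        if score(y_val, (pv > tau).astype(int)) > alpha:
            votes.append((pn > tau).astype(int))
    if not votes:
        raise ValueError("no epoch passes the quality level alpha")
    votes = np.array(votes)
    # Median of binary votes; a tie between zeros and ones gives 1.
    return (2 * votes.sum(axis=0) >= len(votes)).astype(int)


# Hypothetical outputs of three epochs on tiny training/validation sets.
y_train = np.array([0, 0, 1, 1])
y_val = np.array([0, 1, 0, 1])
train_probs = np.array([[0.12, 0.23, 0.81, 0.93],
                        [0.32, 0.41, 0.58, 0.77],
                        [0.02, 0.03, 0.52, 0.61]])
val_probs = np.array([[0.10, 0.90, 0.15, 0.70],
                      [0.50, 0.90, 0.10, 0.30],
                      [0.01, 0.80, 0.02, 0.60]])
new_probs = np.array([[0.60, 0.10, 0.40],
                      [0.90, 0.90, 0.90],
                      [0.70, 0.20, 0.01]])

first_threshold = best_threshold(y_train, train_probs[0])
ensemble = ensemble_predict(y_train, train_probs, y_val, val_probs,
                            new_probs, alpha=0.9)
```

In this toy run the second epoch is discarded because its validation TSS is 0, and the two retained epochs disagree on the last two new samples, so the tie rule assigns 1 to both.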
We now show how this process works in the case of three applications concerning pollution, space weather and stock market forecasting. Our focus will be on the assessment of results when a value-weighted skill score is used in equations (24) and (26).

A. Pollution forecasting
We consider the same data and the same forecasting problem discussed in Section IV. The computational core is a Long Short Term Memory (LSTM) neural network available in the Keras library [29], in which the sigmoid function and the binary cross-entropy are used as activation function and loss function, respectively. The model is trained over N = 100 epochs using the Adam Optimizer [30] with learning rate equal to 0.001 and mini-batch size equal to 72. As explained at the beginning of this section, the ensemble output is determined by combining estimators associated to epochs where the skill score computed on the validation set is higher than the quality level α (see equation (26)). In this experiment α is fixed equal to 0.9.

B. Solar flare forecasting
Solar flares are the main trigger of space weather events, including Coronal Mass Ejections and solar wind [31]. Their prediction may rely on features extracted from magnetogram images of active regions (ARs) like the ones recorded by the Helioseismic and Magnetic Imager (HMI) on-board the Solar Dynamics Observatory (SDO) [32]. We considered images from the HMI archives in order to set up an experiment in conditions analogous to the ones considered in [33], [34]:
• We first grouped the images according to their issuing times and selected, in particular, just the images recorded at issuing time 00:00 UT.
• 23 features were extracted by means of the algorithms illustrated in [35].

[...] point out that, in this way, the validation set contains just two X1+ events, both associated to AR 12673 [33], [36].
In order to implement the ensemble learning approach we used a deep multi-layer perceptron with 7 hidden layers. The Rectified Linear Unit (ReLU) function was used to activate the hidden layers, the sigmoid activation function was applied to activate the output and the binary cross-entropy was used as loss function. The model was trained over 100 epochs using the Adam optimizer with [...]

The results of this analysis are shown in Table IV and imply once again that the classification strategy based on the maximization of wTSS leads to higher scores (and more diagonal confusion matrices) with respect to the results provided by the maximization of TSS. This is particularly significant in the case of the prediction of M1+ and X1+ flares, i.e. in the case when the training set is significantly more imbalanced. Figure 5 shows, unrolled over time, the prediction associated to AR 12671 [37], [38], which originated many C1+ flares but no M1+ and no X1+ events. The figure clearly shows that the use of a value-weighted strategy leads to a significantly smaller number of FPs in the prediction of both M1+ and X1+ events.

C. Stock price forecasting
We considered the problem of predicting "down" movements in stock prices relying on information concerned with the daily closing prices. More specifically, the feature utilized as input of the forecasting algorithm is a time series of five days of daily percentage change defined as in [39] [equation for $\eta$], where $P_{N-1}$ is the closing price at day $N-1$ and $P_N$ is the closing price at day $N$. We used as label for this feature the condition [equation], where $L = -1$ corresponds to the "down" movement.
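The construction of the input features can be sketched as follows. The sketch assumes the usual definition of daily percentage change, $100\,(P_N - P_{N-1})/P_{N-1}$, which may differ from the exact expression used in [39]. The labelling rule in the last line is HYPOTHETICAL: the text only states that $L = -1$ marks a "down" movement, so the sign test and the value $+1$ for the other class are assumptions, as are the function names and the toy prices.

```python
import numpy as np


def percentage_changes(prices):
    """Daily percentage change between consecutive closing prices
    (assumed definition: 100 * (P_N - P_{N-1}) / P_{N-1})."""
    prices = np.asarray(prices, dtype=float)
    return 100.0 * (prices[1:] - prices[:-1]) / prices[:-1]


def five_day_windows(changes, length=5):
    """Stack consecutive windows of `length` daily changes and pair each
    window with the change observed on the following day."""
    changes = np.asarray(changes, dtype=float)
    n = len(changes) - length
    x = np.array([changes[i:i + length] for i in range(n)])
    next_change = changes[length:]
    return x, next_change


# Hypothetical closing prices.
prices = [100.0, 102.0, 101.0, 101.0, 103.0, 104.0, 102.0, 105.0]
eta = percentage_changes(prices)
x, next_change = five_day_windows(eta)

# HYPOTHETICAL labelling rule: -1 ("down") when the next change is negative.
labels = np.where(next_change < 0.0, -1, 1)
```

Each row of `x` is one five-day input vector, and the corresponding entry of `labels` is the class that the forecasting algorithm would be trained to predict under this assumed rule.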
We [...] both the quality- and value-weighted approaches. These numbers show that, when we choose the wTSS optimization strategy, the ensemble learning method leads to predictions with lower TSS but higher wTSS. We further point out that, in stock index forecasting applications, the accuracy is often used for performance evaluation. Therefore, in Table V we also report both the quality- and value-weighted accuracy, although in our experiments the data sets are imbalanced, so that accuracy-type scores are less reliable than other skill scores like, for example, the TSS-type ones. Both accuracy and value-weighted accuracy are slightly better for the strategy based on wTSS optimization.
In order to assess the value-weighted approach in an operational framework, we simulated the following investment strategy, starting from an initial asset of 10 stocks:
1) If at day N − 1 a "down" movement is predicted for day N, then we sell two stocks.
2) If either at day N, or day N + 1, or day N + 2 the "down" movement occurs, we use all the money earned at step 1 to buy stocks. At day N + 3 we buy in any case.
This strategy is applied on the test set and the results of this analysis, illustrated in Figure 6, show that, in a long term perspective, the asset value provided by the value-weighted strategy overtakes the one provided by a standard quality-based strategy.
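The two-step investment rule can be sketched as a small simulation. Several details are assumptions made only to obtain runnable code: trades happen at the closing price of the day, fractional stocks can be bought, a sale is skipped when fewer than two stocks are held, overlapping signals are handled independently, and the asset value is the value of the stocks held plus any cash waiting to be re-invested. The function name and the toy inputs are hypothetical.

```python
def simulate_strategy(prices, predicted_down, actual_down,
                      initial_stocks=10.0, lot=2.0):
    """Simulate the sell-then-rebuy rule and return the daily asset value.

    predicted_down[n]: a "down" movement is predicted for day n
                       (the prediction is issued at day n - 1).
    actual_down[n]:    a "down" movement actually occurs at day n.
    """
    stocks = initial_stocks
    pending = []  # (cash from a sale, last day on which we re-buy anyway)
    values = []
    for day, price in enumerate(prices):
        # Step 2: re-invest the cash if the "down" movement occurs, or at
        # the deadline (three days after the predicted day) in any case.
        waiting = []
        for cash, deadline in pending:
            if actual_down[day] or day >= deadline:
                stocks += cash / price
            else:
                waiting.append((cash, deadline))
        pending = waiting
        # Step 1: a "down" movement predicted for tomorrow triggers a sale.
        tomorrow = day + 1
        if tomorrow < len(prices) and predicted_down[tomorrow] and stocks >= lot:
            stocks -= lot
            pending.append((lot * price, day + 4))
        values.append(stocks * price + sum(cash for cash, _ in pending))
    return values


# Hypothetical scenario: a drop from 10 to 8 at day 2, predicted one day early.
prices = [10.0, 10.0, 8.0, 8.0, 8.0, 8.0]
predicted_down = [False, True, False, False, False, False]
actual_down = [False, False, True, False, False, False]
values = simulate_strategy(prices, predicted_down, actual_down)
buy_and_hold = 10.0 * prices[-1]
```

In this toy scenario the alarm is formally a false positive for day 1, yet it still anticipates the drop at day 2, and the final asset value (84) exceeds the buy-and-hold value (80); this is the kind of anticipating error that a value-weighted score penalizes less.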

VI. COMMENTS AND CONCLUSIONS
This study introduces two novelties in the machine learning game.
The first one is a way to assess the forecasting performance of a machine learning algorithm, in which false positives anticipating the actual event occurrence are weighed less than the ones associated to alarms sounded behind schedule. This approach allows the definition of value-weighted skill scores that evaluate the forecasting performance in a way which is more appropriate for decision making processes.
The second novelty is related to the optimization of ensemble learning for classification purposes. Indeed, in three different forecasting contexts, we show that when [...]

The next step in this investigation will be the encoding of these newly introduced value-weighted skill scores in the loss functions utilized as part of the ensemble learning algorithm, so as to implement a forecasting approach in which the value-weighted strategy is introduced a priori in the optimization process.

We applied the value-weighted confusion matrix $\tilde{C}$ relying on this error function to the motivating example in Figure 1. The results in Table I show that, differently from what happens with the classical quality-based confusion matrix $C$, the off-diagonal terms of $\tilde{C}$ depend on the prediction vector.

$$\hat{y}^{w_j}_{\tilde{X}} := (\theta(w_j, \tilde{x}_1), \ldots, \theta(w_j, \tilde{x}_m))^T. \qquad (25)$$

4) Given a quality level α, select just the epochs for which the skill score $S$ computed on the validation set is bigger than α. This allows the selection of the set of epochs

$$J^* := \{ j \in \{1, \ldots, N\} : S(F_{\tilde{y}}(I_{\tau^*_j}(\hat{y}^{w_j}_{\tilde{X}}))) > \alpha \}. \qquad (26)$$

5) Define the result of the ensemble learning process as the binary value corresponding to the median value $m$ among all binary predictions associated to $J^*$, i.e. given a new sample $x$ the output is defined as [equation]

[...] problem discussed in Section IV, i.e. the prediction of over-threshold occurrences of PM2.5 concentration at time T + 1, having at disposal measures of this concentration and of eight features associated to weather conditions at time samples from 0 through T. The ensemble learning procedure is applied on the same training and validation sets as in Section IV. The computational core is now represented by a Long Short Term Memory (LSTM) neural network available in the [...]

Figure 3 provides an illustrative example of the fact that using value-weighted skill scores leads to a stricter epoch selection strategy. Table III contains the results of the application of the ensemble learning approach on the test period from 12/29/2011 00:00 to 12/31/2014 23:00. Specifically, the table compares the classification performances provided by the optimization of wTSS versus the ones provided by the optimization of TSS. These numbers clearly show that optimizing the value-weighted skill score systematically leads to (sometimes even significantly) higher scores due to the fact that the number of wFPs is smaller than the number of FPs (while the numbers of wFNs and FNs do not change). Figure 4 shows, unrolled over time, the behavior of the two optimization strategies on the test period considered in Section IV.

• The labelling process utilized the flare occurrence alarm sounded by the Geostationary Operational Environmental Satellites (GOES) cluster and associated label 1 to each feature vector where a flare was recorded within 24 hours after the issuing time. GOES also computes the flare energetic class and, in particular, in the experiment we considered events of GOES class C1 and above (C1+ flares), M1 and above (M1+ flares) and X1 and above (X1+ flares). The training process exploited the HMI archive in the time range from 09/15/2012 through 10/02/2015 and the validation process from 09/29/2015 through 01/11/2015 for the prediction of C1+ and M1+ solar flares. For the prediction of X1+ solar flares we considered a different splitting in order to have a reasonable number [...]

Fig. 3. TSS and wTSS versus epochs, where the scores are computed on the training set (red line) and on the validation set (blue line) in the case of the PM2.5 pollution experiment. Left panel: TSS with optimization strategy involving TSS. Right panel: wTSS with optimization strategy involving wTSS.

Fig. 5. Predictions unrolled over time by the ensemble method when the TSS and wTSS optimization strategies are adopted (top and bottom panel, respectively). Green alarms correspond to C1+ flares, yellow alarms to M1+ flares, and red alarms to X1+ flares. The blue bar plots correspond to the actual flaring events recorded by GOES, the y-label representing the corresponding GOES flare classes. Note that the grey boxes correspond to time periods where the input data are missing.

TABLE II. RESULTS PROVIDED BY LR, SVM AND NN IN THE CASE OF THE POLLUTION FORECASTING EXPERIMENT, WHEN THE TEST SET IS THE UCI DATABASE IN THE TIME RANGE FROM 1/7/2013 AT 00:00 THROUGH 1/15/2013 AT 07:00.

TABLE IV. SOLAR FLARE FORECASTING. RESULTS ON THE TEST SET OBTAINED BY REALIZING CLASSIFICATION VIA TSS OPTIMIZATION (SECOND, FOURTH AND SIXTH COLUMN) AND VIA WTSS OPTIMIZATION (THIRD, FIFTH AND SEVENTH COLUMN).

TABLE V. STOCK PRICE FORECASTING. RESULTS ON THE TEST SET EXTRACTED FROM THE YAHOO FINANCE DATABASE, OBTAINED BY USING THE TSS AND WTSS OPTIMIZATION STRATEGIES.