Uncertainty-Aware Learning With Label Noise for Glacier Mass Balance Modeling

Glacier mass balance (MB) modeling is crucial for understanding the impact of climate change on Earth’s freshwater resources and sea-level rise. Recent works have shown that machine learning (ML) and deep learning (DL) methods capture the nonlinearities in the system better than the commonly used temperature-index models. However, when these methods are trained on remote sensing products, data noise becomes a challenge, and quantifying the uncertainty is therefore essential. In this work, we produce a tabular dataset consisting of annual MBs for 1000 glaciers over 20 years with meteorological and topographical input features. Using this dataset, we systematically study various uncertainty estimation methods and their impact on the quality of the predictions. Our experimental results show that ensemble methods are promising for capturing the uncertainty in the data: their predictions are more accurate, more robust against label noise, and better calibrated. In particular, the multilayer perceptron (MLP) ensemble coupled with an explicit noise model shows an increase of up to 5.5% in the explained variance and is much less affected by the gradually injected label noise: its average mean absolute error (MAE) increases at half the rate. For reproducibility, code and data are available at https://github.com/dcodrut/oggm_smb_dl_uq.


Codrut-Andrei Diaconu, Graduate Student Member, IEEE, and Nina Maria Gottschling

Index Terms: Ensemble learning, glacier mass balance (MB) modeling, noisy labels, robustness, uncertainty quantification (UQ).

I. INTRODUCTION
The cryosphere, like any other component of the Earth system, is highly complex and nonlinear. Modeling it accurately remains challenging, especially at regional scale [1]. As the societal and environmental impact of the retreat of glaciers is certain [2], appropriate methods for modeling and predicting the evolution of glaciers are important for adapting the necessary policies [3]. The glacier mass balance (MB), defined as the sum of accumulation (e.g., through snow, avalanches, refreezing of rain) and ablation (e.g., through surface melting, drifting snow, sublimation) [4] over a fixed period of time, is one of the most important components in glacier modeling and also one of the essential climate variables (ECVs) [5].

Codrut-Andrei Diaconu is with the German Aerospace Center (DLR), 82234 Weßling, Germany, and also with the Technical University of Munich (TUM), 80333 Munich, Germany (e-mail: codrut-andrei.diaconu@dlr.de).
Digital Object Identifier 10.1109/LGRS.2024.3356160
In [6], it was shown that there is a significant nonlinear part in the relationship between climate and MB. Supporting this assumption, Bolibar et al. [7] show that deep learning (DL) captures the nonlinear response of MBs to temperature and precipitation, especially in extreme cases, better than classical approaches, such as linear statistical and temperature-index models.
This contrasts with the commonly used glacier MB models that can be applied at a large scale. These often rely on temperature-index models [1], which assume a linear relationship between the days with above-zero temperatures and the melting of ice or snow [8]. Hence, it is promising to apply DL methods, such as nonlinear neural networks (NNs), or classical machine learning (ML) models, such as random forest (RF), as statistical methods to predict glacier MBs. However, these models use data based on in situ measurements (e.g., using ablation stakes) or remotely sensed data (e.g., using digital elevation model (DEM) differencing) [9], [10]. Both approaches have nonnegligible uncertainties due to measurement errors, sampling biases, or shortcomings in the methodology. For example, the in situ measurements have an accuracy typically lying between 0.1-m water equivalent (w.e.) and 0.6-m w.e. [11]. Thus, MB models are trained with noisy labels, yet should ideally make noise-free predictions. Uncertainty quantification (UQ) methods could address this issue by modeling the data noise and the model uncertainty, thereby disentangling them from the mean predictions.
ML has recently become popular for MB modeling: [7] projects the 21st-century glacier evolution in the French Alps with a standard multilayer perceptron (MLP) model for MB as a better alternative to linear regression (LR); [12] models winter point MBs using a gradient boosting regressor (GBR); [13] estimates annual point MBs using four different methods, i.e., support-vector machine (SVM), RF, GBR, and MLP. None of these studies models any uncertainty source, and all of them use only the testing errors as a quality indicator. Given that glaciers are critical components in the Earth system and a significant percentage of the world's population (∼22%) relies on their water storage capacity [2], if policy makers are to make decisions based on predictions obtained from ML/DL methods, then it is paramount that these methods are not just black-box tools but provide reliable uncertainty estimates. We aim to bridge this gap by coupling NNs with different UQ methods for MB prediction and investigating their behavior with respect to label noise, making the following contributions.
1) We provide a dataset for MB regression suitable for studying UQ methods.
2) By systematically adding label noise, we compare various models (LR, RF, and six MLP versions coupled with different UQ components) with respect to their predictive performance, the quality of the UQ estimates, and their robustness against noisy labels.

II. DATASET
There are various limitations for datasets of MB reconstructions. For in situ measurements, these include limited annual glacier-wide observations, e.g., the world glacier monitoring service (WGMS) [14] (a database gathering all in situ measurements) contains fewer than 500 glaciers, which is considerably less than the almost 200 000 glaciers worldwide [15]. In addition, there are uncertainties such as measurement accuracy, the distribution of the ablation stakes or snow pits, and the interpolation method, which are glacier-specific and difficult to estimate [11]. For MB estimates based on remote sensing techniques, an advantage is the increased coverage, and various approaches to estimate glacier MB have been proposed (see Table 3 from [10]). However, there are also sources of uncertainties (e.g., the volume-to-mass conversion) and discrepancies between derived MB estimates [10]. In addition, geodetic estimates at glacier level are usually available as multiannual averages, thus limiting their use for calibrating annual/seasonal models [10]. Thus, we use MB reconstructions instead of measurements to investigate the potential of ML and DL models for annual glacier MB modeling.

A. Dataset Construction
We use the open global glacier model (OGGM) [16], an open-source framework for glacier modeling, to reconstruct the annual MBs of the 1000 largest glaciers in Central Europe (out of 3927, cf. [15]), which cover about 90% of the total glaciated area in the region. The MB model used in OGGM requires temperature and precipitation as drivers [17], which are obtained from [18]. OGGM calibrates its parameters using the 20-year average MBs from [19]. We limit the analysis to the same 20-year period (i.e., 1999-2019), resulting in a total of 20 × 1000 = 20 000 data entries. As inputs, we use the same meteorological drivers (i.e., monthly temperature and precipitation averages) as OGGM, as well as six topographical features (area, minimum, maximum, and mean elevation, slope, and aspect, the latter encoded as sine and cosine), resulting in a 31-D input. The topographical features are added to compensate for the fact that OGGM estimates the MB in a pointwise manner, along multiple lines distributed over a glacier (called flow lines [16]), whereas in our approach we train a glacier-wide MB model.
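As an illustration of how the 31-D input could be assembled, the following minimal sketch concatenates 12 monthly temperature averages, 12 monthly precipitation averages, and the topographical features, with the aspect expanded into its sine and cosine. The 12 + 12 + 7 breakdown is inferred from the stated dimensionality, and the function name, argument names, ordering, and units are assumptions made for this example rather than the layout used in the released code.

```python
import math

N_MONTHS = 12  # one value per month for each climate driver (assumed layout)


def build_features(monthly_temp, monthly_prcp, area, z_min, z_max, z_mean,
                   slope, aspect_deg):
    """Assemble one glacier-year input vector (hypothetical ordering)."""
    if len(monthly_temp) != N_MONTHS or len(monthly_prcp) != N_MONTHS:
        raise ValueError("expected 12 monthly values per climate variable")
    aspect_rad = math.radians(aspect_deg)
    # Aspect is cyclic (0 and 360 degrees coincide), so it is encoded as a
    # sine/cosine pair, which turns six features into seven dimensions.
    topo = [area, z_min, z_max, z_mean, slope,
            math.sin(aspect_rad), math.cos(aspect_rad)]
    return list(monthly_temp) + list(monthly_prcp) + topo


# Made-up values for a single glacier-year.
x = build_features([0.0] * 12, [100.0] * 12, area=5.2, z_min=2100.0,
                   z_max=3600.0, z_mean=2900.0, slope=18.0, aspect_deg=90.0)
print(len(x))  # 12 + 12 + 7 = 31
```

The sine/cosine pair keeps aspects of 359 and 1 degrees close together in feature space, which a raw angle would not.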

B. Label Noise Injection
The reconstructed MBs have a mean of −0.73-m w.e., reflecting the observed mass loss over the past two decades [19], and a standard deviation $\sigma_{\text{data}} \approx 0.78$-m w.e. We inject Gaussian noise into the labels and build five scenarios denoted by $z = z_{\text{noise}}$, $z_{\text{noise}} \in \{0.1, 0.2, \ldots, 0.5\}$, where $z_{\text{noise}}$ controls the noise values $\eta$ relative to $\sigma_{\text{data}}$: $\eta \sim \mathcal{N}(0, \sigma_{\text{noise}}^2)$, where $\sigma_{\text{noise}} = z_{\text{noise}} \cdot \sigma_{\text{data}}$. This results in a noise standard deviation varying from 0.08- to 0.39-m w.e., similar to the range of the errors estimated for the in situ measured data [11]. Noise-free labels are denoted by $z = 0.0$. Given that the region we cover in our dataset is relatively small, we found that a Gaussian homoscedastic noise model is a reasonable choice, as it approximately matches the estimated errors of the observed MBs used for calibrating OGGM [19]. A more detailed explanation and the limitations of this choice are provided in the supplementary material. Another reason is that the focus of our study is on investigating whether coupling the models with UQ components improves their robustness against label noise, making use of the total predictive uncertainties rather than focusing on the aleatoric uncertainty alone.
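The noise scenarios can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `inject_label_noise`, the seeds, and the synthetic stand-in labels (drawn to roughly match the reported mean and standard deviation) are assumptions for the example, not the authors' implementation or data.

```python
import random
import statistics


def inject_label_noise(labels, z_noise, seed=0):
    """Add homoscedastic Gaussian noise with std z_noise * std(labels)."""
    sigma_data = statistics.pstdev(labels)
    sigma_noise = z_noise * sigma_data
    rng = random.Random(seed)
    noisy = [y + rng.gauss(0.0, sigma_noise) for y in labels]
    return noisy, sigma_noise


# Synthetic stand-in for the 20 000 reconstructed MBs (NOT the real dataset).
_rng = random.Random(42)
clean = [_rng.gauss(-0.73, 0.78) for _ in range(20000)]

for z in (0.1, 0.2, 0.3, 0.4, 0.5):
    noisy, sigma_noise = inject_label_noise(clean, z)
    print(f"z = {z:.1f}: sigma_noise = {sigma_noise:.2f} m w.e.")
```

Because the noise is scaled by the standard deviation of the labels themselves, the same values of $z_{\text{noise}}$ remain comparable if the dataset is replaced by one with a different spread.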

III. METHODS

A. Brief Introduction of Methods
Given the set of input-target pairs from our dataset, $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{K}$, the task of the models is to predict a target $y^\star \in Y$ given an input $x^\star \in X$ such that the loss objective between the predictions and targets is minimized over all the training points. The model can be regarded as a function $f_\theta$, parameterized by weights $\theta$, which maps inputs $x$ either directly to targets $y \in Y$, $f_\theta : X \to Y$, or to a probability distribution over $Y$.
In the following, we briefly describe the eight models used, which include an LR model, an RF regressor, a baseline MLP, and five NN variants built upon it. A more detailed description is provided in the supplementary material.
Linear Regression: standard multi-LR model used as baseline, to support the claim that nonlinear models are more suitable for glacierwide MB modeling.
Random Forest: introduced by Breiman [20], it consists of training randomized decision trees using bootstrapping and then aggregating their predictions by averaging. A review of RF as a powerful tool for classification and regression is provided in [21]. We moreover consider the variance of the predictions as a measure of predictive uncertainty.
Multilayer Perceptron: a simple fully connected network with two hidden layers and a nonlinear activation function, used as baseline.
Gaussian MLP (MLP+NLL): a deterministic model that predicts the parameters of a Gaussian distribution in a single forward pass, where the standard deviations $\sigma_\theta(x^\star)$ can be used as a measure of data uncertainty. This is achieved by extending the output of the previous architecture to two dimensions and training it with the negative log-likelihood (NLL) of a Gaussian as the loss objective [22].
MC-Dropout (MLP+MCD): an approximate Bayesian method with sampling, as in [23]. A fixed dropout rate $p$ is added, meaning that the weights are randomly set to zero during each forward pass with probability $p$. This models the network weights and biases as a Bernoulli distribution with dropout probability $p$. We also consider combining this method with the previous model (Gaussian MLP), as in [22], aiming to disentangle the data and model uncertainties, abbreviated as MLP+NLL+MCD.
Ensembles [Ensemble (MLP)]: introduced in [24], Deep Ensembles approximate a posterior distribution over the model weights with a Gaussian mixture model over the output of separately initialized and trained networks. Wilson and Izmailov [25] showed that Deep Ensembles can be interpreted as a Bayesian method. In addition, each ensemble member can be a Gaussian MLP, denoted as Ensemble (MLP+NLL).
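To make the Gaussian output and the ensemble aggregation concrete, the sketch below computes the per-sample Gaussian NLL and moment-matches a uniform mixture of the members' Gaussians into a single mean and total variance, as commonly done for Deep Ensembles. The split of the total variance into a data term (average predicted variance) and a model term (spread of the member means) is a common convention assumed here; whether it matches the exact decomposition used in the experiments is not confirmed by the text, and all function names are hypothetical.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return (0.5 * math.log(2.0 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2.0 * sigma ** 2))


def combine_ensemble(mus, sigmas):
    """Moment-match a uniform Gaussian mixture for a single input.

    mus, sigmas: per-member predicted means and standard deviations.
    Returns (mean, total_std, data_std, model_std).
    """
    m = len(mus)
    mean = sum(mus) / m
    data_var = sum(s ** 2 for s in sigmas) / m           # average noise variance
    model_var = sum((mu - mean) ** 2 for mu in mus) / m  # disagreement of means
    total_std = math.sqrt(data_var + model_var)
    return mean, total_std, math.sqrt(data_var), math.sqrt(model_var)


# Two hypothetical members that agree on the noise level but not on the mean.
mean, total_std, data_std, model_std = combine_ensemble([1.0, 3.0], [1.0, 1.0])
print(mean, data_std, model_std)
```

For an ensemble of plain MLPs (without the NLL head), only the model term is available, which is consistent with such an ensemble reporting smaller uncertainties.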

B. Metrics
Regression tasks are commonly evaluated by accuracy metrics such as root mean squared error (RMSE), mean absolute error (MAE), or coefficient of determination ($R^2$). A better quality of prediction is indicated by a lower RMSE and MAE and an $R^2$ score close to 1.0. However, these measures only characterize the error between point predictions and available targets. To compare the predictive uncertainties to the target distribution, we need additional metrics, such as proper scoring rules [26]. We consider the NLL of a Gaussian as a proper scoring rule [26]. We also report the miscalibration area, where a lower miscalibration area indicates a better fit of the predictive uncertainties to the true target distribution. To quantify the overall confidence of a model in a single metric, we consider sharpness, which computes the mean of the predictive uncertainties. We use [27] for computing these metrics.
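The two UQ metrics can be illustrated with a simplified, standard-library sketch. It assumes Gaussian predictive distributions, builds a quantile-based calibration curve from the probability integral transform, and integrates the absolute gap to the diagonal with the trapezoidal rule; sharpness follows the description above (mean predictive standard deviation). The toolbox in [27] may differ in details such as interval-based proportions or binning, so the numbers produced here are not expected to reproduce the reported ones.

```python
import random
from statistics import NormalDist


def sharpness(sigmas):
    """Mean predictive standard deviation (lower means more confident)."""
    return sum(sigmas) / len(sigmas)


def calibration_curve(y_true, mu, sigma, n_bins=100):
    """Expected vs. observed proportions of targets below each quantile."""
    std_normal = NormalDist()
    pit = [std_normal.cdf((y - m) / s) for y, m, s in zip(y_true, mu, sigma)]
    expected = [i / n_bins for i in range(n_bins + 1)]
    observed = [sum(p <= q for p in pit) / len(pit) for q in expected]
    return expected, observed


def miscalibration_area(expected, observed):
    """Area between the calibration curve and the diagonal (trapezoids)."""
    gaps = [abs(o - e) for e, o in zip(expected, observed)]
    return sum(0.5 * (gaps[i] + gaps[i + 1]) * (expected[i + 1] - expected[i])
               for i in range(len(gaps) - 1))


# Synthetic check: targets drawn from N(0, 1), evaluated with a matching and
# with an overconfident (too small) predictive standard deviation.
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(2000)]
zeros = [0.0] * len(y)
area_calibrated = miscalibration_area(
    *calibration_curve(y, zeros, [1.0] * len(y)))
area_overconfident = miscalibration_area(
    *calibration_curve(y, zeros, [0.3] * len(y)))
print(area_calibrated < area_overconfident)
```

An overconfident model obtains a small (seemingly good) sharpness together with a large miscalibration area, which is why the two metrics are read jointly.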

IV. EXPERIMENTS

A. Evaluation Details
From the dataset, we keep 20% of the glaciers for testing, and the remaining glaciers are split at glacier level into training and validation (90% and 10%, respectively). To reduce the impact of randomness in our results, we repeat the experiments ten times with different data splits and different model initializations. The hyperparameters of each method are provided in the supplementary material.
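A glacier-level split, in which all 20 annual entries of a glacier end up in the same subset, could look like the following sketch. The function name, the glacier identifiers, and the use of the seed as the repetition index are assumptions for illustration.

```python
import random


def split_by_glacier(glacier_ids, test_frac=0.2, val_frac=0.1, seed=0):
    """Split unique glacier IDs into train/val/test sets.

    val_frac is applied to the glaciers remaining after the test split.
    """
    ids = sorted(set(glacier_ids))
    random.Random(seed).shuffle(ids)
    n_test = round(test_frac * len(ids))
    test_ids, rest = ids[:n_test], ids[n_test:]
    n_val = round(val_frac * len(rest))
    val_ids, train_ids = rest[:n_val], rest[n_val:]
    return set(train_ids), set(val_ids), set(test_ids)


# Hypothetical IDs: 1000 glaciers with 20 annual entries each.
glacier_ids = [f"G{i:04d}" for i in range(1000) for _ in range(20)]

# Ten repetitions with different splits, one seed per repetition.
splits = [split_by_glacier(glacier_ids, seed=s) for s in range(10)]
train, val, test = splits[0]
print(len(train), len(val), len(test))  # 720 80 200
```

Splitting on glacier identifiers rather than on rows avoids having different years of the same glacier in both the training and test sets.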

B. Evaluation of Mean Predictions
Controlling the label noise allows us to compare the models with respect to robustness, by analyzing which models can still predict the true labels accurately when trained with increasing label noise.
In Table I, we show the MAE, RMSE, and $R^2$ on clean labels for all the models trained with the noisy labels with $z = 0.3$. LR performs the worst, with a significant gap compared with the other models. The two MLP ensemble versions perform the best on all the metrics, closely followed by RF. For $R^2$ scores, we observe that all the methods (except LR) attain values in [85%, 90%], with the Ensemble (MLP+NLL) model outperforming RF by only 1.8%. The tables with the accuracy metrics for the other noise levels are included in the supplementary material and show the same trends. We also included the mean bias error (MBE) as an additional metric, which is in general very small (less than 3-cm w.e.), with little variance across methods and no correlation to the noise level.
We investigate the robustness to training on increasing label noise by assessing the MAE of the models. Fig. 1 shows the MAE distribution of the ten differently initialized models. Taking into account the variance due to initialization, the two MLP ensemble methods still perform best, followed by RF. As expected, the MAE increases with increasing label noise for all the models. Table I shows that the variance of the results is comparable across methods.
To assess which models are affected the most by the increasing noise, we show the average MAE scores obtained when training on clean labels as a baseline and compute the change (expressed in percentages) when training on noisy labels in Table II. All the models show increasing MAE with increasing noise, reaching up to a 16.5% increase (MLP+NLL for $z = 0.5$). The Ensemble (MLP) and Ensemble (MLP+NLL) increase only by 3.6% and 5.4%, respectively, whereas the others exceed +10%. The models trained with NLL (i.e., MLP+NLL and MLP+NLL+MCD) are relatively more affected, reaching +10% already at $z = 0.4$. This is also reflected when comparing the two MLP Ensemble versions, where Ensemble (MLP) performs better.
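The relative change reported in Table II is simply the MAE obtained after training on noisy labels compared with the clean-label baseline, both evaluated on clean test labels. The sketch below spells this out; the function names are hypothetical and the MAE values are made up for illustration, not taken from the tables.

```python
def mae(y_true, y_pred):
    """Mean absolute error between targets and point predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_change_pct(mae_noisy, mae_clean):
    """Change in MAE (in percent) relative to training on clean labels."""
    return 100.0 * (mae_noisy - mae_clean) / mae_clean


# Made-up averages over repetitions, both evaluated on clean test labels.
mae_clean_training = 0.20  # model trained with z = 0.0
mae_noisy_training = 0.22  # same model type trained with some z > 0
change = relative_change_pct(mae_noisy_training, mae_clean_training)
print(f"{change:+.1f}%")  # +10.0%
```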

C. Evaluation of Predictive Uncertainties
In Section IV-B, we showed that coupling the models with uncertainty estimation components helps improve their robustness and yields improved accuracy. Yet, in many applications it is also important to provide uncertainty estimates, e.g., to make risk assessments or to withdraw from predictions that have high uncertainty. In a real-world scenario, one does not have access to the clean labels that we previously exploited for robustness evaluation. In this section, we investigate how well the UQ methods capture the uncertainties, using the metrics described in Section III-B evaluated with the noisy labels. Moreover, we assess whether these methods provide useful uncertainties using selective prediction, as introduced in [28]. Here, samples with a predictive uncertainty above a given threshold are omitted from prediction and, e.g., referred to an expert or a different method. If larger uncertainties are correlated with worse predictions, this increases the overall accuracy.
Table III shows the miscalibration area, sharpness, and NLL scores obtained for the average noise case ($z = 0.3$). Compared with the previous results, the discrepancies between the methods are higher. The Ensemble (MLP+NLL) model obtains a lower miscalibration area compared with Ensemble (MLP) (which performs the worst), closely followed by RF. The sharpness is much smaller for the standard MLP ensemble. There are large variations in the NLL scores, and the Ensemble (MLP+NLL) again performs the best, with RF performing similarly and Ensemble (MLP) the worst. The tables with the UQ metrics for the other noise levels are included in the supplementary material and show the same trends. For a more detailed analysis, a figure of the calibration curves for all the noise levels is also included in the supplementary material, where it can be observed that the Ensemble (MLP) model is highly overconfident, reflected by the high miscalibration area in Table III.
Finally, we assess whether the uncertainty scores can improve the accuracy with selective prediction. Fig. 2 shows the average performance (MAE) of each model against the coverage percentage, i.e., the percentage of samples with the lowest predictive uncertainties, the remaining ones being dropped. Ideally, we want to see a better performance when using the least uncertain samples, but we can see that only the ensemble methods (including RF) have this behavior.

Fig. 2. Selective prediction: Test performance (MAE) on clean labels averaged for all the data points which have the estimated total uncertainty score below a certain threshold (x-axis). The models are trained on the noisy labels with $z = 0.3$.

Selective prediction applied to the models trained on the other noise levels shows similar trends and is included in the supplementary material.
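The selective prediction curve can be sketched as follows: samples are ranked by their predictive uncertainty, and the MAE is computed on the most certain fraction for each coverage level. The function name and the toy values are assumptions for the example; in the experiments, the total predictive uncertainty of each model would serve as the ranking score.

```python
def selective_mae(y_true, y_pred, uncertainty, coverages):
    """MAE over the least-uncertain fraction of samples, per coverage level."""
    order = sorted(range(len(y_true)), key=lambda i: uncertainty[i])
    abs_err = [abs(y_true[i] - y_pred[i]) for i in order]
    curve = {}
    for c in coverages:
        k = max(1, round(c * len(abs_err)))  # keep at least one sample
        curve[c] = sum(abs_err[:k]) / k
    return curve


# Toy case in which larger uncertainty goes along with larger error.
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [0.1, 0.2, 0.3, 0.4]
unc = [1.0, 2.0, 3.0, 4.0]
curve = selective_mae(y_true, y_pred, unc, coverages=(0.5, 1.0))
print(curve)
```

If the uncertainties are informative, the MAE decreases as the coverage shrinks; a flat or increasing curve indicates that the uncertainty scores do not rank the errors well.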

V. DISCUSSION
The results described in Section IV-B indicate that a linear model is not sufficient for glacier-wide MB modeling, thus suggesting that the problem is nonlinear, as found in previous studies [6], [29]. Among the nonlinear methods, we observe that the top performing ones are the ensemble methods, including RF. The fact that Ensemble (MLP+NLL) yields the overall best results provides evidence that coupling the model with this aleatoric uncertainty component also improves predictions. However, training an RF remains easier and faster, which also makes it a good candidate, with a relatively small performance gap compared with Ensemble (MLP+NLL). We also found RF to be less sensitive to the choice of hyperparameters compared with the MLPs, which is probably also explained by the larger ensemble size (up to 500 trees were used versus only ten for the MLPs).
When analyzing the influence of increasing the label noise on performance, ensembles of MLPs are again favored, as their performance degrades more slowly compared with the other methods. Here, the gap between RF and the ensembles of MLPs is larger, which indicates that RF is more prone to overfitting on our dataset. In the large-scale study from [30], tree-based models were found to perform better on tabular data than NNs, as they can approximate irregular functions, whereas NNs tend to be biased toward smoother solutions. However, in our context, this inductive bias could be beneficial when dealing with large amounts of noise, potentially making NNs less prone to overfitting, an aspect which was previously investigated for classification tasks in [31].
Concerning UQ (Section IV-C), the complementary metrics we used (i.e., calibration, sharpness, and NLL) reveal that Ensemble (MLP+NLL) matches our dataset distribution the best. Ensemble (MLP) is overconfident, which explains the comparably low sharpness. This indicates that predicting the parameters of a Gaussian enables disentangling the model and data uncertainty, as shown in the figures of model and data uncertainty in the supplementary material. The RF also provides relatively well-calibrated predictions. Furthermore, the selective prediction results also indicate that the three ensemble methods perform the best. The Ensemble (MLP+NLL) slightly outperforms the others, as its average MAE stays relatively low until it reaches a significant coverage (≥ 75%), thus making it a good candidate in practice. The nonensemble models perform in general worse, both from the perspective of predictive power and uncertainty estimation.

VI. CONCLUSION
We introduce a simple and relatively small dataset for an important regression task in glacier modeling: predicting the annual MB using meteorological and topographical drivers. We then compare various methods (LR, RF, MLPs, and ensembles of MLPs) on how they perform when trained with noisy labels while still evaluating them on the clean labels. The ensemble methods (including RF) performed the best, being more robust when increasing the label noise. When coupling the ensemble of MLPs with a Gaussian output, thus explicitly modeling the data uncertainty, the performance increases and the predictions become significantly better calibrated. The uncertainties from the ensemble methods can also be used for selective prediction, leading to more accurate and reliable MB predictions while still keeping a significant coverage.
We would therefore recommend the ensemble methods for glacier-wide MB modeling to the cryosphere community, in particular the Ensemble (MLP+NLL) version. However, these models are sensitive to the hyperparameters, so significant effort should be allocated to tuning them. From this perspective, RF was more robust, but we would still recommend performing hyperparameter optimization (HPO): we observed that the final models grow smaller trees when having a large amount of noise, an indicator that HPO can prevent overfitting.
One promising extension of this study is to inject noise in the input data based on certain features (i.e., a heteroscedastic noise model), which is closer to the real setup, as, for instance, the remote-sensing-based MBs from [19] have larger errors for small glaciers.

Manuscript received 29 September 2023; revised 22 December 2023; accepted 3 January 2024. Date of publication 19 January 2024; date of current version 6 February 2024. This work was supported by the Helmholtz Association Initiative and Networking Fund on the HAICORE@FZJ Partition. The work of Codrut-Andrei Diaconu was supported by the Helmholtz Association through the Joint Research School Munich School for Data Science (MuDS) under Grant HIDSS-0006. (Corresponding author: Codrut-Andrei Diaconu.)

Fig. 1. Robustness evaluation: Test performance (MAE) for all the models (except LR) trained on multiple levels of noise and evaluated on clean labels.

TABLE I
ACCURACY METRICS: PERFORMANCE SCORES (µ ± σ) EVALUATED ON CLEAN LABELS FOR THE MODELS TRAINED WITH AN AVERAGE AMOUNT OF NOISE (z = 0.3)

TABLE II
RELATIVE PERFORMANCE DIFFERENCE: CHANGE IN AVERAGE MAE SCORES, EVALUATED ON CLEAN LABELS, WHEN TRAINING ON NOISY LABELS (z ≥ 0.1) COMPARED WITH TRAINING ON CLEAN LABELS (z = 0.0)

TABLE III
UQ METRICS: (µ ± σ) FOR THE MODELS TRAINED AND EVALUATED ON THE NOISY LABELS WITH z = 0.3