Flash Floods Prediction Using Precipitable Water Vapor Derived From GPS Tropospheric Path Delays Over the Eastern Mediterranean

A flash flood is a rapid and intense response of a drainage area to heavy rainfall events. In the arid and semiarid parts of the Eastern Mediterranean (EM) region, the spatiotemporal distribution of rainfall is the most important factor for flash flood generation. A possible precursor to heavy rainfall events is a rise in the tropospheric water vapor amount, which can be remotely sensed using ground-based global navigation satellite system (GNSS) stations. Here, we use the precipitable water vapor (PWV) derived from nine GNSS ground-based stations in the arid part of the EM region in order to predict flash floods. Our approach includes using three types of machine learning (ML) models in a binary classification task, which predicts whether a flash flood will occur given 24 h of PWV data. We train our models with 107 unique flash flood events and rigorously test them using a nested cross-validation technique. The results indicate a good agreement between all three types of models and across various score metrics. In addition, the models are further improved by adding more features such as surface pressure measurements. Finally, a feature importance analysis shows that the most important features are the PWV values from 2 to 6 h prior to a flash flood. These promising results indicate that it is possible to augment the current flash flood warning systems with a near real-time GNSS ground-based data-driven approach as demonstrated in this work.


I. INTRODUCTION
Flash floods are rapid and high-intensity flooding events, which are mainly caused by heavy rainfall. Since flash floods have a short response time of several hours, they are difficult to predict and cause damage and even casualties [1]. Among the factors that control flash flood generation (e.g., soil saturation and surface cover), the spatiotemporal distribution of rainfall is the most significant one. Therefore, in order to predict flood events, one must first account for the heavy rainfall events' location and timing, which can be achieved with remote sensing platforms (e.g., weather radar [9], [10], [11]). Another option is to measure the water vapor (WV) amount in the atmosphere in order to detect a mass moisture transport that is a prerequisite to heavy rainfall events. One such method is global navigation satellite system (GNSS) meteorology, which can continuously provide a near real-time estimate of the precipitable WV (PWV) above a ground-based receiver's location [12], [13].

GPS satellites that orbit the Earth at ≈20 200 km transmit navigation messages via radio waves to ground-based receivers. Each navigation message includes information that allows the ground-based user to determine its position with centimeter- or even millimeter-level accuracy [14], [15]. This precise point positioning (PPP) method, as it is called, can also be used to estimate the PWV content between each GPS satellite and the ground-based receiver. As they reach Earth, the transmitted radio waves are effectively dispersed by the ionosphere and absorbed by the troposphere [16]. These processes produce a measurable delay in the radio message upon arrival at the ground-based receiver.
The ionospheric dispersion effect can be accounted for since GPS satellites transmit the radio waves in at least two frequency bands (e.g., L1 = 1575.42 MHz, L2 = 1227.6 MHz, and L5 = 1176.45 MHz) [17]. The tropospheric delay, or zenith tropospheric delay (ZTD),¹ consists of two major sources of absorption processes: the hydrostatic delay, or zenith hydrostatic delay (ZHD),¹ which is mainly due to the effect of atmospheric pressure [18] on the radio signal, and the wet delay, which is due to the radio signal's interaction with water molecules [19]. The wet delay can be estimated by subtracting the ZHD from the ZTD.
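For intuition, the first-order ionospheric delay scales as 1/f², so pseudoranges measured on two frequencies can be combined to cancel it. The following is a minimal sketch of this standard dual-frequency "ionosphere-free" combination; the frequency values are those quoted above, while the function name and example numbers are ours.

```python
# Standard dual-frequency ionosphere-free combination: removes the
# first-order ionospheric delay, which scales as 1/f^2.
F_L1 = 1575.42e6  # Hz (from the text)
F_L2 = 1227.60e6  # Hz

def ionosphere_free(p1, p2, f1=F_L1, f2=F_L2):
    """Ionosphere-free combination of two pseudoranges (meters)."""
    g1, g2 = f1 ** 2, f2 ** 2
    return (g1 * p1 - g2 * p2) / (g1 - g2)

# A pseudorange pair that differs only by a 1/f^2 ionospheric term maps
# back to the geometric range:
rho, iono_l1 = 2.2e7, 5.0                      # meters
p1 = rho + iono_l1
p2 = rho + iono_l1 * (F_L1 ** 2 / F_L2 ** 2)   # delay grows as 1/f^2
```

The tropospheric delay, in contrast, is nondispersive and cannot be removed this way, which is exactly why it must be modeled as described above.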

In each ground-based receiver, the radio messages are stored as text files in the Receiver Independent Exchange Format (RINEX) and can be processed by dedicated software such as NASA's JPL GipsyX [20]. The processing involves complex inversion algorithms [21], [22] and is used to solve for the precise position of the ground-based receiver. From the position solution of the receiver, the ZTD can be extracted, while the ZHD is provided by empirical mapping functions [23], [24]. Here, we used the GipsyX default option, i.e., the GMF along with GPT2 data files, to obtain the ZHD estimates. The obtained zenith wet delay (ZWD = ZTD − ZHD) is proportional to the total amount of WV in a vertical atmospheric column. Thus, the ZWD can be converted into PWV by using a WV mean atmospheric temperature [18]. In the past 30 years, GNSS-derived PWV has been extensively compared to many other remote sensing platforms (e.g., Sun photometers and radiometers), radiosonde in situ measurements, and reanalysis products, resulting in a root mean square error (RMSE) ranging from 1 to 3 mm [25], [26], [27], [28], [29], [30], [31].

¹The delay is given in zenith values by using mapping functions.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Fig. 1. PWV at the Yerucham (YRCM) GNSS station superimposed on the water discharge (flow) at the Mamsheet hydrometric station located 12 km east of YRCM on April 24-27, 2018. Note the three major flash flood events on the 25th, the 26th, and the 27th. The PWV more than doubled during the second half of the 24th as a low-pressure system provided large quantities of moisture to the region.
Furthermore, PWV maps can also be assimilated into modern numerical weather prediction models (e.g., WRF), which effectively lowers the WV prediction RMSE by more than 30% when compared to radiosonde measurements [32].
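The ZWD-to-PWV conversion mentioned above can be sketched as follows. The refractivity constants below are typical literature (Bevis-style) values and are not necessarily the exact constants used in the paper ([13], [18]); the function name is ours.

```python
# Hedged sketch of the standard ZWD -> PWV conversion using typical
# literature values of the refractivity constants (assumptions, not the
# paper's exact constants).
RHO_W = 1000.0  # density of liquid water, kg/m^3
R_V = 461.5     # specific gas constant of water vapor, J/(kg K)
K2P = 0.221     # k2' refractivity constant, K/Pa
K3 = 3739.0     # k3 refractivity constant, K^2/Pa

def pwv_from_zwd(zwd_m, tm_kelvin):
    """Convert zenith wet delay (m) to PWV (mm) given the WV mean temperature."""
    pi = 1.0e6 / (RHO_W * R_V * (K3 / tm_kelvin + K2P))  # dimensionless, ~0.15
    return pi * zwd_m * 1000.0  # meters of liquid-water column -> mm
```

With these values, a typical ZWD of 10 cm at a mean temperature of about 275 K yields roughly 15-16 mm of PWV, illustrating why the proportionality constant depends mainly on the mean atmospheric temperature.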

Using GNSS ground-based meteorology to monitor PWV before, during, and after heavy rainfall events is not new.

As seen in Fig. 1, the PWV more than doubled during the second half of the 24th as a low-pressure system entered the region. Moreover, the flood events' peak discharges lag behind the closest PWV peak values for the 26th and 27th events, while the 25th double-peaked event shows a more complex behavior.

Later on, they also used GNSS TEC data along with the SVM model for potentially predicting strong earthquake events [42]. Hsu [43] used an SVM classifier to separate GNSS pseudorange measurements into three categories: clean, multipath, and non-line-of-sight, evaluating several features estimated from the GNSS raw data, including the received signal strength. In addition, he also proposed a new feature to indicate the consistency between pseudorange and Doppler shift measurements. Linty et al. [44] used ML decision tree and random forest (RF) algorithms, applied to big sets of 50-Hz postcorrelated GNSS data, for the automatic, accurate, and early detection of amplitude ionospheric scintillation events, reaching a detection accuracy of 98%.

Our goal in this work is to investigate the ability of GNSS-derived PWV to predict flash flood events in the arid part of the EM region using three types of ML models. Accordingly, Section II describes the PWV data and flood events used in this work along with all the ML methodology [e.g., preprocessing, metrics, and cross validation (CV)]. Section III presents the ML models' performance along with a feature importance analysis. We discuss the results in Section IV, and our concluding remarks follow in Section V.

The GNSS processing settings are summarized in Table I (processing tree: my_trees/ISROcnld/ISROcnld_0.tree). The processing resulted in ZWD that was translated into PWV using the following formula [13]:

PWV = Π · ZWD

where Π is the dimensionless constant of proportionality and is mainly a function of the atmospheric mean temperature. We used automated stations and radiosonde measurements [12] in order to estimate this mean temperature and keep the PWV record continuous.

In order to detect the effect that PWV has on flood events, we averaged the PWV anomalies six days prior to each flood event (Table II).

We computed surface pressure anomalies as we did with the PWV data. Fig. 5 shows the mean pressure anomalies at the Bet-Dagan station prior to and after a flood event. As expected, the pressure drops before a flood event, representing a low-pressure system that produces precipitation events. The minimum pressure values are found about 6-8 h prior to a flood event. However, the variability is considerably higher than in the PWV dataset, possibly because pressure data from only one station represent all the flood events. Unfortunately, we could not find enough surface pressure records co-located with the selected hydrometric stations for the same data period. Furthermore, since summer rain is very rare in the EM [48], we also added the DoY information as a feature to our PWV and surface pressure features. Fig. 6 shows the flood event count for each month of the year.
It is clear that the most frequent month is January, with 30 events, while February-April and October-December have a mean of 11 events. May, June, and September have only a few events, while July and August have no flood events, as expected.

Our ML methodology consists of several steps, from the data preprocessing to producing the best model. We, therefore, elaborate on these steps in the following.

The PWV and surface pressure data are resampled to hourly means. We then co-located each GNSS and hydrometric station and found 24 data points of PWV prior to each flood event. If half or more of the PWV data were missing, we dropped the event from our analysis; otherwise, we used cubic interpolation to fill in the missing data points. We repeated this process with the surface pressure data; however, in this case, we had only one surface pressure station (Bet Dagan) with the necessary data period and resolution. This step leaves us with 49 features (48 for PWV and pressure, along with one for DoY). As for the negative class, we randomly searched for 24 h of PWV and pressure that do not overlap the positive features, repeating this step only once for each flood event at each station, thus ensuring that the binary classification task is balanced. Our resulting matrix of features and samples is 214 (107 for each class) by 49. Finally, since two of our classifiers are sensitive to feature normalization, we use the standardized³ version of the PWV and surface pressure anomalies for all the classifiers.
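The windowing and scaling steps just described can be sketched as follows. The function names are illustrative (not the paper's code), and a linear gap fill stands in for the paper's cubic interpolation to keep the sketch dependency-free.

```python
import numpy as np

# Sketch of the event-window preprocessing: keep the 24 hourly PWV values
# preceding an event, drop the event if half or more are missing, fill
# remaining gaps, and standardize each feature column.

def event_window(hourly_pwv):
    """24 hourly PWV values ending at the event time; NaN marks missing."""
    hourly_pwv = np.asarray(hourly_pwv, dtype=float)
    missing = np.isnan(hourly_pwv)
    if missing.sum() >= 12:  # half or more missing -> drop the event
        return None
    idx = np.arange(24)
    filled = hourly_pwv.copy()
    # linear fill as a stand-in for the paper's cubic interpolation
    filled[missing] = np.interp(idx[missing], idx[~missing], filled[~missing])
    return filled

def standardize(X):
    """Zero-mean, unit-variance scaling per feature column."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Stacking one such window per event (plus the pressure window and DoY) reproduces the 214-by-49 feature matrix described above.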

Our main goal is to use supervised learning classifiers in order to predict flash floods using PWV as the main input. Accordingly, we chose three common types of ML models: SVM, RF, and multilayered perceptron (MLP). All the models were implemented using the Scikit-Learn Python package [49].
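With Scikit-Learn, the three model types can be instantiated along the following lines. The hyperparameter values here are placeholders; in the paper, they are tuned per data split via nested CV (Table IV).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# The three classifier types named above; hyperparameter values are
# placeholders, not the tuned values from the paper.
models = {
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), activation="relu",
                         max_iter=1000, random_state=0),
}
```

All three expose the same `fit`/`predict` interface, which is what makes it easy to run them through one shared CV and scoring pipeline.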

The SVM classifier utilizes a linear hyperplane to separate the sample classes [50]. Using the kernel trick, the hyperplane is transformed into a higher dimension, which gives the SVM more flexibility; however, the cost is a larger generalization error [51]. The RF classifier is a meta-classifier, which uses a number of decision trees on randomized selections of feature subsets. The final output is produced by averaging all the individual decision tree classifiers [52]. The MLP classifier is a neural network algorithm, which consists of multiple layers of weighted nodes [53]. Typically, the network architecture includes an input layer, any number of hidden layers, and an output layer, where each layer's nodes are connected via activation functions (so-called feedforward propagation). During the learning process, the weights are reevaluated using the iterative backpropagation algorithm [54] in order to decrease the cost function.

2) Score Metrics: We use six different metrics to evaluate the models' performance [55]: precision, recall, F1, accuracy, Heidke skill score (HSS), and true skill statistics (TSS). These metrics are defined in (2) and are combinations of the four possible outcomes of our classifier.

3) Cross Validation: The k-fold CV technique is used to evaluate model performance during training [58]. The data are divided into k segments, or folds, of the same size, where each fold is tested by the model and the other k − 1 folds are used as training data. The process repeats k times, where the best score of each fold's validation procedure is used to select the best model. Since most ML models have hyperparameters (e.g., a regularization coefficient) that need tuning, the CV step is often performed together with the hyperparameter tuning, a practice that can lead to overfitting [59].
A useful way of dealing with this issue is to separate the CV into two k-fold CV steps, the first tuning the model's hyperparameters and the second evaluating the model's performance by estimating the generalization error. This procedure is called double CV or nested CV and is often implemented using a nested loop, i.e., an inner loop that optimizes the hyperparameter space and an outer loop that estimates the generalization error [60]. Since nested CV requires a lot of computational time, we must balance the recommended number of folds [61] and the hyperparameter space against the computational time. Nevertheless, in order to quantify the bias that a particular selection of k can introduce into our generalization error, we run two nested CV configurations, one with four inner/outer folds and another with five inner/outer folds. This verification procedure is outlined in Section IV-A and concludes that there is little bias in selecting either k = 4 or k = 5. Thus, as shown in Fig. 8, we examine the optimal hyperparameters as a function of a particular fold and metric in order to choose the best "mean" hyperparameters. This step measures the model's sensitivity to the hyperparameter optimization and is also outlined in Section IV-A. These best "mean" hyperparameter sets, as will be shown in Section III, produce high enough scores and low k-fold variability in all models.
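The nested CV described above can be sketched with `GridSearchCV` as the inner loop and `cross_val_score` as the outer loop. The data, grid, and fold counts here are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Minimal nested (double) CV sketch: the inner loop tunes
# hyperparameters, the outer loop estimates the generalization error.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)

inner = KFold(n_splits=4, shuffle=True, random_state=0)
outer = KFold(n_splits=4, shuffle=True, random_state=1)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)
gen_scores = cross_val_score(search, X, y, cv=outer)  # one score per outer fold
```

Because the outer test folds are never seen by the inner search, the mean of `gen_scores` is an (approximately) unbiased estimate of the tuned model's generalization performance.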

4) Permutation Test:
We also subject our classifiers to the permutation test for labeled data [62]. This test, which has been extensively used in the field of computational biology, aims to address the following question: does the classifier detect a significant class structure, i.e., a real connection between the data and the class labels? We use a standard fivefold CV to estimate a null distribution by permuting the labels in the data and produce a "true" score without the permutations. The experimental p-value from these tests is calculated as

p-value = (S + 1) / (n_permutations + 1)

where S is the number of permutations whose score ≥ the "true" score. Since ideally S should be 0, the best possible p-value is 1/(n_permutations + 1); because we use 100 permutations, it is 1/101 = 0.0099, while the worst p-value, obtained when S = n_permutations, is 1.0.

3) We repeat the evaluation for each of the score metrics.

The results presented in this section use the aforementioned set of hyperparameters. We encourage the interested reader to look at Section IV-A, where the process of hyperparameter selection is outlined and discussed.

Fig. 9 shows the mean test scores and the variability due to data split selection for the SVM, RF, and MLP classifiers and for each metric. For most metrics, MLP generally has slightly worse scores than SVM and RF. Among the feature groups, DoY performs poorly, followed by surface pressure, which is second only to PWV, the feature group with the highest scores on its own. Adding pressure and DoY only slightly improves the scores for most models and metrics. DoY as a single feature has the highest fold selection variability, while all other features have lower variability.

In many data scenarios, the MDI-based feature importance method contains biases and should not be relied upon [64]; however, it is not clear whether this is the case in our work.
Nevertheless, we decided to validate our findings using a game-theory-inspired method of feature importance based on Shapley values [65]. We use the SHAP Python package [66] and calculate the mean SHAP values for the three feature groups, where the RF is trained with its best HPs. The result, shown in Fig. 13, paints an almost identical picture to Fig. 12: the most important PWV values are from 2 to 6 h prior to a flood, with two smaller peaks at 14 and 19 h prior to a flood.
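The feature-group breakdown can be sketched with Scikit-Learn's MDI-based `feature_importances_`; the SHAP validation is analogous, summing mean absolute SHAP values per group. The data here are synthetic, with the signal planted in two of the "PWV" columns, so the grouping (24 PWV + 24 pressure + 1 DoY) mirrors the paper's 49-feature layout only illustratively.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Group per-feature importances into the paper's three feature groups
# (synthetic data; signal planted in two "PWV" columns).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 49))            # 24 PWV + 24 pressure + 1 DoY
y = (X[:, 2] + X[:, 10] > 0).astype(int)  # signal lives in the PWV block

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = rf.feature_importances_             # MDI importances, sum to 1
groups = {
    "PWV": imp[:24].sum(),
    "pressure": imp[24:48].sum(),
    "DoY": imp[48:].sum(),
}
```

With a real feature matrix, the same three sums give the group-level percentages quoted in the conclusions.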

The imbalanced dataset test scores, presented in Fig. 14, show a drop in the precision and F1 metrics for all feature groups; however, for all other metrics, i.e., accuracy, TSS, HSS, and, most importantly, recall, the scores remain almost unchanged compared to the balanced test (Fig. 9).
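This behavior follows directly from the metrics' confusion-matrix definitions, written out below using their standard forms (consistent with [55]): more false positives lower precision and F1, while recall and the TN-aware skill scores are less affected.

```python
# The six score metrics, written out from the four confusion-matrix
# outcomes (TP, FP, TN, FN) using their standard definitions.
def scores(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate; few FNs -> high recall
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tss = recall - fp / (fp + tn)  # TPR minus FPR
    hss = 2.0 * (tp * tn - fp * fn) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "tss": tss, "hss": hss}
```

A perfect classifier scores 1 on all six metrics, while a no-skill classifier has HSS = TSS = 0, which is why these two are preferred for rare-event verification.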

To date and to the best of our knowledge, this work demonstrates for the first time the ability to directly predict flash flood events from GNSS-derived PWV using ML methodology. Thus, the first part of this discussion, presented in Section IV-A, concerns the technical validity of our approach.

Table III shows the best hyperparameters that were used in testing the models with different score metrics, producing ROC curves, and permutation testing. This set of hyperparameters was selected using a grid search over the hyperparameter ranges shown in Table IV. The basic idea is to search for the best hyperparameters per data split using different metrics. If we select one set of hyperparameters for all of the data splits and show that the test scores do not vary too much, then we can justify our choice empirically. In addition, if we can also show that the same results hold for different score metrics, it will increase their robustness.
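The per-split grid search can be sketched as follows; the grid values are illustrative and are not those of Table IV, and in the paper this search is repeated per data split and per score metric.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid search for the SVM (grid values are placeholders).
param_grid = {
    "kernel": ["rbf", "linear"],
    "C": [0.1, 1, 10],
    "gamma": [0.002, 0.02, 0.2],
}
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)
search = GridSearchCV(SVC(), param_grid, cv=4, scoring="f1").fit(X, y)
```

Repeating the search while only changing `scoring` (e.g., to `"recall"`) is exactly the per-metric breakdown examined below.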

Figs. 15-17 show the optimal hyperparameters with a data split and metric breakdown for the SVM, RF, and MLP classifiers, respectively. For all the classifiers, we can detect a change in the hyperparameters for the recall metric, which raises a red flag over its usage. For the SVM classifier, except for the recall metric, the best kernel is the radial basis function (rbf) with good estimates of the gamma and C values of 0.02 and 1, respectively. For the RF classifier, the min samples split parameter is very high for the recall metric compared to the other metrics, while all other parameters do not show any reaction when optimized by recall. Interestingly, min samples leaf changes in some splits when the CV strategy uses four inner folds as opposed to five inner folds. Finally, for the MLP classifier, the hidden layer sizes (NN architecture) parameter is almost evenly distributed, suggesting that this parameter is not important or that our search grid is not large enough. For the activation parameter (activation function), the recall metric optimizes this parameter to logistic as opposed to the rectified linear unit (relu) for all other metrics.

In order to test how the hyperparameter optimization for each inner fold fares on the test fold, we use the same metrics on each inner fold optimized with the same metric. The results are shown in Figs. 18 and 19 for the four and five inner folds' CV strategies, respectively.

If we compare these figures to Fig. 9, we can see that for all metrics except recall, the mean test scores are within their data split variability (the error bars in the bar plots). For recall, SVM finds perfect scores for most features, while MLP finds a perfect score for some features and large variability for others. This apparent instability further lowers the recall metric's reliability in this task.

As with the score test, we can produce ROC curves along with their AUC scores for each fold individually. Figs. 20 and 21 show the ROC curves for the four and five inner folds' CV strategies, respectively.
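Producing one ROC curve (and AUC) per fold can be sketched as follows; the data and model settings are illustrative, and in the paper the folds come from the outer CV loop.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# One ROC curve and AUC per CV fold (toy data).
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = (X[:, 0] > 0).astype(int)

aucs = []
for train, test in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    clf = SVC().fit(X[train], y[train])
    fpr, tpr, _ = roc_curve(y[test], clf.decision_function(X[test]))
    aucs.append(auc(fpr, tpr))
```

The spread of the per-fold `(fpr, tpr)` curves visualizes the same data split variability discussed for the bar plots.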

The imbalanced dataset test, which simulates a more rare event occurrence, shows a 30% mean improvement (not shown), suggesting that increasing the dataset for both training and testing might improve the classifier's performance for possibly rarer event occurrences. Furthermore, since the F1 metric is very sensitive to both the recall and the precision metrics, a drop in either lowers F1 considerably; thus, the lower F1 scores are expected since our precision dropped as well. As expected, the recall metric did not suffer the same decrease as the precision, and thus, we conclude that the classifier performs well at minimizing FNs, i.e., cases in which flash floods occur but no warning has been issued. This low miss rate is extremely important for an early warning system, as our classifiers have demonstrated through this test.

Fig. 16. Panel showing the optimal hyperparameters that were found using grid search CV for the RF classifier. Each set of hyperparameters was found for each outer split (row) and each metric (column). Moreover, the hyperparameters are also optimized when considering (Top) four and (Bottom) five inner folds. Each hyperparameter name is denoted in the title of each top panel, and its values are indicated in the accompanying colorbar.

Fig. 17. Panel showing the optimal hyperparameters that were found using grid search CV for the MLP classifier. Each set of hyperparameters was found for each outer split (row) and each metric (column). Moreover, the hyperparameters are also optimized when considering (Top) four and (Bottom) five inner folds. Each hyperparameter name is denoted in the title of each top panel, and its values are indicated in the accompanying colorbar.

In the EM area, we found two major studies that aim to predict flash floods and produce results that can be compared with ours. One such study is [10]; however, its study area is located ≈117 km north of the Dead Sea and is large enough (272 km²) to include Mediterranean, semiarid, and arid climates. Furthermore, the prediction algorithm they used was designed to predict three levels of peak discharge (<14 m³/s, 14-50 m³/s, and >50 m³/s), which can be translated into ML terms as a multiclass classification task. Since in this work we solve a binary classification task, we present the aggregated results from [3] to get an estimate of their model's performance. Thus, from a total of 20 events, 12 are correctly predicted (TP), 1 is missed (FN), and 7 are false alarms (FP). Since Rozalis et al. [3] did not include the TNs, we can only consider their TPR, which is 0.92.
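The quoted rate follows directly from the aggregated counts; the precision figure below is our own derived number, not one quoted in [3].

```python
# Worked check of the aggregated counts quoted from Rozalis et al. [3]:
# 12 hits (TP), 1 miss (FN), and 7 false alarms (FP). TNs were not
# reported, so only TPR and precision can be derived from these counts.
tp, fn, fp = 12, 1, 7
tpr = tp / (tp + fn)        # 12/13, about 0.92
precision = tp / (tp + fp)  # 12/19, about 0.63 (our derived figure)
```

Without the TNs, accuracy, TSS, and HSS cannot be computed, which is why only the TPR comparison is made in the text.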

Finally, hydrological models are meant not only for flood prediction but also for a deeper understanding of the underlying physics that drives flash floods. Our approach here, in contrast, is mostly data-driven and does not provide a clearer understanding of the flash flood phenomenon. Nevertheless, by using ML methodology, we were able to maximize the impact of the small amount of physics hidden in the PWV time series and produce a successful flash flood predictor, which can serve as the basis of an early warning system.

We have used nine GNSS ground stations in order to obtain PWV and used it to train, test, and validate a classifier for predicting flash floods in the arid part of the EM region.

Fig. 18. Mean test scores for the SVM, RF, and MLP classifiers (rows) and each metric (columns). The feature groups consist of DoY (purple), surface pressure (brown), PWV (blue), PWV and surface pressure (orange), and all three together (green). The mean scores are indicated at the top left of each bar, and the SD of the four data splits is represented by the error bar length. The difference between this figure and Fig. 9 is that here the hyperparameters were optimized for each inner fold, so there is a different set of hyperparameters for each fold (e.g., Fig. 16) as opposed to the single set of hyperparameters used in Fig. 9 (Table III).

Fig. 19. Mean test scores for the SVM, RF, and MLP classifiers (rows) and each metric (columns). The feature groups consist of DoY (purple), surface pressure (brown), PWV (blue), PWV and surface pressure (orange), and all three together (green). The mean scores are indicated at the top left of each bar, and the SD of the five data splits is represented by the error bar length. The difference between this figure and Fig. 9 is that here the hyperparameters were optimized under the five inner folds' CV strategy, so there is a different set of hyperparameters for each fold (e.g., Fig. 16) as opposed to the single set of hyperparameters used in Fig. 9 (Table III).

The conclusions are given as follows.

1) Joining the feature groups together shows only a slight improvement.

2) The ROC curves showed that the SVM model achieved

3) The feature importance plots from the RF model showed that the PWV predictor is the most important one (72%), followed by surface pressure (27%) and DoY (<1%). An hourly breakdown of the PWV predictor shows a major peak from 2 to 6 h prior to a flood, with two smaller peaks at 14 and 19 h prior to a flood.