Detailed Leak Localization in Water Distribution Networks Using Random Forest Classifier and Pipe Segmentation

In this paper, a Random Forest classifier was used to predict leak locations for two differently sized water distribution networks based on pressure sensor measurements. The prediction model is trained on simulated leak scenarios with randomly chosen parameters - leak location, leak size, and base node demand uncertainty. Leak localization methods found in literature that rely on numerical simulations can only predict network nodes as leak nodes; however, since a leak can occur at any point along a pipe segment, additional spatial discretization of suspect pipe is proposed in this paper. It was observed that pipe segmentation of the whole network is a non-feasible approach since it rapidly increases the number of potential leak locations, consequently increasing the complexity of the prediction model. Therefore, a novel approach is proposed, in which a prediction model is trained on scenarios with leaks occurring in original network nodes only, but with its accuracy assessed against pressure sensor measurements from scenarios in which leaks occur in points between network nodes. It was observed that this approach can successfully narrow down the suspect leak area and, followed by additional segmentation of that network area and subsequent prediction, a precise leak localization can be achieved. The proposed approach enables incorporation of various uncertainties by simulating leak scenarios under different conditions. Investigation of leak size uncertainty and base demand variation showed that several different scenarios can produce similar sensor measurements which makes it difficult to unambiguously determine leak location using the prediction model. Therefore, future approaches of coupling prediction modeling with optimization methods are proposed.


I. INTRODUCTION
Leaks in water distribution networks can cause considerable losses, especially in older networks where considerable investments are needed for restoration. Smaller leaks can remain undetected for longer periods, causing considerable water losses over time. Also, in older water distribution networks, rapid progression of leak size can eventually cause a pipe burst, which leads to water outages for end users. Therefore, a number of different techniques are being used to detect and localize leaks. These methods can be divided into hardware-based and software-based methods. Hardware-based methods use in situ visual observations or measurements. Software-based methods rely on different software for leak detection analysis. Since some methods have been developed for specialized applications depending on the transported fluid (water, oil, gas, etc.) and location of the pipeline (water distribution network, facility, housing, etc.), a number of papers have analyzed the advantages and limitations of the proposed methods; an overview of some of these methods is given in [1]-[5].

The associate editor coordinating the review of this manuscript and approving it for publication was Ahmed Farouk.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Software-based methods can be further divided into transient-based methods, model-based methods, and data-driven methods. Transient-based methods rely on analysis of the transient pressure wave that occurs when leakage happens. In model-based methods, pressure values estimated from a leak-free simulation are compared with pressure values measured in the field, i.e. the measured values are subtracted from the estimated values. The obtained residuals are evaluated, and if they are above a chosen threshold, a leak is considered to be present. Data-driven methods rely on statistical analysis and processing of raw sensor measurement data to obtain information about the presence of leaks and possible locations.
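The residual test used by model-based methods can be illustrated with a minimal sketch. The function name, the pressure values, and the threshold below are hypothetical and serve only to show the comparison of leak-free simulated pressures with field measurements.

```python
def detect_leak(simulated, measured, threshold):
    """Flag a leak when any residual (simulated - measured) exceeds the threshold."""
    residuals = [s - m for s, m in zip(simulated, measured)]
    return any(r > threshold for r in residuals), residuals


# Hypothetical values: leak-free simulated pressures and field measurements (m).
leak_free = [52.0, 48.5, 45.1]
observed = [51.9, 47.2, 44.0]

leak_present, residuals = detect_leak(leak_free, observed, threshold=0.5)
```

In this toy case the second and third sensors show a pressure drop larger than the assumed threshold, so a leak would be reported.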
The main problem with the model-based approach is the assumption of the model being a good representation of the network. Water distribution networks have a lot of uncertainties that need to be taken into consideration, such as demand uncertainties, sensor measurement imperfections, pipe diameter uncertainties, etc. Thus, the model-based approach cannot capture all these parameters. The data-driven approach using raw sensor measurements could incorporate all these variations, but the main problem is the number of leak events which are rather sparse. Since the amount of data is small compared to the amount of data needed for efficiently employing machine learning algorithms, models can be advanced by incorporating uncertainties through simulations with varying parameters which can produce additional data.
Machine learning has been used for a variety of water distribution system applications. Prediction of failure of water mains was investigated in [6] where artificial neural network (ANN), ridge regression, and ensemble decision tree were used. Different machine learning algorithms have been explored for the prediction of leak locations in pipelines, such as convolutional neural network (CNN) [7], [8] and ANN [9], [10]. In [11] support vector machine (SVM) method was used to predict leaks in wall-mounted pipelines.
When considering water distribution networks, in [12], a deep learning model based on additional pressure meters installed on optimal places was used to identify pipe burst locations. In [13], SVM was used for prediction of leak size and location based on pressure sensors gathered from EPANET simulations for small size leakages. In [14], leakage detection was conducted for 1500 m × 1500 m experimental network using principal component analysis (PCA) and SVM. In [15], model-based method was used to identify leak event and data-driven approach using k-Nearest Neighbors (k-NN) classifier was used in the second stage to determine leak location. In the further study [16] Bayesian classifier was used with improved localization accuracy. Both methods were applied to real water distribution network case studies. In [17], unsupervised principal component analysis (PCA) approach for leak detection was conducted for the Hanoi distribution network. In [18], Kriging method was used to estimate pressure measurements in the whole network based on the limited number of sensor measurements and classification methods were used to determine leak location. It was shown that the accuracy of the proposed method was very low for some sensor layouts due to Kriging interpolation error. In [19], detection and localization of multiple leak locations were explored. SVM was used as a classifier for leak detection using the residual method and a statistical method was used for leak localization in the Hanoi network.
All mentioned papers assume possible leak locations only in network nodes.
To increase the amount of input data, it was proposed in previous work [20] that a great number of leak scenarios can be generated by simulating different leak locations and leak sizes under different demand uncertainties. The machine learning approach for leak localization was investigated for variously sized water distribution networks, various demand ranges, and various sensor placements. However, a considerable simplification was made in that the prediction model was trained with simulated scenarios in which leaks occur only in network nodes, while in reality leaks can occur at any point along a pipe segment. Thus, in this paper, an approach with pipe segmentation in suspect areas is investigated. The idea is taken from the adaptive mesh refinement approach used in computational fluid dynamics (CFD) simulations, where the area of interest is refined with additional numerical nodes in order to increase the accuracy of results. An alternative approach of fault zone identification has been used in [21] and [22]. However, that approach could be problematic for leak locations at the borders of leak zones, since the water distribution network needs to be divided into zones before the leak localization method is applied. The approach proposed in this paper identifies suspect nodes from the machine learning prediction model, which then serve as indicators for pipes that need to be further explored using segmentation. Therefore, the possible leak area is adjusted for each leak event based on prediction results.
In the first part of this paper, it is investigated whether a prediction model trained only on simulations with leak locations in network nodes can successfully predict leaks that occur in-between network nodes. Two differently-sized water distribution networks, Hanoi and Net3 were used for this, coupled with various sensor layouts, leak sizes, and demands. Furthermore, the accuracy of sequential prediction models in predicting leak location was investigated. The prediction model performance is investigated when several most-suspect nodes are considered and segmentation of pipes near those suspect nodes is performed. The subsequent prediction model is trained on scenarios with leak locations in most-suspect network nodes and in nodes added through pipe segmentation from the previous stage. Limitations of the proposed method and future work are presented in the discussion section.

A. PROBLEM STATEMENT
Leak localization methods based on machine learning require a considerable amount of data for model training. Since the measurements for real leak events are rather sparse, additional data can be obtained by simulating different leak scenarios. For this purpose, leak scenarios were simulated using EPANET version 2.0.12 [23] with various leak scenario parameters. Leak location, leak size, and node demands were chosen randomly to cover a wide range of possible leak events. Typically it is assumed that water distribution network models are calibrated and that leaks can occur only in network nodes. The latter assumption can be problematic for water distribution networks with longer pipe segments, since the localization will be only a very rough estimate. Therefore, additional pipe segmentation is introduced, which divides a pipe into smaller sections and allows better leak localization. The Random Forest machine learning algorithm is trained with pressure sensor measurements from simulated scenarios and is then employed to determine the most suspect leak locations.

B. WATER DISTRIBUTION NETWORKS
The investigated water distribution networks are the small-sized Hanoi network and the medium-sized Net3 network. The Hanoi (Vietnam) network with 31 nodes was obtained from The Centre for Water Systems (CWS) at the University of Exeter [24]. For the Hanoi network, demand patterns as described in [17] are adopted (Figure 1). The Net3 network is an EPANET example network for a dual-source system that changes over time, consisting of 92 nodes. For both networks, the simulation time was 24 h, the hydraulic time step was 10 min, and the report time step was 1 h. To generate a wide range of possible leak scenarios, the emitter coefficient and leak location were chosen randomly. Additionally, to incorporate demand variation, it was randomly decided whether the base demand of a node was to be changed or not. If it was chosen to be changed, the base demand was increased or decreased by a randomly chosen percentage in the range ±2.5% or ±5%.
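The random choice of scenario parameters described above can be sketched as follows. This is only an illustration under stated assumptions: the uniform sampling, the 50% probability of perturbing a node demand, the node names, and the function name `sample_scenario` are not specified in the text, and the hydraulic simulation itself (performed with EPANET in this work) is not reproduced here.

```python
import random


def sample_scenario(nodes, base_demands, coeff_range=(10.0, 15.0),
                    variation=0.025, rng=random):
    """Draw one leak scenario: leak node, emitter coefficient, perturbed demands."""
    leak_node = rng.choice(nodes)
    emitter = rng.uniform(*coeff_range)  # assumed uniform over the range
    demands = {}
    for node in nodes:
        demand = base_demands[node]
        if rng.random() < 0.5:  # assumed 50% chance that a node demand is changed
            demand *= 1.0 + rng.uniform(-variation, variation)
        demands[node] = demand
    return {"leak_node": leak_node, "emitter": emitter, "demands": demands}


# Hypothetical three-node network with base demands in arbitrary units.
nodes = ["N1", "N2", "N3"]
base_demands = {"N1": 100.0, "N2": 250.0, "N3": 80.0}
scenario = sample_scenario(nodes, base_demands, rng=random.Random(0))
```

Each sampled dictionary would then be written into the network model and simulated to obtain the sensor pressures for one training record.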
For each water distribution network, two different sensor layouts were considered. For the Hanoi network, the first layout has two sensors located in network nodes 14 and 30, as given in [15], and the second layout has three sensors located in network nodes 8, 20, and 31, as given in [17] (Figure 2). For the Net3 network, the first layout has four sensors located in network nodes 117, 143, 181, and 213, and the second layout has two sensors located in network nodes 117 and 181 (Figure 3).

C. PIPE SEGMENTATION
Discretization of water distribution network pipes was achieved by inserting additional network nodes, where each pipe was split into 5 segments of equal length, resulting in 4 additional nodes for each pipe (Figure 2). Although it would be more beneficial to define a fixed segment length, a fixed number of segments was used as a methodological simplification.
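A minimal sketch of this segmentation step is given below. It assumes, for illustration only, that a pipe is a straight line between two node coordinates; in a hydraulic model the pipe length would also be divided equally among the new sub-pipes. The function name and coordinates are hypothetical.

```python
def segment_pipe(start, end, n_segments=5):
    """Return coordinates of the interior nodes that split a pipe into equal segments."""
    (x0, y0), (x1, y1) = start, end
    return [
        (x0 + (x1 - x0) * k / n_segments, y0 + (y1 - y0) * k / n_segments)
        for k in range(1, n_segments)
    ]


# Hypothetical pipe between two nodes; 5 segments give 4 inserted nodes.
new_nodes = segment_pipe((0.0, 0.0), (100.0, 50.0))
```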
To investigate machine learning efficiency in the localization of leak locations in pipe segments, three different models were analyzed. Model 1 was trained and tested on leak scenarios with leak locations in original network nodes. Model 2 was trained and tested on leak scenarios with leaks located both in network nodes and refinement nodes, resulting in a significantly increased number of ML output classes. Finally, Model 3 was trained on scenarios with leaks in original network nodes, but it was then tested for scenarios in which leak locations can be both in network nodes and refinement nodes.
Flowcharts of the proposed models can be observed in the corresponding figure. Possible leak locations are denoted $N_i \in \{N^o_0, \ldots, N^o_{no}, N^s_0, \ldots, N^s_{ns}\}$, where superscript $o$ denotes original network nodes, superscript $s$ denotes segmentation nodes, subscript $no$ denotes the total number of original network nodes, and $ns$ the total number of segmentation nodes. The sensor measurements $S_i \in \{S_0(t), \ldots, S_n(t)\}$, where $n$ indicates the total number of sensors for the considered sensor layout, were recorded through time $t$, namely 25 timesteps in all considered cases. Since Model 3 is trained only on the original network nodes, it cannot possibly predict a refinement node. Thus, the refinement nodes are considered to be predicted correctly if their nearest original network node $N_i \in \{N^o_0, \ldots, N^o_{no}\}$ was predicted. This simulates the most realistic scenario, in which leaks can occur anywhere along a pipe segment but the model can be trained only on scenarios with leaks in the network nodes present in the model.
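The scoring rule for Model 3 can be sketched as below. The node labels, the `nearest_original` lookup, and the function name are hypothetical; the sketch only illustrates that a leak in a refinement node counts as correctly predicted when its nearest original node is predicted.

```python
def model3_accuracy(predicted, true_nodes, nearest_original):
    """Share of scenarios in which the predicted original node equals the true
    node, or the original node nearest to the true refinement node."""
    hits = 0
    for pred, true in zip(predicted, true_nodes):
        # Original nodes are absent from the lookup and map to themselves.
        target = nearest_original.get(true, true)
        hits += pred == target
    return hits / len(predicted)


# Hypothetical pipe N1-N2 with four refinement nodes s1..s4.
nearest_original = {"P1_s1": "N1", "P1_s2": "N1", "P1_s3": "N2", "P1_s4": "N2"}
predicted = ["N1", "N2", "N1", "N2"]
true_nodes = ["P1_s1", "P1_s3", "P1_s4", "N2"]
accuracy = model3_accuracy(predicted, true_nodes, nearest_original)
```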

D. RANDOM FOREST CLASSIFIER
Machine learning (ML) algorithms are being used to find underlying correlations or patterns from obtained data. This ability enables machine learning algorithms to provide a prediction for unseen data, which can be categorized into regression and classification problems. Regression algorithms are designed to provide a prediction of the exact value of the output variable, while classification algorithms separate data into logical groups, i.e. classes.
Random Forest classifier was first proposed by [25] and is an ensemble type of algorithm based on multiple decision trees which are created as independent prediction models. Decision trees (DT) are constructed in a form of flowchart structure, where nodes represent attributes used for outcome prediction. Based on feature values a decision is made at each node and ultimately based on these decisions classification is reached. Each tree is defined with tree depth parameter which defines how many splits can be made before making a prediction. Random Forest uses bootstrap and aggregation methods to obtain unique data subsets for the training of each decision tree and to ultimately count the class with the most predictions. Increased number of trees increases the precision of the classifier, albeit also increasing its complexity. The problem considered in this paper is the classification problem since each potential leak node represents one class, thus Random Forest classifier was adopted as a suitable ML method. Random Forest classifier implementation in the Python library Scikit-learn [26] version 0.20.3 was used.
The dataset is composed of 500 000 inputs, with training-testing split 70%-30%, resulting in 350 000 training records and 150 000 testing records. It was observed in [20] that a smaller timestep only slightly increases prediction accuracy so timestep of 1 hr was adopted in order to reduce number of features and reduce computational time.
Grid search optimization of Random Forest parameters was conducted for the Hanoi network with 100 000 inputs, with leak coefficient range 10 to 15 and with ±2.5% demand variation, in order to find the optimal number of estimators (trees), maximum depth, and minimum number of samples required to split an internal node. It was found that the optimal minimum number of samples required to split an internal node is 2, the optimal maximum depth of the tree is 20, and the optimal number of estimators (i.e. trees) is 200. These parameters are kept constant for all investigated prediction models. Other Random Forest parameters were kept at the default values of the Scikit-learn implementation. For each prediction model, five runs were conducted to consider the influence of prediction model parameter randomness, and average accuracy values are reported. Additionally, model accuracy was measured for true leak node presence in the top 3 and top 5 suspect network nodes with the greatest prediction certainties. Even if the true leak node is not correctly predicted, its presence among the top 3 or top 5 most suspect nodes considerably narrows down the area of leak location.
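The top 3 and top 5 accuracy measures can be sketched in plain Python as follows. The sketch assumes that per-class certainties are available for each record, for example from the `predict_proba` method of a Scikit-learn `RandomForestClassifier`; the class labels and probability values shown are hypothetical.

```python
def top_k_accuracy(probabilities, true_labels, classes, k):
    """Fraction of records whose true class is among the k most certain classes."""
    hits = 0
    for probs, true in zip(probabilities, true_labels):
        order = sorted(range(len(classes)), key=lambda i: probs[i], reverse=True)
        top = [classes[i] for i in order[:k]]
        hits += true in top
    return hits / len(true_labels)


# Hypothetical certainties for three records over four candidate leak nodes.
classes = ["N1", "N2", "N3", "N4"]
probabilities = [
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.35, 0.20, 0.05],
]
true_labels = ["N2", "N4", "N4"]

top1 = top_k_accuracy(probabilities, true_labels, classes, k=1)
top3 = top_k_accuracy(probabilities, true_labels, classes, k=3)
```

With `k=1` the measure reduces to ordinary classification accuracy, which is why the top 3 and top 5 values can never be lower than the exact-node accuracy.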

A. EFFECT OF PIPE SEGMENTATION
The Hanoi network with two sensors, emitter coefficient range 10 to 15, and different demand variations was investigated first. In Model 1, where leaks can occur only in the original network nodes, 31 prediction classes were obtained. For Model 2, each pipe is divided into 5 segments of equal length, resulting in 163 prediction classes. Although Model 3 was used for predicting leak scenarios on the segmented network of Model 2, it was trained on the leak scenarios used for Model 1. Thus, in Model 3 the 31 prediction classes corresponding to the original network nodes were used, with leaks in the 135 segmentation nodes expected to be classified as leaks in their nearest original network nodes.
Results for the conducted investigation are presented in Table 1. It can be observed that with the increase of demand variation, model accuracy considerably decreases, indicating a rapid increase in the number of possible scenarios, which are consequently difficult to predict. However, when the top 3 and top 5 suspect network nodes with the greatest certainties are considered, model accuracy is high. For Model 2, where 163 network nodes are possible prediction classes, model accuracy is very low, indicating that for larger networks this approach would require even more data and computational resources, which is currently not feasible. Model 3 accuracy is reduced compared to the Model 1 approach, which is expected, as segmentation nodes increase the total number of possible leak locations. Furthermore, a leak at a segmentation node in the middle of a pipe could produce flow patterns that are equally similar to those produced by a leak at either end node of that pipe, which also contributes to reduced accuracy. However, when the top 3 and top 5 suspect nodes are considered, the difference in prediction accuracy between Model 1 and Model 3 shrinks to only a few percentage points. Although the proposed ML approach demonstrates modest accuracy in predicting the exact leak locations, it can be successfully used to narrow down the leak location.
The same investigation was conducted for the Net3 network with 4 sensors, emitter coefficient range 10 to 15, and different demand variation ranges. Model 1 and Model 3 are created with 92 classes, while Model 2 was again created with 5 segments per pipe, resulting in 544 classes altogether.
Results for Net3 are reported in Table 2. It can be observed that prediction model accuracy for the Net3 network is significantly lower than for the Hanoi network. For a model with no demand variation, it is around 7% lower than for the Hanoi network and with an increase in demand variation this decline is over 20%. This is expected, since the Net3 network has a greater number of network nodes and consequently a greater number of possible leak locations. Model 2 accuracy is very small, especially for the strongest variation of demand, as it was observed for the Hanoi network, confirming this approach is not feasible. However, although Model 3 has reduced accuracy when compared with Model 1, when considering top 3 and top 5 nodes the accuracy of Model 3 comes very close to the accuracy of Model 1, indicating that the Model 3 approach could be successfully used in a real leak scenario.
Considering these results, only Model 1 and Model 3 will be considered in further research.

B. SENSOR AND EMITTER COEFFICIENT INFLUENCE
The investigation was conducted for various sensor placements, numbers of sensors, and emitter coefficient ranges. The results for the Hanoi network are presented in Table 3. It can be observed that overall prediction model accuracy decreases with a greater coefficient range. This is expected, since a greater coefficient range increases the size of the problem solution space. On the other hand, with a greater number of sensors, prediction model accuracy slightly increases. Additionally, the greatest difference between Model 1 and Model 3 accuracy appears for scenarios with no demand variation, ranging from 15% to 19%. However, as demand variation increases, the accuracy difference falls to 8% to 12%.
The results for Net3 network are presented in Table 4. Same as in the Hanoi network case, with a greater range of emitter coefficient both Model 1 and Model 3 accuracy decrease, for both sensor layouts. Same as in the Hanoi case, as demand variation increases the difference between Model 1 and Model 3 accuracy decreases and again the greatest difference in model accuracy is for no demand variation.

C. PIPE SEGMENT SEGMENTATION INFLUENCE
In order to investigate pipe segmentation influence in the Model 3 approach, three different discretizations are considered for the Net3 network with 4 sensors. Pipes were divided into 3, 5, and 11 segments, resulting in 318, 544, and 1222 possible leak locations, respectively. The results are presented in Table 5. It can be observed that a finer network segmentation slightly reduces model accuracy, which is entirely expected since the number of prediction classes rises with greater refinement. Also, it is expected that at some point further refinement would lead to scenarios with different leak nodes but almost identical pressure readings, since these nodes may happen to be situated very close to each other. However, the rather small decline in accuracy indicates that the proposed approach can be successfully used to narrow down a leak location.
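The reported numbers of possible leak locations follow from adding one fewer node than the number of segments to every pipe. The check below assumes 113 segmented pipes; this value is inferred from the reported counts (318, 544, and 1222 with 92 original nodes) and is not stated explicitly in the text.

```python
def n_classes(n_nodes, n_pipes, n_segments):
    """Original nodes plus (n_segments - 1) inserted nodes per pipe."""
    return n_nodes + n_pipes * (n_segments - 1)


# 113 pipes is an inferred value, chosen so that the reported counts are reproduced.
counts = [n_classes(92, 113, s) for s in (3, 5, 11)]
```

The number of classes therefore grows linearly with the number of segments per pipe, which illustrates why training on a fully segmented network quickly becomes impractical.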

D. ACCURACY IMPROVEMENT
The number of top suspect nodes which need to be considered to achieve 99% model accuracy was investigated to increase accuracy of the prediction model. This approach was already used in [27] to localize the source of pollution and similarly in [13] where the correlation between accuracy and distance between predicted and actual leak node was presented. In this way, a considerable number of network nodes is eliminated, thus the leak area can be localized with considerable certainty even for sparse sensor placement and greatest demand variation.
The number of needed top nodes for the Hanoi network is presented in Table 6. It can be observed for Model 1 that with the increase in demand variation, a greater number of top nodes needs to be considered to achieve 99% accuracy; however, considerable localization is achieved even for the strongest demand variation. Similar behavior can be observed with Model 3, where the greatest number of top nodes needs to be considered for the greatest demand variation. Also, the number of top nodes is slightly greater than for Model 1, which is expected. In Figure 5, the increase of model accuracy with the increase of considered top nodes is illustrated. It can be observed that for all models an accuracy of 90% is surpassed when using only the top 4 nodes. Additionally, a rapid increase in prediction model accuracy is observed when including the first several top nodes. However, after some threshold, additional nodes in the top list only slightly improve the overall model accuracy.
This kind of investigation has also been conducted for the Net3 network, and the results are presented in Table 7. The number of top nodes is greatest for Model 3 and for stronger demand variation, which is expected and consistent with the Hanoi results. It must be noted that even for the worst performing model, with emitter coefficient range 5 to 15 and demand variation ±5%, 32 top nodes represent only 35% of all network nodes, which is still a considerable localization. Additionally, it must be taken into consideration that the chosen 99% accuracy threshold is very high, and that strong model accuracy is already achieved with a smaller number of top nodes (Figure 6). To further evaluate the proposed model, the sequential prediction modeling approach is evaluated in the next section.
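The number of top nodes needed to reach a target accuracy, as reported in this subsection, can be computed from the rank of the true leak node in each record. The sketch below is one possible way of doing so; the function name and the probability values are hypothetical.

```python
def nodes_needed(probabilities, true_indices, target=0.99):
    """Smallest k for which the top-k accuracy reaches the target."""
    ranks = []
    for probs, true in zip(probabilities, true_indices):
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        ranks.append(order.index(true) + 1)  # rank 1 = most certain class
    n = len(ranks)
    for k in range(1, max(ranks) + 1):
        if sum(r <= k for r in ranks) / n >= target:
            return k


# Hypothetical certainties for four records over three candidate nodes.
probabilities = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.3, 0.2],
]
true_indices = [0, 1, 1, 2]

k_75 = nodes_needed(probabilities, true_indices, target=0.75)
k_99 = nodes_needed(probabilities, true_indices, target=0.99)
```

For the worst performing Net3 case mentioned above, the corresponding share of the network is 32 / 92, i.e. roughly 35% of all nodes.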

E. REALISTIC SCENARIO TESTING
To further evaluate the proposed ML approach, an investigation was conducted for a simulated case on the Net3 network with 30 records which represent 30 different days. Scenarios are generated with fixed leak location and leak coefficient, but with different demands in network nodes obtained through base demand variation of ±2.5%. Two different leak locations were chosen: the first with the leak located in network node 159 (Figure 7) and the emitter coefficient set to 10, and the second with the leak located in a pipe segment between nodes 205 and 207 (Figure 8) and the emitter coefficient set to 15. The initial prediction was made using Model 1 with emitter coefficient range 10 to 15 and base demand variation of ±2.5%. From the previous investigation (Table 7) it was observed that when leak locations in pipe segment nodes are allowed, the top 12 nodes achieve 99% accuracy; thus, the 12 nodes with the greatest prediction model certainty are considered for further segmentation and secondary Model 3 predictions.
For each of the 30 records different certainties are obtained, i.e. the top 12 nodes could be different for each record. Therefore, the average value of all 30 certainties for each node was chosen as a measure for choosing the top 12 nodes with the greatest certainty. For leak node 159, the greatest model certainty is obtained for the true leak location, whereas for the leak in the pipe segment between nodes 205 and 207 the greatest certainty is obtained for node 207, which is an edge node of the considered pipe segment. For the next stage, additional pipe segmentation was performed around these top 12 nodes and a prediction model was created in which the possible leak locations were the top 12 network nodes plus the newly inserted nodes. At this stage, for leak location 159, the most suspect node was node 60, and the second candidate was node 159, which is the true leak node. For the leak location in the pipe segment between nodes 205 and 207, the most suspect node was the node right next to the true leak node, and the second candidate was the true leak node. The top 3 most suspect nodes for both considered cases can be observed in Figures 9 and 10.
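The averaging of certainties over daily records can be sketched as follows. The certainty values are hypothetical and only three records and three nodes are shown; in the case described above, 30 records would be averaged and the 12 nodes with the greatest mean certainty retained.

```python
def top_suspects(certainties, k=12):
    """Rank nodes by certainty averaged over all records and return the top k."""
    nodes = certainties[0].keys()
    mean = {n: sum(rec[n] for rec in certainties) / len(certainties) for n in nodes}
    return sorted(mean, key=mean.get, reverse=True)[:k]


# Hypothetical per-record certainties (one dictionary per day).
records = [
    {"159": 0.40, "60": 0.35, "61": 0.25},
    {"159": 0.30, "60": 0.45, "61": 0.25},
    {"159": 0.50, "60": 0.30, "61": 0.20},
]
suspects = top_suspects(records, k=2)
```

Averaging smooths out the record-to-record variation in the ranking, so a node that is only occasionally the most certain one does not displace a node that is consistently among the top candidates.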
The third sequential prediction model was also trained on the top 12 nodes with the greatest average certainty from the previous stage. Both considered cases have the true leak location as the second most suspect node. Additionally, from Figures 9 and 10, variation in the top 3 most suspect nodes can be observed, showing that an unambiguous solution cannot be obtained. This indicates that different leak locations, demands, and emitter coefficients can still produce very similar pressure measurements. In other words, there are multiple solutions to the problem. It is shown that the prediction model can efficiently localize leak areas under sparse sensor placement for leaks which can occur anywhere in pipe segments. However, due to the wide range of leak scenarios used for prediction model training, a prediction model for fine localization may not be able to provide a single solution.

IV. DISCUSSION
It is shown that the proposed ML approach can be successfully used for localization of the leak area under demand uncertainty, for differently sized networks, and for different sensor placement layouts. An ML model for segmented network pipes was investigated to take into consideration that leaks can occur anywhere along a pipe, but it was shown to be an infeasible approach. Any pipe segmentation considerably increases the number of network nodes, i.e. the number of prediction classes, and consequently rapidly increases computational complexity. Additionally, a greater number of inputs is required, which is a considerable problem for larger networks. However, it seems that leaks that occur in pipe segments can be successfully localized with a prediction model trained only on scenarios generated for original network nodes, especially when several of the most suspect nodes are considered. It was also observed that regardless of pipe refinement, similar prediction accuracy can be obtained. However, as mentioned before, a simplification was made in which all pipes, regardless of their length, had the same number of divisions. Therefore, in future work, fixed segment lengths for additional refinement nodes should be investigated to further explore the presented approach and align the proposed technique with practical purposes.
Sequential prediction models were tested, where the first prediction model specified area for further segmentation, and subsequent models were used to find the exact leak location. It was observed that ML has a problem with detecting fine differences in leak scenarios; the true leak location was always in top nodes but was not always the node with the greatest model certainty. This can be explained by the fact that machine learning models need to cover a large span of scenarios (different demands, different leak sizes, etc.), thus it is reasonably expected that several equally good solutions exist. Similar observation was made in [15] where some leak locations were grouped in single classes, since distinction between locations could not be made.
In further research, coupling of ML and optimization methods needs to be explored. Genetic algorithm (GA) was explored in [28] for leak localization using the inverse transient method for a network with 7 nodes. The main problem with optimization methods in water distribution networks is the network node variable, which is a categorical variable and as such makes the optimization problem very complex and computationally demanding. However, if ML is used to localize a leak area, independent optimizations for suspect nodes can be conducted and thus reduce the optimization complexity. This was successfully applied in [29] where the pollution source was localized and independent optimizations were conducted to obtain a true pollution source. Additionally, if the optimization method is to be employed, network demands could be more carefully monitored for some period, for example from 2 to 3 AM as proposed in [13], to eliminate or reduce demand variation which is shown to considerably decrease prediction accuracy.
It must be noted that the Random Forest classifier was chosen due to its simplicity and since it allows for a reasonably reliable prediction without fine-tuning of method parameters. For example, in [30] the RF classifier outperformed SVM, ANN, k-NN, and DT for leak detection using acoustic signals; however, an extensive analysis of classifier parameters was not shown. In [31], six deep neural network structures and three RF classifiers were compared for source tracking of chemical leaks, and the best accuracy was achieved with an RF classifier. Additionally, in [32] Gradient Boosting, DT, RF, SVM, and ANN models were investigated for detection of leaks in natural gas pipelines, where models were tuned to ensure no false alarms. ANN and SVM showed the best performance; however, RF and DT were the most sensitive in detecting small leaks. Therefore, it can be concluded that other models such as ANN may outperform the Random Forest algorithm if fine-tuning of hyper-parameters is conducted. Novel ANN methods which deal with this ANN complexity are being developed, such as the quantum-inspired neural network Autonomous Perceptron Model [33], which showed better performance than other algorithms, including classic ANN and RF. Therefore, an extensive investigation of other machine learning algorithms should be conducted in future work to determine which classifier can provide the best model accuracy for the leak localization problem in water distribution networks. Dimensionality reduction methods should also be explored to reduce the number of features, consequently reducing prediction model complexity, which could be important for bigger water distribution networks.
The proposed methodology could provide real-time support in water distribution network surveillance. The prediction model can be prepared with incorporated demand uncertainties, and can therefore be continuously used to detect when a single leak location is repeatedly reported. However, future work should investigate the possibility of identification of multiple leak locations, which is also most often the case. Other uncertainties should also be incorporated, such as sensor measurement uncertainties and model uncertainties such as pipe diameter and pipe roughness. Ultimately, the proposed methodology should be tested on real-world water distribution network data where all these uncertainties are present.

V. CONCLUSION
In this paper, machine learning approach using big data obtained from computer simulations was investigated for leak localization in water distribution networks. In previous research, a simplification was made in which leaks were only occurring in network nodes and here the methodology is enhanced by allowing for leaks to occur anywhere on any network pipe. It was observed that global refinement of the network in which segmentation is performed on all pipes is not a feasible approach, since the number of potential leak locations rapidly increases and construction of a capable machine learning model is currently computationally too demanding.
However, only a small reduction in model accuracy is observed when the prediction model is trained exclusively on scenarios with leaks appearing in network nodes, while the prediction is then given for leak scenarios with leaks in pipe segments. Further investigation showed that this reduction in model accuracy can be compensated by considering several most suspect nodes. This approach significantly narrows down the leak area, especially if larger water distribution networks are considered. These observations indicate that the proposed approach could be applicable in real-world water distribution networks and further study of the proposed approach should be conducted.
In future research, additional model uncertainties regarding pipe roughness and pipe lengths should be included. Since it is observed that increasing demand uncertainty rapidly decreases model accuracy, an additional approach should also include dimensionality reduction of input data. Sequential prediction models were also explored, where further prediction models were trained using only leak scenarios for the most suspect leak nodes from the previous prediction model. This approach was shown not to be beneficial, since prediction models provide a generalized model and further leak localization needs a specific solution. Coupling the proposed methodology with an optimization procedure could provide better results, which should be explored in future work.

He is currently employed as a Project Engineer at Flowtech d.o.o., and he is also an External Associate with the Faculty of Engineering, University of Rijeka. His research interests include computational fluid dynamics and renewable energy.