Development of an Algorithm for the Estimation of Contamination Sources in a Water Distribution Network

The complexity of a water distribution network (WDN) exposes it to human interference, so that accidental or intentional attack is possible. Such attacks can contaminate water that has already been treated at the treatment plant and that is eventually consumed by the public. The use of contaminated water has serious public health and socioeconomic consequences. Identifying the source of contamination is therefore particularly important for decision makers, so that control strategies can be applied immediately to minimize the consequences of using contaminated water. WDN analysis problems are of two types: steady state and transient state. In order to detect continuous contamination that may be present in a WDN, this study considers the steady state condition. An approach for estimating the sources of contamination and the magnitude of the contaminant concentration is proposed. Given a set of measurements, and by applying the superposition technique, a model that relates the contaminant distribution to the measurements is formulated in order to estimate the sources of contamination, and an algorithm for solving it is developed. The application of the proposed model is demonstrated on a water network with multiple contamination injection sources. For a noise level of 5%, the coefficients of determination obtained for the three cases are 0.99894 (Case 1), 0.99937 (Case 2) and 0.99974 (Case 3), with corresponding root mean square errors of 0.000364, 0.000351 and 0.000299. The same parameters were also obtained at a noise level of 10%. The results verify the feasibility of the proposed novel approach, which can be applied to a larger water distribution network.


I. INTRODUCTION
It is evident that there is no life without water, particularly potable water. Potable water is therefore an essential commodity for human existence and sustainability. A water distribution network (WDN) is a multifaceted infrastructure that consists of hundreds of pipes, connected at junctions (nodes), and reservoirs, where human interference is possible. A WDN is therefore vulnerable to both intentional and accidental attacks. In addition, the sources of water and the treatment practices contribute significantly to water quality.
The associate editor coordinating the review of this manuscript and approving it for publication was Xujie Li.
Usually, water quality is examined at the treatment plants, but the water may deteriorate or become contaminated during transportation to the consumers [1]-[4]. Contaminants may intrude into a WDN through storage tanks, pipe leakages, repairs and maintenance, joints or nodes, and cross-connections.
The frequency of contamination events in drinking water systems constitutes a significant threat to societal stability and safety [5]. This has become a prime concern for water operators and consumers alike, and has emerged as a technical challenge that the water industry and researchers must address. Recently, Adedoja et al. [6] reported on the socio-economic and health consequences of consuming contaminated water, and noted the possibility of future occurrences.
In order to prevent and curtail the severe consequences of contamination events, the installation of water quality monitoring systems across the drinking water system has been recommended, particularly for contaminant detection [6]-[11]. In recent times, this has become necessary for drinking water security emergency strategies. Sensors installed at key nodes of the water network allow the water quality monitoring system to detect contamination events. However, the procurement, installation and maintenance costs of sensors restrict the number that can be deployed on large-scale water networks [12]. The need to address this issue has led to a new body of knowledge: the optimal sensor placement problem. The techniques of optimal sensor placement can be compared to the principles of leakage detection and of pressure control valve placement and management [13]-[15]. Different methodologies have been proposed to address the optimal sensor placement problem [6]-[11]. In spite of these notable efforts, no unified approach exists yet, and there are identified research gaps that require further attention [8].
Even though water monitoring sensors can detect contaminants, estimating or identifying the sources of contamination and their magnitude remains a topical, high-priority issue in this domain. Information about the sources of contamination is paramount for the effective management of contamination-related threats and for ensuring that immediate control measures are available to curtail the consequences for the public. Both accidental and intentional contamination in WDNs have severe consequences that must be addressed through a suitable contamination source identification (CSI) technique. CSI typically uses the information collected from the water quality monitoring system to compute the source location, the time of injection, and the injection magnitude. Many researchers have proposed methodologies to identify the sources of contamination. These methods can be categorized as: particle backtracking models [16], data mining [17], simulation-optimization models [18], probabilistic approaches [19], and others [20]-[24].
Early studies on CSI described the use of the particle backtracking approach. For instance, Shang et al. [16] proposed a particle backtracking method that treats contaminants as particles and tracks them backwards from the point of detection to the contamination source. Laird et al. [25] used this method to study contamination in a small-scale network and noted that it would be ineffective for a large water network. The particle backtracking algorithm described by De Sanctis et al. [26] for contamination source identification compared the information between water power and water quality. However, this method failed to account for inaccurate information from sensors. Similarly, Costa et al. [27] weighed up the sensor information in forward and reverse manners and inferred that sensor misinformation may affect the performance of the proposed algorithm.
Other authors discussed the use of machine learning algorithms [18], [28]. This technique was explored by Huang and McBean [17] in an effort to locate the contamination source. They reported a quick search for the probable contamination source, but with limited effectiveness when there are multiple sources. Perelman and Ostfeld [29] proposed a cluster-based method, which considered the connectivity of flow directions and then evaluated the probability through Bayes' theorem. Rather than identifying individual nodes, the method could only locate the cluster of contamination within the network. Wang and Jin [19] used the Markov Chain Monte Carlo (MCMC) model to select samples from the random data and then employed Bayes' theorem to calculate the probability of each node being a contamination source. This method incurred an excessive computational burden. In order to reduce this burden, Wang et al. [30] applied a hybrid technique comprising the MCMC model and vector regression. Nonetheless, the computational burden increases with the network size. Consequently, the authors concluded that improvement and further investigation are needed.
Some researchers have also treated CSI as a simulation-optimization problem. For example, Guan et al. [31] employed a non-linear simulation-optimization method to solve the source identification problem by continuously reading sensor data to locate the contamination source. An adaptive dynamic optimization model based on an evolutionary algorithm has also been described [32]. The authors reported that an optimal solution is possible, but that the method is limited by its slow convergence speed. Yan et al. [33] proposed a cultural algorithm to address this concern and demonstrated the technique on three water networks. The results showed the capability of the method, but computational stress and other uncertainties (e.g. stochastic demands) were its setbacks. Yan et al. [34] applied a hybrid encoding method in an effort to improve the accuracy and the convergence speed. In addition, in order to address the excessive computational stress, Laird et al. [35] formulated a dynamic optimisation approach based on a sub-domain model. However, the exclusion of some key information during the selection of the sub-domain made this method less effective.
The use of hybrid methods has also been described. In this regard, Preis and Ostfeld [36] used an integrated model tree-linear programming approach, in which the EPANET tool developed by Rossman [37] was incorporated. A combination of a genetic algorithm (GA) with EPANET was presented by the same authors [38]. This method has an excessive computational burden that even required the use of parallel computing. Tao et al. [39] employed a probabilistic approach based on information from consumer complaints. However, this information failed to account for the variation in water demands, node junctions, etc., which makes the technique unreliable. The use of an artificial neural network (ANN) was also reported by Kim et al. [40]. Other methods, such as Bayesian sequential techniques and evolutionary computation, have also been reported to be relevant [20], [41].
VOLUME 8, 2020

MOTIVATION AND CONTRIBUTION
In spite of the numerous efforts to address this challenge, none has yet considered integrating contaminants into the hydraulic analysis, which accounts for the stochastic demands and the pressure drops found in real water network systems. Besides, the excessive computation remains an unsolved concern that must be addressed. These are notable areas that should be explored in order to mitigate the technical difficulties. The water distribution network analysis problem may be of two types: steady state and transient state conditions [42], [43]. As part of the effort to resolve some of these challenges, the authors have previously developed a model for computing the distribution of a contaminant from one or more sources, together with an algorithm for solving the model. The feasibility of that approach was demonstrated on four water networks, with satisfactory results. Building on this, and for the first time, this study proposes an approach for estimating the sources of contamination in a water distribution network under steady state conditions. Given a set of measurements, and by the superposition technique, a model that relates the distribution of the contaminant to the measurements is formulated. Thereafter, an algorithm for estimating the sources of the contamination and the magnitude of the contaminant is presented. The remainder of this paper is structured as follows: Section II discusses the proposed concentration estimation model, while the detection of the contamination source is described in Section III. In Section IV, the application of the proposed model is presented. The results are discussed in Section V. Conclusions and future work are provided in Section VI.

II. CONCENTRATION ESTIMATION MODEL
This section presents the proposed concentration estimation model. Given a set of contamination source nodes, and assuming that the contaminant injection is constant, a model is developed that relates the concentration of the contaminant at each source to the measured contamination at various nodes in the network. An in-depth analysis of the related previous work can be found in Adedoja et al. [1]. In addition, an unpublished internal note by Hamam [2] formed a solid background to this formulation.
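Consider a unit contamination concentration $\beta_i = 1$ at a source node $i$ only, and compute the contamination concentration vector $\Gamma^i$, where $\Gamma^i_k$ is the concentration at node $k$ due to the injection $\beta_i = 1$ at node $i$. Given a vector $\beta$ of injections at the set of nodes $I$, and owing to the linearity of the distribution of the contamination once the network has been solved, it is possible to write:

$$\delta = \sum_{i \in I} \beta_i \Gamma^i \tag{1}$$

In matrix form, and considering the matrix $\Gamma$ with columns $\Gamma^i$, $\forall i \in I$,

$$\Gamma = [\Gamma^i]_{i \in I} \tag{2}$$

Equation (1) becomes

$$\delta = \Gamma \beta \tag{3}$$

Given a set of measurements $\bar{\delta}_m$, $\forall m \in M$, and a matrix $M$ such that

$$\bar{\delta} = M \delta \tag{4}$$

By combining Equations (3) and (4), we have

$$\bar{\delta} = M \Gamma \beta \tag{5}$$

Let $\bar{\Gamma} = M \Gamma$. Note that the matrix $\bar{\Gamma}$ is obtained by selecting the set of lines $M$ of the matrix $\Gamma$ of computed contamination concentrations at the nodes, i.e. selecting the lines corresponding to the measured nodes. Then,

$$\bar{\delta} = \bar{\Gamma} \beta \tag{6}$$

Applying a least squares approximation to Equation (6) yields

$$\hat{\beta} = (\bar{\Gamma}^T \bar{\Gamma})^{-1} \bar{\Gamma}^T \bar{\delta} \tag{7}$$

where $\hat{\beta}$ is the estimated value of the contamination concentrations at the injection nodes.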
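The estimation step of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports using MATLAB). The network size, the node indices and the random matrix `Gamma` standing in for the unit-injection responses are all assumptions made for illustration; in practice those columns would come from the steady state network solution.

```python
import numpy as np

rng = np.random.default_rng(0)

n_nodes = 8                  # hypothetical network size
sources = [0, 3, 5]          # hypothetical candidate injection nodes (set I)
measured = [1, 2, 4, 6, 7]   # hypothetical measured nodes (set M)

# Placeholder unit-injection responses: column j holds the concentration at
# every node for a unit injection at sources[j]. Random values are used here
# only so that the sketch runs without a hydraulic solver.
Gamma = rng.uniform(0.0, 1.0, size=(n_nodes, len(sources)))

beta_true = np.array([0.04, 0.02, 0.01])   # assumed injected concentrations
delta = Gamma @ beta_true                  # superposition of the unit responses

# Keep only the rows that correspond to measured nodes.
Gamma_bar = Gamma[measured, :]
delta_bar = delta[measured]                # noise-free measurements

# Least squares estimate of the injection concentrations.
beta_hat, *_ = np.linalg.lstsq(Gamma_bar, delta_bar, rcond=None)
```

With noise-free measurements and more measured nodes than candidate sources, the estimate reproduces the injected values up to rounding. `np.linalg.lstsq` is used instead of forming the normal equations explicitly because it is numerically more robust.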

III. DETECTING THE CONTAMINATION SOURCE
In this section, an algorithm for finding the most probable source of contamination is proposed. The simplest contamination detection problem is the one where the contamination is at one node only. If the contamination is at more than one node, then the contamination in the network is represented according to the expression given in Equation (7). At first, it is assumed that there is only one source of contamination; multiple sources of contamination are then considered. It is important to state that this work considers a steady state condition, which corresponds to the case where a continuous contamination injected into a network needs to be detected.

A. SINGLE CONTAMINATION SOURCE
In this case, the contamination is at one node only, and the objective is to determine the most probable node of contamination. Suppose that the contamination is at node $i$. If a concentration $\beta_i$ is injected at this node, then Equation (6) becomes Equation (8):

$$\bar{\delta} = \bar{\Gamma}^i \beta_i \tag{8}$$

where $\bar{\Gamma}^i$ is column $i$ of the matrix $\bar{\Gamma}$. Therefore, Equation (7) becomes Equation (9):

$$\hat{\beta}_i = \frac{(\bar{\Gamma}^i)^T \bar{\delta}}{(\bar{\Gamma}^i)^T \bar{\Gamma}^i} \tag{9}$$
Equation (9) gives the estimated value of the contaminant injection at node $i$. The sum of the squared errors is then computed from this injection and the values at the measured nodes. By repeating this process for all the possible injection nodes and comparing the resulting errors, the most probable injection node is obtained. This process may be complemented by computing the standard deviation of the estimated value for each injection, which improves confidence in the choice.
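The single-source search described above can be sketched as follows. This is an illustrative Python version under stated assumptions: the function name `single_source_search` is hypothetical, the response matrix is random placeholder data, and every candidate column is assumed to be non-zero.

```python
import numpy as np

def single_source_search(Gamma_bar, delta_bar):
    """Fit each candidate node alone and return the one with the lowest SSE.

    Gamma_bar: (measured nodes x candidate nodes) unit-injection responses.
    delta_bar: measured concentrations.
    Assumes every column of Gamma_bar is non-zero.
    """
    n_candidates = Gamma_bar.shape[1]
    beta_hat = np.empty(n_candidates)
    sse = np.empty(n_candidates)
    for i in range(n_candidates):
        col = Gamma_bar[:, i]
        # One-column least squares estimate of the injection at candidate i.
        beta_hat[i] = (col @ delta_bar) / (col @ col)
        residual = delta_bar - beta_hat[i] * col
        sse[i] = residual @ residual
    return int(np.argmin(sse)), beta_hat, sse

# Placeholder data: the true source is candidate 2 with injection 0.04.
rng = np.random.default_rng(1)
Gamma_bar = rng.uniform(0.0, 1.0, size=(6, 4))
delta_bar = 0.04 * Gamma_bar[:, 2]

best, beta_hat, sse = single_source_search(Gamma_bar, delta_bar)
```

Here the measurements are noise-free, so the true candidate has an essentially zero error and is selected. With noisy measurements the errors of all candidates should be compared before a node is accepted.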

B. MULTIPLE CONTAMINATION SOURCE
Starting with a two-source scenario, it is possible to enumerate the injection nodes two at a time and select the most probable pair. This simple process is time-consuming, and better strategies need to be examined. The simplest such strategy, a greedy algorithm, is used here, and its flowchart is presented in Figure 1.
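Since the flowchart of Figure 1 is not reproduced here, the sketch below shows only one common form of greedy selection, and it may differ from the authors' algorithm. At each step it adds the candidate node that most reduces the sum of squared errors, and it stops when no candidate improves the fit by more than a tolerance. The function names, the tolerance and the small response matrix are assumptions made for illustration.

```python
import numpy as np

def fit_subset(Gamma_bar, delta_bar, cols):
    """Least squares fit using only the candidate columns in `cols`."""
    A = Gamma_bar[:, cols]
    beta, *_ = np.linalg.lstsq(A, delta_bar, rcond=None)
    residual = delta_bar - A @ beta
    return float(residual @ residual), beta

def greedy_sources(Gamma_bar, delta_bar, max_sources, tol=1e-12):
    """Forward greedy selection of contamination sources (illustrative)."""
    selected, beta = [], np.array([])
    best_sse = float(delta_bar @ delta_bar)   # error with no source selected
    while len(selected) < max_sources:
        remaining = [j for j in range(Gamma_bar.shape[1]) if j not in selected]
        if not remaining:
            break
        # Try adding each remaining candidate and keep the best one.
        trials = [(fit_subset(Gamma_bar, delta_bar, selected + [j]), j)
                  for j in remaining]
        (sse, trial_beta), j_best = min(trials, key=lambda t: t[0][0])
        if best_sse - sse <= tol:             # no meaningful improvement
            break
        selected.append(j_best)
        best_sse, beta = sse, trial_beta
    return selected, beta, best_sse

# Hypothetical responses at six measured nodes for four candidate sources.
Gamma_bar = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.5, 1.0],
    [0.0, 0.0, 0.0, 0.5],
    [0.2, 0.0, 0.0, 0.3],
])
delta_bar = 0.04 * Gamma_bar[:, 0] + 0.02 * Gamma_bar[:, 3]

selected, beta, sse = greedy_sources(Gamma_bar, delta_bar, max_sources=3)
```

In this example the search selects candidates 0 and 3 and then stops, because a third source does not reduce the error. A greedy search is not guaranteed to find the best subset in general, which is consistent with the comparison against exhaustive or exact methods mentioned in the paper.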

IV. APPLICATION OF THE PROPOSED MODEL
In order to test the application of the proposed model, this work used a water distribution network of four hundred and forty two (442) as a case study (Figure 2). The network contains 3 reservoirs and 295 nodes after the redundant nodes have been removed. The data defining the network characteristics are available in the work of Adedeji [44].
This application considered three cases. In Case 1, the measured contamination data were generated with two noise levels, 5 and 10%, for contamination injected at source locations 1, 164 and 116. The contamination level at source 1 is 4%, at source 164 is 2% and at source 116 is 1%. Different contamination concentration levels were injected at the different nodes in order to trace the flow of the concentration distribution and its measurement at various nodes. In other words, injecting the same concentration level at every source may not reveal circumstances where two or more contamination concentrations mix at a node.
For Case 2, a contamination level of 4% is injected at source 1 and 2% at source 164, while the injection at source 116 is 0%. This case also considered the two noise levels (5 and 10%) in order to generate the corresponding output. In Case 3, only source 1 has a contamination injection level of 4%, while sources 164 and 116 have 0% injection. Here too, the two noise levels (5 and 10%) were considered. In this work, the computations were performed in the MATLAB software environment.
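The three test cases can be set up as in the following Python sketch. The paper does not state how the noise was generated, so the noise model here (a uniform relative error within plus or minus the noise level) is an assumption, as are the function name `add_noise`, the random placeholder matrix `Gamma_bar`, and the choice to express the injection levels as fractions (4% as 0.04).

```python
import numpy as np

def add_noise(clean, noise_level, rng):
    """Perturb each value by a relative error within +/- noise_level (assumed model)."""
    relative_error = rng.uniform(-noise_level, noise_level, size=clean.shape)
    return clean * (1.0 + relative_error)

rng = np.random.default_rng(3)

# Placeholder responses at 10 measured nodes for sources 1, 164 and 116.
Gamma_bar = rng.uniform(0.0, 1.0, size=(10, 3))

# Injection levels at sources 1, 164 and 116 for the three cases.
cases = {
    "Case 1": np.array([0.04, 0.02, 0.01]),
    "Case 2": np.array([0.04, 0.02, 0.00]),
    "Case 3": np.array([0.04, 0.00, 0.00]),
}

# (case, noise level) -> (noise-free measurements, noisy measurements)
measurements = {}
for name, beta in cases.items():
    clean = Gamma_bar @ beta
    for level in (0.05, 0.10):
        measurements[(name, level)] = (clean, add_noise(clean, level, rng))
```

Each case is then estimated from the noisy measurements with all three sources as candidates, so that a source with a 0% injection can be recognised from its insignificant estimate.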

V. RESULTS AND DISCUSSIONS
This section presents the results and discussions of the cases considered in this study.
The results for Case 1 are presented in Table 1. The estimated coefficients differ between the two noise levels (5 and 10%) for the three sources. The estimates for 5% are relatively lower than the real injected values, i.e., 4, 2 and 1, while those for 10% are slightly higher than the injected values. The other results obtained, in terms of the root mean square error (RMSE), standard error (SE), test statistic (tStat) and probability value (pValue), also differ between the two noise levels. In addition, a coefficient of determination $R^2$ of 0.99894 and an adjusted coefficient of determination $R^2_{adj}$ of 0.99872 were obtained for the 5% noise level. For the 10% noise level, $R^2$ is 0.99567 and $R^2_{adj}$ is 0.99474. The RMSE is 0.000364 for the 5% noise level and 0.000758 for the 10% noise level. Based on these results, sources 1, 164 and 116 are the true contamination sources in the water network (Table 6).
The results obtained for Case 2 show that the estimated contamination values are relatively close to the real injected contamination values for sources 1 and 164 at both noise levels (5 and 10%) (Table 2). However, the SE for source 116 is higher than its estimated value. In addition, the tStat for source 116 is very low compared to those of sources 1 and 164. Furthermore, the higher pValue recorded for source 116 indicates that this source is not a true contamination source. Therefore, it is reasonable to eliminate source 116 and re-validate.
The two probable sources, 1 and 164, were used to re-run the source detection procedure, and the result is presented in Table 3. The results show that the true sources of contamination are at locations 1 and 164. Accordingly, $R^2$ is 0.99937, $R^2_{adj}$ is 0.99928 and RMSE is 0.00034 for the 5% noise level. For the 10% noise level, $R^2$ is 0.99718, $R^2_{adj}$ is 0.99680 and RMSE is 0.000734 (Table 6). These results show that the 0% injection at source 116 has no effect on the network, and they demonstrate the relevance of the proposed model. In the third case, the results for both the 5 and 10% noise levels showed a high pValue for sources 164 and 116, consistent with their real injected contamination value of 0% (Table 4). In addition, the estimated contamination values obtained for sources 164 and 116 are insignificant. This shows that these sources are not true sources of contamination; hence, they were eliminated and the performance of the proposed model was re-evaluated. The subsequent outputs (Table 5) for both the 5 and 10% noise levels show that only source 1, with a 4% contamination injection, is the true source of contamination.
The other results for this case are $R^2$ = 0.999739, $R^2_{adj}$ = 0.999723 and RMSE = 0.00028 for the 5% noise level, and $R^2$ = 0.9989, $R^2_{adj}$ = 0.9988 and RMSE = 0.000578 for the 10% noise level. They are presented in Table 6.
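The quantities reported in this section (SE, tStat, $R^2$, adjusted $R^2$ and RMSE) can be computed for a least squares fit as in the following Python sketch. It is an illustration, not the authors' MATLAB code, and the definitions chosen are assumptions: the fit has no intercept, the total sum of squares is taken about the mean of the measurements, and the RMSE is the square root of the error sum of squares divided by the residual degrees of freedom. The pValue is omitted because it requires the cumulative distribution of Student's t, which NumPy does not provide. The data are synthetic placeholders.

```python
import numpy as np

def ols_diagnostics(A, y):
    """No-intercept least squares fit with per-coefficient diagnostics."""
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ beta
    sse = float(residual @ residual)
    dof = n - p                                   # residual degrees of freedom
    mse = sse / dof
    # Standard errors from the diagonal of the coefficient covariance matrix.
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    sst = float(np.sum((y - y.mean()) ** 2))      # assumed: centred total sum of squares
    r2 = 1.0 - sse / sst
    return {
        "beta": beta,
        "se": se,
        "t_stat": beta / se,
        "r2": r2,
        "r2_adj": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "rmse": float(np.sqrt(mse)),
    }

# Synthetic example in the spirit of Case 2: the third source injects nothing.
rng = np.random.default_rng(2)
A = rng.uniform(0.0, 1.0, size=(40, 3))           # placeholder responses
clean = A @ np.array([0.04, 0.02, 0.0])
y = clean * (1.0 + rng.uniform(-0.05, 0.05, size=clean.shape))

out = ols_diagnostics(A, y)
```

A source whose tStat is small relative to the others, as is expected for the third column in this example, is a candidate for elimination, after which the fit is repeated with the remaining sources.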

VI. CONCLUSIONS AND FUTURE WORKS
In this work, a novel model for the estimation and identification of contamination sources in a water distribution network is presented. The model was applied to a water distribution network using the presented greedy algorithm. The results obtained show the feasibility of the proposed strategy. Based on these results for the three cases considered, it can be concluded that the developed model is efficient and effective for contamination source estimation and identification in a water distribution network.
The proposed model is generic, i.e., it can be applied to both small and large networks. However, the inability to obtain data for large water networks is the present limitation of this work. Owing to the quadratic nature of the objective function and the linearity of the constraints, future work will extend the method to mixed integer quadratic programming and compare it with the heuristic approach. In addition, the proposed greedy algorithm can be examined further by comparing its results with those of the mixed integer quadratic programming approach. Most importantly, this study can serve as a guide in the formulation of an effective optimal sensor placement technique for water supply systems. This is another subject of interest, and further research in this direction is suggested.