Scalable Grid-Based Data Gathering Algorithm for Environmental Monitoring Wireless Sensor Networks

Proper utilization of the limited available power is essential to extend the lifetime of battery-operated wireless sensor networks (WSNs) for environmental monitoring applications, because replacing or recharging the batteries after deployment is impractical. To utilize the power properly, an appropriate cluster-based data gathering algorithm is needed that significantly reduces the overall power consumption of the network. In this paper, a grid-based data gathering algorithm called energy-efficient structured clustering algorithm with relay (EESCA-WR) is proposed. In this algorithm, the grids have a single grid leader (GL) and multiple grid relays (GRs). The number of GRs in a grid varies with the geographic location of the grid with respect to the destination sink (DS). This reduces power consumption through multi-hop, short-distance data communications. Also, the GLs are rotated at suitable intervals in hybrid modes to considerably reduce the number of control messages. A hybrid GL selection policy, a threshold-based GL rotation policy, and the policy of allotting dedicated relay-clusters in every grid distinguish the proposed algorithm and make it suitable for homogeneous and heterogeneous wireless sensor networks. Performance evaluation of the proposed algorithm is carried out by varying the length of the field, the node-density, the grid-count, and the initial energy. Experimental results show that EESCA-WR is highly scalable and energy-efficient, uses a minimum number of control messages, and can be used for large-scale WSNs.


I. INTRODUCTION
A. WSN FOR ENVIRONMENTAL MONITORING
Environmental monitoring is essential to forecast and control large disasters in remote areas of the world. Monitoring sediment transport processes in coastal areas [1], monitoring factory environmental quality [2], examining forest fires [3]-[5], predicting the eruption of a volcano [6], [7], and forecasting snow-melt [8] are some of the best examples of environmental monitoring applications [9]. Various sensors are available to sense physical parameters like temperature, pressure, and vibration from the environment. Modern sensors are tiny, handy, and compatible with wireless communication devices. Generally, wireless communication is preferred in environmental monitoring applications where a fixed wired communication structure is not feasible. A sensor node (or simply a node) is a device that is capable of gathering sensory information, performing some processing, and communicating with other nodes. When many sensor nodes are deployed in the above-mentioned large-area monitoring fields to form wireless sensor networks (WSNs) [10], obtaining the individual data from all the sensor nodes is not useful because the data from the same region is highly correlated. Also, the nodes which are deployed far from the destination sink (DS) lose their energy soon due to the large communication distances. To avoid redundant data as well as long-distance communications, the data from the same region (grid) can be combined into a single data item that states the aggregated information about the grid. The aggregated data can be sent to the DS. When direct communication happens between a grid leader (GL) and the DS, the GL alone has to communicate the aggregated data to the DS. This methodology can be used for small-area networks. But, when the distance between the grids and the DS is large, the data should be routed via other grids using multi-hop communication. When the routing is dynamic, the route has to be discovered every time, which requires more control messages. Instead, the routing can be done by permanently dedicating some nodes as relays. They are not involved in sensing information from the field. They just receive the information from some nodes and send the received information to some other nodes based on the algorithm. Thus, for grouping the nodes properly and transmitting the useful data to the distant DS through the appropriate route, a good clustering-based data gathering algorithm is preferred.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhangbing Zhou.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A well-developed algorithm is useful for improving the life-span of the WSN since the wireless sensor nodes are supplied with tiny batteries with limited energy. The paper aims to propose a scalable energy-efficient grid-based data gathering algorithm which minimizes the control messages substantially.
The rest of the paper is organized as follows: Section II elaborates on the related works and our contributions; Section III describes the proposed algorithm in detail; Section IV presents the simulation results obtained; and Section V concludes the paper.

II. RELATED WORKS
Extensive research has been reported in the literature on making the network energy-efficient. Among the proposed techniques, clustering is widely accepted and also scales well. The cluster head (CH) selection process has been carried out in probabilistic and non-probabilistic ways [11], [12]. Low energy adaptive clustering hierarchy (LEACH) [13] is one of the important algorithms which proposed the probabilistic approach for CH selection. Then, modified versions of LEACH [14] were introduced. They followed the same approach with minor updates. After that, hybrid algorithms like [15], [16] considered the residual energy along with the probabilistic approach. However, in the aforementioned approaches, nodes with low residual energy may be selected as CHs and the cluster count is variable. This leads to load imbalance among the nodes. To address this problem, the algorithms in [17], [18] selected the CH in non-probabilistic ways by considering parameters such as the residual energy, the node proximity, and the distance between the CHs and the DS. Even though the load is balanced and the network lifetime is improved in these approaches, forming the clusters incurs more control overhead. Also, the clusters which are located far from the DS lose their connectivity soon. To overcome this problem, some algorithms used single-hop communication for intra-cluster communication and multi-hop communication for inter-cluster communication [19]. In these approaches, the CHs in the intermediate regions act as routers for the distant clusters. However, this increases the load of the intermediate CHs and reduces the lifetime of the network. Meanwhile, grid-based data gathering algorithms were developed [19], [20] to have fixed clusters. In these approaches also, the data is routed via the CHs.
Later, some algorithms like scalable energy efficient clustering hierarchy (SEECH) [21] allotted some nodes as relays to route the aggregated data from the CHs to the DS. But, in this approach, a relay node has to receive multiple messages, from its own CH and from other clusters, which reduces the network lifetime. All the above-mentioned algorithms were proposed for homogeneous WSNs. On the other hand, the stable election protocol (SEP) [22] was introduced for two-level heterogeneous WSNs. In this algorithm, the nodes are supplied with different initial energies to check the performance of the network. Later, the distributed energy-efficient clustering (DEEC) algorithm [23] was suggested for multi-level heterogeneous WSNs. The authors of SEP, DEEC, and other successive algorithms [24] claimed that most of the algorithms developed for homogeneous networks are not suitable for heterogeneous networks.
Motivated by the literature, in this paper, a grid-based data gathering algorithm called energy-efficient structured clustering algorithm with relay (EESCA-WR) is proposed. The major contributions of this research work are as follows.
• A large rectangular field is selected for the investigation. It exhibits the algorithm's capability to be utilized in huge-area wireless sensor networks.
• The algorithm is tested by varying the length of the field from 200 m to 800 m. It indicates that the algorithm is extremely scalable.
• Complete useful data percentage (CUDP) is used for analyzing the lifetime of the network. It shows the accurate performances of the algorithm for various sizes of fields.
• A concept of relay-cluster is introduced for balancing the load among the nodes. It is a novel technique to route the data from the grids to the DS easily.
• The GL is rotated in multiple modes. The role of the GL is retained for the same node until an energy-based threshold is reached in mode 1. It is to avoid the reclustering process every round. In mode 2, the GL role is rotated based on the energy threshold. This methodology reduces control messages.
• The algorithm is designed to suit both homogeneous and heterogeneous WSNs.

III. PROPOSED EESCA-WR ALGORITHM
The proposed network field arrangement is shown in Fig. 1.
The nodes are randomly deployed in rectangular fields with dimensions ranging from 100 m × 200 m to 100 m × 800 m. The total number of nodes in the field, $n_t$, is varied from 200 to 1200. The DS is located at the fixed position $(M_f/2, N_f)$ as suggested in [25], where $M_f$ and $N_f$ are the dimensions of the field. The field is segregated into a number of grids with the dimensions $M_g = X_f/2$, and [...] where $d_0$ is the distance threshold which is calculated by using (3).

2) ROUTING MECHANISM
The data routing mechanism of the proposed algorithm is shown in Fig. 2. The grid normal nodes (GNs) transfer the sensed data to the GL. The GL aggregates the data and passes it to a grid relay (GR). The data from the lower layer grids is passed to the DS through the GRs in the higher layers. The number of GRs in a grid's relay-cluster is decided by the position of the grid with respect to the DS. The grids which are far from the DS have only one GR node. Thus, the GR present in a lower grid transmits the aggregated data obtained from the GL to a GR in the next higher layer grid. The higher layer grids have multiple GRs. Every GR receives one message, either from the local GL or from the lower layer grid's GR, and sends one message, either to the higher layer grid's GR or to the DS. By doing this, the workload of the nodes is balanced and the relay nodes are allowed to be in sleep mode most of the time. Also, the nodes which are allotted as GRs act only as GRs throughout the operation. This ensures stable connectivity between the grids and the DS.
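The fixed relay chain described above can be illustrated with a minimal sketch. It assumes, for illustration only, that layers are numbered from 1 (farthest from the DS) to `num_layers` (closest to the DS), and that each grid therefore needs as many GRs as its layer index so that every GR carries exactly one flow. The function names `relay_path` and `grs_needed` are hypothetical and do not come from the paper.

```python
def relay_path(source_layer: int, num_layers: int) -> list[str]:
    """Hops taken by the aggregated data of a grid in `source_layer`.

    Assumption: layer 1 is farthest from the DS, layer `num_layers` is closest.
    """
    if not 1 <= source_layer <= num_layers:
        raise ValueError("source_layer must lie in 1..num_layers")
    path = [f"GL(layer {source_layer})"]
    # One dedicated GR per layer, from the source layer up to the top layer.
    for layer in range(source_layer, num_layers + 1):
        path.append(f"GR(layer {layer})")
    path.append("DS")
    return path


def grs_needed(layer: int) -> int:
    """GRs a grid in `layer` needs if each GR handles exactly one flow:
    one for its own GL plus one for each lower layer routed through it."""
    return layer
```

For example, with three layers the data of a bottom-layer grid would pass through one GR in each of the three layers before reaching the DS.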

3) ASSUMPTIONS
The assumptions made in this work are listed as follows.
• The nodes are stationary after deployment.
• The information is sent to the DS every round.
• The nodes can identify their location coordinates.
• To compare the results with other clustering schemes, it is assumed that each node can transmit its data packet to the DS directly.
• The nodes use power control to adjust the transmit power.

4) NETWORK ENERGY MODEL
The energy of a node is dissipated when the node senses the environmental parameter, transmits the data to other nodes, receives the data from other nodes, and aggregates the received messages from other nodes when acting as a GL. Among these, the sensing takes negligible energy. The energy consumption of a node when transmitting and receiving any data is calculated using the simple first-order radio model which is used in [13]. The energy required for data transmission, $E_{tx}$, is given as:

$$E_{tx}(k, d) = \begin{cases} kE_{elec} + k\varepsilon_{fs} d^{2}, & d < d_0 \\ kE_{elec} + k\varepsilon_{mp} d^{4}, & d \geq d_0 \end{cases} \tag{1}$$

The energy required for data reception, $E_{rx}$, is given as:

$$E_{rx}(k) = kE_{elec} \tag{2}$$

where $k$ is the number of message bits, $d$ is the distance of communication in meters, $E_{elec}$ is the energy required to run the transmit ($T_x$) or receive ($R_x$) circuitry, $\varepsilon_{fs}$ is the energy consumption in free space, $\varepsilon_{mp}$ is the energy consumption in multipath, and $d_0$ is the distance threshold which can be calculated as:

$$d_0 = \sqrt{\frac{\varepsilon_{fs}}{\varepsilon_{mp}}} \tag{3}$$
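As a rough illustration of the first-order radio model, the following sketch computes the transmit and receive energies and the distance threshold. The numeric constants are taken from the simulation parameters in Section IV, with two assumptions: the 50 nJ/bit value is treated as $E_{elec}$, and the free-space coefficient is taken as 10 pJ/bit/m$^2$, which is the value that yields $d_0 \approx 87$ m. The function names are illustrative.

```python
import math

E_ELEC = 50e-9        # J/bit, electronics energy (assumed to be the 50 nJ/bit of Section IV)
EPS_FS = 10e-12       # J/bit/m^2, free-space amplifier coefficient (assumed pJ scale)
EPS_MP = 0.0013e-12   # J/bit/m^4, multipath amplifier coefficient


def distance_threshold(eps_fs: float = EPS_FS, eps_mp: float = EPS_MP) -> float:
    """Crossover distance d_0 between the free-space and multipath regimes."""
    return math.sqrt(eps_fs / eps_mp)


def tx_energy(k: int, d: float) -> float:
    """Energy in joules to transmit k bits over d meters."""
    if d < distance_threshold():
        return k * E_ELEC + k * EPS_FS * d ** 2
    return k * E_ELEC + k * EPS_MP * d ** 4


def rx_energy(k: int) -> float:
    """Energy in joules to receive k bits."""
    return k * E_ELEC
```

With these constants, `distance_threshold()` evaluates to roughly 87.7 m, in line with the 87 m threshold quoted in the simulation parameters.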

5) PROBLEM STATEMENT
The ultimate aim of this research work is to develop a grid-based data gathering algorithm with multi-hop routing which produces an energy-efficient network for gathering data over a prolonged period. As the sensor nodes are energy constrained, reducing the overall energy consumption of the network is essential. In addition, the workload should be evenly shared among the nodes to avoid the early death of overburdened nodes. The algorithm should also achieve good scalability. Finally, a fast response is preferable to reduce the delay in operation.

B. ALGORITHM DETAILS
The lifetime of the network depends on the lives of individual nodes. So, in the proposed algorithm, the load is evenly balanced among the nodes. The lifetime of the network is maximized by
• minimizing the transmission distance for GNs.
• minimizing the number of GNs associated with the GL.
• allotting the responsibility for each GR to receive the data from one node only (either from the GL of the same grid or from the GR of the lower layer grid) and to send the data to one node only (either to the GR of the higher layer grid or to the DS).
Various stages of the EESCA-WR are described in the following sections, where $i$ refers to the round count, $j$ refers to the node index, $l$ refers to the virtual centerline, $k_{opt}$ refers to the optimum cluster-count which is derived in (5), and neigh refers to the neighbour nodes. The other parameters used are self-explanatory.

1) NODE INITIALIZATION
In the beginning, the nodes are initialized with preliminary information such as their locations and their distances to the other nodes and to the DS. Then, the average communication distance (ACD) of each node is calculated from $d_{j \to neigh}$, the distance of the $j$-th node to the other nodes, and $n_g$, the number of nodes in the grid. The node-level initialization process is given in Algorithm 1.
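A minimal sketch of the ACD computation is given below. It assumes that the ACD of a node is the plain mean of its Euclidean distances to the other $n_g - 1$ nodes of the same grid; the exact normalization used by the authors is not recoverable from the text, so this choice is an assumption, as is the function name.

```python
import math


def average_communication_distance(nodes: list[tuple[float, float]]) -> list[float]:
    """Return the ACD of every node in one grid.

    nodes: (x, y) coordinates of the n_g nodes of the grid.
    Assumption: ACD_j is the mean distance from node j to the other n_g - 1 nodes.
    """
    n_g = len(nodes)
    if n_g < 2:
        return [0.0] * n_g
    acd = []
    for j, (xj, yj) in enumerate(nodes):
        total = sum(
            math.hypot(xj - x, yj - y)
            for i, (x, y) in enumerate(nodes)
            if i != j
        )
        acd.append(total / (n_g - 1))
    return acd
```

For three collinear nodes at x = 0, 1, and 2, the middle node gets the smallest ACD, which matches the intuition that the least-ACD node sits near the center of the grid.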

2) GR SELECTION
After the node initialization process, the $m$ nodes which are closest to the virtual vertical centerline of the field elect themselves as GRs. When there are $p$ layers in the field, the $p$-th layer grid has $p$ GRs. In the bottom layers, only one GL and one GR are present. So, they associate with each other. [...] distance between them. The detailed algorithm for GR selection is shown in Algorithm 2.

Algorithm 2 (closing steps):
    if i > 1 then
        if status_j^{i-1} == GR then
            Set status_j^i = GR
        end if
    end if
end if
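The GR election rule can be sketched as follows. The sketch assumes that the centerline is a vertical line at a known x-coordinate, that a grid in layer `layer` needs `layer` GRs, and that ties are broken by node ID; these details and the function names are illustrative assumptions rather than the authors' Algorithm 2.

```python
def select_grid_relays(
    nodes: dict[int, tuple[float, float]], centerline_x: float, layer: int
) -> list[int]:
    """Pick the `layer` nodes of one grid that lie closest to the centerline.

    nodes: node ID -> (x, y) position for the nodes of a single grid.
    Ties are broken by node ID (an assumption made for determinism).
    """
    ranked = sorted(nodes, key=lambda nid: (abs(nodes[nid][0] - centerline_x), nid))
    return ranked[:layer]


def carry_over_status(previous_status: str, proposed_status: str) -> str:
    """A node that was a GR in the previous round remains a GR."""
    return "GR" if previous_status == "GR" else proposed_status
```

The second helper mirrors the surviving steps of Algorithm 2: once a node has been elected GR, it keeps that role in later rounds.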

3) GL SELECTION
The GL is selected in hybrid modes based on the centrality and the nodes' residual energies as mentioned in [26], [27]. The node-level GL selection process is depicted in Algorithm 3. In mode 1, the node which holds the least ACD in the grid acts as the GL for the initial rounds. The node which has the least ACD is in the center of the grid, so the communication distance of the GNs is reduced. In mode 2, the node which has the highest residual energy gets the leadership role every round. This is done to prolong the network lifetime.
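The two selection modes can be sketched as below. The switch from mode 1 to mode 2 is modeled here with a hypothetical parameter `threshold_fraction`: the central node keeps the GL role until it has spent that fraction of its initial energy. The default of 0.4 is borrowed from the 40% figure quoted for EESCA in Section IV and is only an assumption for EESCA-WR; the function names and the candidate record layout are also illustrative.

```python
def choose_mode(
    central_node_energy: float, initial_energy: float, threshold_fraction: float = 0.4
) -> int:
    """Return 1 while the central node has spent less than the threshold, else 2."""
    spent = initial_energy - central_node_energy
    return 1 if spent < threshold_fraction * initial_energy else 2


def select_grid_leader(candidates: dict[int, dict[str, float]], mode: int) -> int:
    """Pick the GL among the non-relay nodes of one grid.

    candidates: node ID -> {"acd": ..., "energy": ...}.
    Mode 1 picks the least ACD, mode 2 picks the highest residual energy.
    """
    if not candidates:
        raise ValueError("no candidate nodes in the grid")
    if mode == 1:
        return min(candidates, key=lambda nid: (candidates[nid]["acd"], nid))
    return max(candidates, key=lambda nid: (candidates[nid]["energy"], -nid))
```

Keeping the same GL throughout mode 1 is what avoids re-clustering, and its control messages, in every round.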

4) GRID FORMATION
The nodes in a grid take the roles of the GR, GL, and GN based on the algorithm. In every grid, some nodes form the relay-cluster and act as the GRs, one node acts as the GL, and the remaining nodes act as the GNs. After the GRs and GLs are selected and associated with appropriate nodes, the GL begins its operation by broadcasting a message, GL-MSG, including its ID and a secret code to the other nodes in the grid. The GNs send a join message, join-msg, to the respective GLs. Thus the grid is formed with minimum control messages. The operation is described in Algorithm 4. After that, a time division multiple access (TDMA) schedule is created for all the nodes for the data transmission. This avoids data collisions in the network.

Algorithm 4:
if status_j^i = GL then
    Broadcast GL-MSG with ID_j
end if
if status_j^i = GN then
    Join with the GL in the grid
end if
if status_j^i = GR then
    Join with the GL in the grid or join with the closest possible GR in the higher layer grid
end if
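The grid formation step can be sketched as a small function that takes the roles already assigned in one grid and returns the GL, a TDMA slot for each GN, and the number of control messages exchanged. The slot ordering by node ID and the message count (one GL-MSG broadcast plus one join-msg per GN, ignoring any GR association traffic) are simplifying assumptions, and `form_grid` is a hypothetical name.

```python
def form_grid(status: dict[int, str]) -> tuple[int, dict[int, int], int]:
    """Form one grid from the per-node roles "GL", "GR", or "GN".

    Returns (GL node ID, TDMA slot per GN, control message count).
    """
    leaders = [nid for nid, role in status.items() if role == "GL"]
    if len(leaders) != 1:
        raise ValueError("a grid must have exactly one GL")
    members = sorted(nid for nid, role in status.items() if role == "GN")
    # Each GN gets its own slot, so transmissions to the GL never overlap.
    tdma_slots = {nid: slot for slot, nid in enumerate(members)}
    # One GL-MSG broadcast by the GL plus one join-msg from every GN.
    control_messages = 1 + len(members)
    return leaders[0], tdma_slots, control_messages
```

Under these assumptions the control overhead of forming a grid grows linearly with the number of GNs, and no further messages are needed while the same GL is retained.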

C. METHODOLOGIES USED FOR IMPROVING NETWORK LIFETIME
The main motive of this work is to reduce energy consumption among the nodes to enhance the network's lifetime. Literature results show that the grid-count and the communication distance between the nodes affect the energy consumption of the total network. So, an effort has been taken for getting optimum grid-count and reduced communication distance between the nodes.

1) OPTIMIZATION OF GRID-COUNT
In the proposed algorithm, the grid-count is decided based on the dimensions of the grid. The relay-cluster of every grid should be within the reach of the relay-cluster of the higher layer grids. This is done mainly to reduce the communication distance between the GRs. In this work, $N_g$ is fixed as $N_f/2$ and $M_g$ is scalable to approximately $(M_f/d_0/2)$ m. So, the optimum grid-count, $k_{opt}$, is derived as given in (5).
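As a rough numerical illustration of the grid-count, the sketch below assumes two grid columns across the 100 m width and grid rows of a fixed length along the field. The row length of 40 m is an assumption chosen because it is slightly below $d_0/2$ and reproduces the grid-counts of 10, 20, and 40 used in Section IV for field lengths of 200 m, 400 m, and 800 m; it is not the paper's expression (5), and `grid_count` is a hypothetical helper.

```python
import math


def grid_count(field_length_m: float, row_length_m: float = 40.0, columns: int = 2) -> int:
    """Number of grids when the field is cut into `columns` columns and
    rows of at most `row_length_m` meters along its length."""
    if field_length_m <= 0 or row_length_m <= 0 or columns <= 0:
        raise ValueError("all arguments must be positive")
    rows = math.ceil(field_length_m / row_length_m)
    return columns * rows
```

Keeping the row length below $d_0$ keeps every GR-to-GR hop in the free-space regime of the radio model, where the cost grows with $d^2$ rather than $d^4$.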

2) MINIMIZATION OF COMMUNICATION DISTANCE
• When the grid size is restricted to $(M_f/d_0/2,\ N_f/2)$, the maximum distance between a GN and the GL is limited as given in (6).
• Since the GL is in the middle of the grid in the initial stages, $d_{GL \to GR}$ in the grid is much less than $d_{j \to GL}$. However, in the worst case, the maximum distance between the GL and the GR is also limited to (6).
• The distance from a GR to a higher layer GR is limited as given in (7).

D. ENERGY CONSUMPTION
The time-lines of the GNs, GLs, and GRs are shown in Fig. 3. The GNs and GRs are active for the required timings and at sleep state for the remaining duration. The GL has to be active throughout the operation since it receives the data regularly from GNs, aggregates the data, and sends the aggregated data to the GR. The overall energy consumption per round is analyzed in the following sections.

1) ENERGY CONSUMPTION IN GNs
The role of a GN is to sense the environmental parameter and send that to the GL. Thus, the energy consumption of a GN is given as:

2) ENERGY CONSUMPTION IN GLs
The GL does the collection of all the information from the GNs, aggregates the data to make a piece of single information, and passes it to the GR for further transmission. Thus, the energy consumption of a GL is given as: where n g is the number of the nodes in the grid.

3) ENERGY CONSUMPTION IN GRs
A GR in the relay-cluster collects the information either from the GL or from a lower layer GR. Thus, all the GRs receive and send only one data per round. So, the energy consumption of a GR in the grid closest to the DS is given as: The energy consumption of a GR in other layer grids is given as:
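The per-round energy of each role can be sketched from the first-order radio model. The sketch assumes that all hops are shorter than $d_0$ (free-space regime), that a GL spends the aggregation energy once per received message, and that the 50 nJ/bit parameter of Section IV is $E_{elec}$. The GR expression follows the $2kE_{elec} + k\varepsilon_{fs}d^2$ term that survives in (14); the GN and GL expressions are assumptions consistent with the role descriptions, not the authors' exact equations.

```python
import math

E_ELEC = 50e-9   # J/bit, electronics energy (assumed)
EPS_FS = 10e-12  # J/bit/m^2, free-space coefficient (assumed pJ scale)
E_DA = 5e-9      # J/bit/message, data aggregation energy


def gn_energy(k: int, d_to_gl: float) -> float:
    """A GN only transmits its k-bit reading to the GL."""
    return k * E_ELEC + k * EPS_FS * d_to_gl ** 2


def gr_energy(k: int, d_to_next: float) -> float:
    """A GR receives one k-bit message and forwards one k-bit message."""
    return 2 * k * E_ELEC + k * EPS_FS * d_to_next ** 2


def gl_energy(k: int, n_members: int, d_to_gr: float) -> float:
    """A GL receives n_members messages, aggregates them, and sends one."""
    receive = n_members * k * E_ELEC
    aggregate = n_members * k * E_DA
    transmit = k * E_ELEC + k * EPS_FS * d_to_gr ** 2
    return receive + aggregate + transmit
```

With 4000-bit packets and 20 m hops, a GR spends about twice what a GN spends, while a GL serving ten GNs spends roughly an order of magnitude more, which is why the GL role is the one that gets rotated.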

4) ENERGY CONSUMPTION IN A GRID
In a grid, the GNs sense the required physical parameters from the environment and send the details to the GL. The GL receives all the details from the GNs, aggregates them into a single data item, and transmits it to the nearest GR. The GRs in the relay-cluster collect the information from the GL and from a lower layer GR, if any. So, the energy consumption of higher layer grids is given as: where $GR_{tot}$ is the number of GRs in a grid. The energy consumption of other layer grids is given as: Even though the higher layer grids have more relays than the lower layer grids, every relay node receives only one message every round. Hence the energy consumption of all the grids is almost the same.

5) OVERALL ENERGY CONSUMPTION OF THE NETWORK
The energy consumption of the network depends on the number of GRs and $k_{opt}$. So, the energy consumption of the total network in a round is given as:

$$\cdots + GR_{totL}\,\bigl(2kE_{elec} + k\varepsilon_{fs}\, d_{to\_Higher\_layer\_GR}^{2}\bigr)\} \tag{14}$$

where $GR_{totH}$ is the number of GRs in higher layer grids and $GR_{totL}$ is the number of GRs in lower layer grids.

IV. SIMULATION RESULTS AND ANALYSIS
In this section, the performance of EESCA-WR is compared with the LEACH [13], EESCA [26], and SEECH [21] algorithms. LEACH uses a probabilistic threshold to select the CHs. Since the CHs are selected randomly, the algorithm does not consider the residual energy of the nodes, which results in frequent data packet losses. Also, the CHs send the data directly to the DS. Hence, this algorithm is not suitable for large monitoring fields. EESCA selects the CH in hybrid modes. Mode 1 uses the node centrality and mode 2 uses the residual energy as parameters. The data from the lower layer clusters is routed via the CHs of the higher layer clusters. This affects the load balancing of the network when the algorithm is used in large fields. SEECH uses the residual energy and node degree as the main parameters for the CH selection. As mentioned in Section II, SEECH separates the relay role from the CH and appoints some nodes to take that role. But, in this approach, the relay nodes have to receive messages from multiple clusters, which reduces the network lifetime. Both LEACH and SEECH rotate the CH every round, whereas EESCA rotates the CH only after the CH loses 40% of its initial energy. This reduces the control messages used for re-clustering.

A. SIMULATION PARAMETERS
Simulations are carried out in MATLAB R2018a. The simulation parameters used are as follows: packet size, $k$ = 4000 bits; transmit energy, $E_{tx}$ = 50 nJ/bit; threshold distance, $d_0$ = 87 m; multipath energy, $\varepsilon_{mp}$ = 0.0013 pJ/bit/m$^4$; free space energy, $\varepsilon_{fs}$ = 10 pJ/bit/m$^2$; and data aggregation energy, $E_{da}$ = 5 nJ/bit/message. The DS is fixed at the location $(M_f/2, N_f)$. The width of the field, $N_f$, is fixed as 100 m and the length of the field, $M_f$, is varied from 200 m to 800 m, whereas the number of nodes, $n_t$, is varied from 400 to 800 for checking the scalability. The initial energy supplied to the node, $E_0$, ranges from 0.25 J to 1 J to evaluate the impact of the initial energy on the lifetime of the network. In LEACH, the communication occurs directly between the CH and the DS, which is highly improbable when the length of the field is more than the communication range of the nodes. In this work, the communication range of the nodes is assumed to be unlimited for comparison. In SEECH, the internal parameters used are as follows: needed CH candidates, $K_{CHC}$ = 8; needed CHs, $K_{CH}$ = 3; needed relays, $K_R$ = 10; and specific radius, RNG = 55 m, as prescribed in scene 1 of SEECH. To evaluate the network lifetime, first node die (FND) and last node die (LND) are the common parameters used in most of the algorithms. In addition to those, the parameters quarter node die (QND), half node die (HND), and 90% node die (90%ND) are used in this work. When very few nodes in the same region stay alive for many rounds, the LND is high, but the data obtained from those nodes cannot give the overall information about the field. So LND alone is not a suitable indicator of the effectiveness of the algorithm. Therefore, in addition to the above-mentioned parameters, a new parameter, CUDP, is introduced to evaluate the performance of the algorithms. The CUDP is described as follows: By using the CUDP, the effectiveness of the algorithm can be tested successfully.
When the CUDP is higher, the algorithm can be considered more energy-efficient.
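The node-death lifetime metrics named above can be computed from the round in which each node dies, as in the following sketch. It assumes that a "p% node die" metric is the round at which the smallest number of nodes amounting to at least p% of the network has died; this rounding convention and the name `lifetime_metrics` are assumptions. The CUDP is not included because its defining expression is not available here.

```python
def lifetime_metrics(death_rounds: list[int]) -> dict[str, int]:
    """Compute FND, QND, HND, 90%ND, and LND from per-node death rounds."""
    if not death_rounds:
        raise ValueError("death_rounds must not be empty")
    ordered = sorted(death_rounds)
    n = len(ordered)

    def round_when(percent: int) -> int:
        # Smallest node count that covers at least `percent` of the network,
        # computed with integer arithmetic to avoid floating-point rounding.
        count = max(1, (percent * n + 99) // 100)
        return ordered[count - 1]

    return {
        "FND": ordered[0],
        "QND": round_when(25),
        "HND": round_when(50),
        "90%ND": round_when(90),
        "LND": ordered[-1],
    }
```

A large gap between 90%ND and LND in such a summary is the situation described above, where a few surviving nodes inflate the LND without covering the field.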

1) IMPACT OF THE LENGTH OF THE FIELD
When the width of the field is kept constant and the length of the field is increased, the nodes have to send the data over longer distances. This affects the network lifetime. Fig. 4 shows the simulation results used to analyze the impact of the length of the field. The simulations are carried out with the following parameters: $n_t$ is 200, 400, and 1200, $M_f$ is 200 m, 400 m, and 800 m, and $E_0$ = 0.5 J.
In EESCA-WR, $k_{opt}$ is set as 10, 20, and 40 respectively. From the results, it can be observed that EESCA-WR performs better than the other algorithms. EESCA-WR achieves approximately 1.8, 1.3, and 0 times more CUDP for scene 1, 14.2, 2.1, and 1.6 times more CUDP for scene 2, and 892, 646, and 10.1 times more CUDP for scene 3, compared to LEACH, SEECH, and EESCA respectively. The FND is almost equal in SEECH and EESCA-WR in scene 2. Similarly, EESCA gets better CUDP in scene 1. These results indicate that SEECH and EESCA do better when the field is small. Also, in scene 1, it is observed that LEACH, SEECH, and EESCA perform reasonably well. But, in scenes 2 and 3, when the area of the field is increased, only EESCA-WR works better compared to the other algorithms.

2) IMPACT OF THE NODE DENSITY
When $n_t$ is increased in the same area of the field, the GLs have to receive more messages from the GNs within the grid. So, the residual energy, $E_{res}$, of the GLs is reduced quickly. Also, the rotation of the GLs takes place frequently and more control messages are needed in mode 2. To evaluate the impact of node density on each approach, the parameter $n_t$ is varied from 400 to 800. Fig. 5 shows the results of each approach for the simulation setting: $n_t$ is 400, 600, and 800 respectively, $M_f$ = 400 m, and $E_0$ = 0.5 J. In EESCA-WR, $k_{opt}$ is set as 20. The results show that the FND of EESCA-WR is significantly higher than that of the other algorithms. EESCA-WR achieves approximately 14.2, 2.2, and 1.3 times more CUDP in all the scenes, compared to LEACH, SEECH, and EESCA respectively.

3) IMPACT OF THE INITIAL ENERGY
The impact of the initial energy is tested by providing $E_0$ of 0.25 J, 0.5 J, and 1.0 J for the nodes while keeping the other parameters constant. The performance of the algorithms is shown in Fig. 6. The simulation parameters are as follows: $n_t$ is 400, $M_f$ is 400 m, and $E_0$ = 0.25 J, 0.5 J, and 1 J. In EESCA-WR, $k_{opt}$ is set as 20. Results show that EESCA-WR achieves a better network lifetime. On average, EESCA-WR achieves approximately 14.2, 2.1, and 1.6 times more rounds compared to LEACH, SEECH, and EESCA respectively. However, varying $E_0$ does not change the relative performance, because the network lifetime is linearly proportional to $E_0$.

4) IMPACT OF THE GRID-COUNT
When $k_{opt}$ is large, the intra-grid communication distance is reduced, but the control overhead increases. On the other hand, when $k_{opt}$ is small, the GLs have to receive more information from the normal nodes and the inter-grid distance is larger. To evaluate the impact of $k_{opt}$ on EESCA-WR, the sensor field is partitioned into 12, 16, and 20 grids, with the following simulation parameters: $n_t$ is 400, $M_f$ is 400 m, and $E_0$ = 0.5 J. Fig. 7 shows the simulation results for this comparison. In this comparison, $k_{opt}$ = 20 is the most appropriate for EESCA-WR. The expression for deriving the optimum grid-count is given in (5). According to that, a shorter communication distance is a better way to achieve lower energy consumption.
In EESCA-WR, the GNs only need to transmit the data locally to their GL located at a short distance. The energy consumed by them is thus reduced, and the distance between the GL and the GR is also small.

5) AVERAGE RESIDUAL ENERGY
The average residual energy after 500 rounds for the algorithms is given in Fig. 8. The simulation parameters used are: $n_t$ is 400, $M_f$ is 400 m, and $E_0$ = 0.5 J. In EESCA-WR, $k_{opt}$ is set as 20. From the results, it can be observed that in EESCA-WR the energy consumption is more evenly distributed among the nodes than in the other algorithms. Similarly, the average energy consumption is much lower than that of the LEACH, SEECH, and EESCA algorithms.

6) HETEROGENEITY TEST
The nodes are equipped with different initial energies in heterogeneous WSNs. To test the heterogeneity, the nodes are [...] improved when the heterogeneity is introduced. Fig. 9 and Fig. 10 show the impact of heterogeneity on network lifetime.

V. CONCLUSION
The limited energy provided with the sensor nodes should be carefully utilized to prolong the lifetime of the WSNs used for environmental monitoring applications. To achieve a good network lifetime by reducing the overall energy consumption, in this work, an energy-efficient structured clustering algorithm with relay (EESCA-WR) is proposed. The proposed algorithm incorporates the following aspects:
• An optimum grid-count is kept to reduce the communication distance between the nodes.
• All the nodes have an equal chance to serve as GL to ensure load balancing.
• A fixed relay-cluster is maintained to establish the constant route path in the network. This arrangement reduces the control messages considerably and ensures fast operation.
The conventional algorithm LEACH doesn't use the relay concept and performs better for smaller fields. EESCA uses fixed clusters and performs well for small fields. On the other hand, SEECH uses relays for routing the data from the CH to the DS, but it overloads the relays and does reasonably well for semi-large fields. The proposed algorithm EESCA-WR uses the relays in a better way and is suitable for both small and large fields. Simulation results show that EESCA-WR performs well compared to LEACH, SEECH, and EESCA in terms of four parameters: network lifetime, scalability, average energy consumption, and fast response. Moreover, EESCA-WR is tested with heterogeneous nodes and performs well. Thus, the proposed algorithm is shown to be suitable for large-scale environmental monitoring applications. This work can be extended to multilevel heterogeneous networks and can be tested using various network simulators in the future.