DRTP: A Disruption Resilient Hop-by-Hop Transport Protocol for Synchrophasors Measurement in Electric Transmission Grids

In a modern electric power transmission grid, the phasor measurement unit requires reliable transport of its sampled statistics with a low end-to-end failure rate (EEFR) to ensure the accuracy of grid state estimation. However, the EEFR can be increased by packet losses due to multiple link disruptions in the primary forwarding path (PP). To address this, we investigate a novel disruption resilient transport protocol (DRTP) that enables hop-by-hop retransmission utilizing the redundant subpaths (RSPs) available to the PP to increase reliability. DRTP also addresses a new distributed collaboration issue under multiple link failures to avoid cache mismatching, an issue that existing approaches have not considered. DRTP was evaluated in the ndnSIM simulator over both typical and general routes constructed from real transmission grids. The numerical results demonstrate that it has a significant advantage in reducing the EEFR while maintaining a low end-to-end delivery time under serious link disruptions.


I. INTRODUCTION
In modern electric power transmission grids, digitalized communication and measurement infrastructure has been fully deployed to support grid operations [1], [2], [3], [4], [5]. Generally, at each critical substation of the grid, phasor measurement units (PMUs) are deployed to sample the electrical statistics of transformers and buses, synchronized in time [2], [6]. The sampled statistics are encapsulated in packets stamped with the time obtained from the global positioning system; the communication process is therefore known as synchrophasor measurement [6]. At runtime, the packets are transported from a PMU over a communication network to the phasor data concentrator (PDC) at a substation at a fixed frequency [7]. Such transport requires a low end-to-end failure rate (EEFR) [8] to ensure the precision of power grid state estimation [9]. Otherwise, a higher EEFR produces measurement incompleteness that causes inaccuracy [10], e.g., costing 25 to 180 billion US dollars annually to the US economy [11], and can lead to incorrect grid dispatch actions [12]. However, the EEFR can be degraded by packet losses caused by disruptions on multiple links of the primary forwarding path (PP) in the network, e.g., up to 14% in Brazil [13]. The disruptions are mainly caused by the following factors: (a) large-scale cascading failures on multiple power lines [14], where each ground line usually also carries a communication link [3], [15], [16], (b) natural disasters affecting links extending over a large geographical area [14], and (c) network congestion [17].
To reduce the EEFR, the current promising approach is hop-by-hop retransmission control for lost packets adopting in-path caching [18], [19], [20], [21], [22]. Recent studies have demonstrated that this type of control can outperform existing end-to-end transport protocols [20], [22]. In the control process, the intermediate hops of the PP cache received packets during forwarding, and a hop retransmits lost packets by finding cached copies at its upstream hop. However, existing designs have not considered exploiting redundant subpaths (RSPs) of the PP to improve the reliability of the PP's upstream links. Each RSP runs from the current hop to one of its upstream hops, both in the PP, and is link-disjoint from every link between the source and the current hop. Such RSPs are plentiful in the network (see Section II-B) due to its redundant topological structure of meshes and rings [23]. In addition, named data networking (NDN), a popular future network architecture, provides an efficient in-network caching capability [24]. Previous work has demonstrated that NDN can be used to support communications for synchrophasor measurement [25].
To address the aforementioned challenge, this paper makes the following major contributions. • We investigate a novel disruption resilient transport protocol (DRTP) that enables hop-by-hop retransmission over the RSPs, based on NDN. The RSPs are integrated into our routing model of the multipath subgraph (MPSG), which can be easily generated from the network topology of the grids (see Section II-B). Meanwhile, we identify a non-trivial distributed collaboration issue during retransmission under multiple link failures, which can result in cache mismatching that deteriorates the EEFR. To our knowledge, this issue has not been considered in the existing hop-by-hop approaches.
• We present a new solution to the above collaboration issue. Its primary idea is to appropriately select the retransmission opportunity, exploiting the fixed sending frequency, so as to avoid cache mismatching in a best-effort manner. It achieves a much reduced EEFR with a bounded end-to-end packet delivery time (EEDT) under recursively controlled collaboration.
• We comprehensively evaluate the performance of DRTP in the ndnSIM simulator [26] over both the typical MPSG and general MPSGs constructed from real transmission grid topologies. The numerical results show that DRTP has a significant advantage in reducing the EEFR compared to the existing approaches, with a low EEDT, under serious link disruptions.

The remainder of the paper is organized as follows. Section II details the preliminaries of DRTP and related work. Section III presents the design of DRTP and its collaboration issue. Section IV details the solution to the issue. Section V evaluates the performance of DRTP. Section VI gives concluding remarks and future work.

II. PRELIMINARIES AND RELATED WORK
This section details the models underlying DRTP and then discusses related work.

A. COMMUNICATION MODEL
The communication network of the transmission grid can be modeled as a directed graph (R, L) according to [6]. Here, each r_i ∈ R is the i-th router deployed at a critical substation, and each l_j ∈ L is the j-th link connecting two neighboring routers. The maximum delay of a link is the sum of the following components: (a) the maximum transmission delay, which is linearly related to its physical distance, (b) the maximum queueing delay of the link, which can be estimated from the maximum sizes of its input and output buffers using Little's law [27], and (c) the maximum total processing time of its two connected routers. Formally, the maximum total delay and the maximum queueing delay of a path p from r_a to r_b are denoted as τ^p_all(a, b) and τ^p_queue(a, b), respectively. For simplicity, each PMU and the PDC directly connect to their corresponding routers via an indoor link with complete reliability and negligible link delay [6]. In addition, each PMU and router in the network is assumed to function without failure for simplicity.
In the network, a PDC communicates with a PMU to receive synchrophasor measurement results in a periodic manner. The PMU is a device that produces data, including phasors, frequency, and the rate of frequency change, measured from transformers or buses in the grid. A PDC is a device that combines data from several PMUs. Fig. 1 depicts the communication pattern of the widely used IEEE C37.118.2 standard [7], which takes the following steps: (i) The PDC sends a Command message asking the PMU to feed back a CFG frame; (ii) after receiving the Command, the PMU replies with a Configuration message containing the capability of the PMU in reporting the synchrophasor measurements and the data frequency; (iii) after receiving the Configuration, the PDC sends a Command to turn on data transmission; (iv) after receiving this Command, the PMU sends continuous Data frames containing the sampled statistics to the PDC at a minimum frequency of f Hz, i.e., with a sending interval no greater than 1/f; and (v) according to operational needs, the PDC can send a Command to turn off the data transmission. Furthermore, in step iv, the sequence of packets can be modeled as (ξ_1, ξ_2, ..., ξ_δ, ξ_{δ+1}, ...), where δ is an incremental data identifier (ID). For instance, the ID can be counted and tagged by the gateway of the PMU.
In the network, the maximum extent of disruption to a link is measured by the link loss rate (LLR), i.e., the ratio of the total number of packets lost on the link to the number of packets sent by the source router of the link. Meanwhile, the EEFR is the average packet loss rate at the PDC per second. Thus, our aim in this paper is to reduce the EEFR during the transport of the PMU data packets under given LLRs on all related links in the network.
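As a concrete reading of these two definitions, the following minimal sketch computes them directly; the function names are ours, not DRTP's:

```python
def link_loss_rate(lost_on_link: int, sent_by_source: int) -> float:
    """LLR: packets lost on a link divided by all packets its source router sent."""
    return lost_on_link / sent_by_source

def end_to_end_failure_rate(loss_rates_per_second: list[float]) -> float:
    """EEFR: the average per-second packet loss rate observed at the PDC."""
    return sum(loss_rates_per_second) / len(loss_rates_per_second)
```

For example, a link that drops 7 of 50 packets has an LLR of 0.14, the worst-case figure cited for Brazil in Section I.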

B. ROUTING ON MULTI-PATH SUB-GRAPH
1) ROUTING MODEL
The MPSG consists of a PP from a PMU to the PDC and its RSPs, formulated as follows. The PP from r_{d_1} to r_{d_b} is a sequence of router IDs, i.e., p_prim = (d_1, ..., d_s, ..., d_k, ..., d_b). For each r_{d_k} of the PP, where k ≥ 2, a set of RSPs at r_{d_k} is established that connects it to one or multiple different upstream routers ({r_{d_s}}) of the PP. Each RSP is also a sequence of router IDs, i.e., p_rsp = (d_k, ..., d_s), and each link of each RSP is link-disjoint from the portion of the PP from r_{d_1} to r_{d_k}. The disjointness makes packet losses on the upstream links of the PP independent of those on the RSP links, which increases the reliability of the retransmission control. For example, Fig. 2 shows an ordinary MPSG manually constructed from the IEEE 300-bus topology, which models a real-world transmission grid with 300 nodes and 409 links [28]. In the MPSG, the PP runs from node #138 to node #100, where the two nodes directly connect to a PMU and the PDC, respectively, and each node of the PP connects to its upstream nodes through multiple different RSPs. Say, node #104 connects to nodes #136 and #138 via the RSPs (104, 135, 137, 136) and (104, 103, 101, 102, 138), respectively. Meanwhile, the PP is (138, 136, 104, 103, 100).
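The Fig. 2 example can be written down as a small data structure, which also makes the endpoint constraint on RSPs checkable. This is a sketch; the representation and helper function are ours:

```python
# The MPSG of Fig. 2: the PP and, for node #104, its two RSPs back to
# upstream PP nodes #136 and #138. Paths are sequences of router IDs,
# PMU-side node first.
mpsg = {
    "pp": (138, 136, 104, 103, 100),
    "rsps": {
        104: [(104, 135, 137, 136),
              (104, 103, 101, 102, 138)],
    },
}

def rsp_endpoints_valid(pp: tuple, rsp: tuple) -> bool:
    """An RSP must start at a PP router and end at one of its upstream PP
    routers (upstream = closer to the PMU-side head of the PP)."""
    start, end = rsp[0], rsp[-1]
    return start in pp and end in pp and pp.index(end) < pp.index(start)
```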

2) MPSG GENERATION
The MPSG can be easily generated from the above communication network. For example, Alg. 1 presents an algorithm that generates the MPSGs from the current router r_i to the rest of the network (GenMPSG), where, for each MPSG, each router in its PP contains at most β RSPs. GenMPSG is designed on top of the classical Dijkstra's algorithm using a backtracking technique: it first computes a PP and then generates each RSP from an intermediate node to one of its upstream nodes, all in the PP, in the following steps. In line 2, it computes a vector H recording the previous-router IDs of the PP using the Dijkstra algorithm. In line 3, it initializes γ_all, an empty map from each destination node (with respect to r_i) to its corresponding MPSG. In lines 5 to 23, it computes an MPSG with source r_i for each destination r_j. In detail, in line 5, it computes the PP of the MPSG from r_i to r_j according to H. In line 6, it initializes an empty map from each router in the PP to that router's set of RSPs. In line 7, it extracts all links of the PP, in both directions, as L_PP. In lines 9 to 21, it computes at most β RSPs for each router p_prim[g] in the PP. It computes each RSP as follows. In line 9, it initializes a set P containing all upstream routers of p_prim[g] in the PP, and creates a set L_RSP to contain all links of all RSPs starting from p_prim[g]. Afterwards, in lines 11 to 20, it computes an RSP from p_prim[g] to each upstream router in the PP via the following steps: (i) In lines 11 to 13, it creates a virtual graph G' = (R', L'). Here, R' is R augmented with a virtual router r_vir, which connects to each upstream router in the PP through a virtual link with zero total delay in L_vir, and L' is L minus every link contained in L_PP and L_RSP, plus the links in L_vir. The computational complexity of GenMPSG is analyzed as follows.
First, we analyze the execution time complexity. Lines 2 and 14 are in O(|L| + |R|·log|R|) for a Dijkstra algorithm implemented with a Fibonacci heap. Lines 3, 5, 7, 11-13, and 15-19 are all in O(log|R|). Lines 6, 9, 20, 23, and 25 are all in O(1). Hence, the total time complexity is O(|L| + |R|·log|R| + |R|·(log|R| + 1 + log|R|·(1 + β·(|L| + |R|·log|R| + log|R| + 1))) + 1), which simplifies to O(|R|·log|R| + β·|R|^2·log^2|R| + β·|R|·log^2|R|), since |L| = |R|·log|R| in a power transmission grid according to [29]. Second, we analyze the space complexity. Lines 2 and 14 are in O(|R|). The memory holding all MPSGs required by lines 5, 6, 15, 20, and 23 is in O(|R|·log|R|·β). Line 7 is in O(log|R|). The memory allocated for the virtual graph in lines 11 to 13 is in O(log|R|), and is de-allocated in line 16. The memory for L_RSP used in lines 9, 18, and 19 is in O(β·log|R|). Hence, the overall space complexity is O(|R| + β·|R|·log|R| + (β + 1)·log|R|).
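Since the listing of Alg. 1 did not survive extraction intact, the following Python sketch reimplements the same idea for a single destination, under stated simplifications: it excludes every PP link from the RSP search (the paper requires disjointness only from the upstream portion of the PP), and it models the virtual-router trick literally. All names are ours.

```python
import heapq
from itertools import count

def dijkstra(adj, src):
    """Shortest-path distances and predecessors over {node: {neighbor: delay}}."""
    dist, prev, tie = {src: 0.0}, {}, count()
    heap = [(0.0, next(tie), src)]
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, next(tie), v))
    return dist, prev

def path_from(prev, src, dst):
    """Rebuild the src-to-dst path from the predecessor map."""
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return path[::-1]

def gen_mpsg(adj, src, dst, beta=2):
    """One-destination GenMPSG sketch: a shortest PP, then up to beta RSPs
    per PP router, each found by re-running Dijkstra toward a virtual sink
    attached (at zero cost) to every upstream PP router."""
    _, prev = dijkstra(adj, src)
    pp = path_from(prev, src, dst)
    pp_links = {frozenset(e) for e in zip(pp, pp[1:])}
    rsps = {}
    for g in range(1, len(pp)):
        cur, upstream = pp[g], set(pp[:g])
        used, found = set(), []          # links claimed by this hop's earlier RSPs
        for _ in range(beta):
            blocked = pp_links | used
            vadj = {u: {v: w for v, w in nbrs.items()
                        if frozenset((u, v)) not in blocked}
                    for u, nbrs in adj.items()}
            vadj["VIRT"] = {}
            for u in upstream:
                vadj[u]["VIRT"] = 0.0    # zero-delay virtual link
            d2, p2 = dijkstra(vadj, cur)
            if "VIRT" not in d2:
                break                    # no further disjoint RSP exists
            rsp = tuple(path_from(p2, cur, "VIRT")[:-1])  # drop virtual sink
            found.append(rsp)
            used |= {frozenset(e) for e in zip(rsp, rsp[1:])}
        rsps[cur] = found
    return tuple(pp), rsps
```

On a small ring-augmented topology, the sketch returns the PP plus one disjoint RSP per intermediate hop, mirroring the backtracking structure described above.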

C. RELATED WORK
The existing hop-by-hop retransmission control approaches can be classified into two categories. (i) Sender-side designs, where each hop retransmits forwarded packets left unacknowledged by its downstream neighbor hop. K. Su proposed a transport protocol named MFTP [18] to reduce the EEFR. In MFTP, the downstream hop acknowledges the current hop after receiving a control message (called CSYN) corresponding to each transmitted chunk, where the message follows the received packets. Later, to improve on MFTP, Z. Wang developed a rapid and reliable transport mechanism (R^2T) [19], in which a direct acknowledgment is triggered when a Data packet is received, avoiding the CSYN sending. However, these sender-side designs cannot utilize the RSPs, because the downstream hops cannot cache the lost packets, and so they cannot deal with the full disruption of any in-path link.
(ii) Receiver-side designs, where each hop requests its upstream hops to feed back cached packets. J. Chen proposed the adaptive transmission protocol (SDATP) [20], [21]. In SDATP, each in-path switch retransmits the lost packets detected according to packet disordering and inter-arrival times from its upstream neighbor. Meanwhile, J. Garcia-Luna-Aceves proposed the Internet transport protocol (ITP) based on NDN, in which an in-path router retransmits a lost Data by resending its Interest [22]. However, these designs do not utilize the RSPs in retransmission, nor do they address the arrival-timeout computation needed to deal with the distributed collaboration issue, which increases the EEFR.
Hence, neither the sender-side nor the receiver-side designs have considered reliable hop-by-hop retransmission that exploits RSPs. Meanwhile, the receiver-side designs have not addressed their distributed collaboration issue (see Section III-D). Both shortcomings can deteriorate their EEFRs under serious link disruptions (see Section V-C).
In addition, the hop-by-hop retransmission control design can be eased by the popular NDN architecture [24], [26]. In NDN, a consumer receives a Data packet with a prefix name stored at the producer by sending an Interest packet with that prefix along a path. Meanwhile, each in-path router can efficiently cache received Data in its local content store (CS). The CS has a limited size configured by the operator, where a sufficiently large size can reduce cache mismatching. The interactions between routers can be flexibly redefined through their forwarding strategies.

III. PROTOCOL DESIGN AND ITS ISSUE
In this section, we present the overall design and collaboration of DRTP and then identify the issue arising during its retransmission.

A. OVERALL DESIGN
The proposed DRTP is designed as a forwarding strategy for the NDN router [24] based on the aforementioned MPSG routing for the transport from a PMU to the PDC. Concretely, we elaborate the design in the following three aspects:

1) MESSAGE DESIGN
DRTP messages are designed entirely in NDN to carry the packets of the IEEE C37.118.2 communication process (see Section II-A). Figs. 5(a) and 5(b) present the formats of the Interest and Data packets in DRTP, respectively. At the low level, each message is a collection of type-length-value (TLV) fields encoded in a variable-size format. Each TLV consists of the following fields: (a) the type in 1 octet, (b) the length in 1 octet, and (c) the value, whose size is indicated by the length field. Some TLVs may contain sub-TLVs, and each sub-TLV may be further nested [31]. The entire message is rooted at a TLV named LpPacket, per link adaptation layer version 2 [32]. The LpPacket consists of a reserved tag in 3 octets, our customized tag in 8 octets, and the fragment carrying the Interest or Data. The customized tag carries two pieces of information: (a) an MPSG ID to differentiate the communication pair, and (b) a path ID identifying the PP or RSP to follow when forwarding on the MPSG.
At the high level, the Interest contains a name of variable length, a CanBePrefix flag indicating that the name is a prefix, a nonce, and an Interest lifetime. Meanwhile, the Data contains a name, meta information for the content type, a content field of variable length holding the payload, and a signature for the Data. The signature is generated by the producer and verified by the consumer. Here, the maximum length of the content is the MTU minus the lengths of the name and link-layer fields as well as the message overhead of 312 bytes. In comparison, the length of the Data frame in IEEE C37.118.2 is determined by the total number of signals, including phasors, analog values, and digital status words, of each monitored PMU. Therefore, such encapsulation in NDN has the drawback of increased message overhead, which can reduce the number of signals carried, in comparison to IEEE C37.118.2. Furthermore, we extend the Data packet into the following messages by exploiting their names: • The Capsule carries the packet in its payload with a data ID designated by the producer; • The Request contains a retransmission request for the data IDs of the detected lost Capsules; • The Retran carries a retransmitted payload marked with a data ID, where an empty payload indicates a mismatch during a retransmission; • The Report indicates the data IDs of the detected lost Capsules that have already been requested for retransmission by an upstream router in the PP.
• The Control mainly carries the IEEE C37.118.2 messages in its payload, namely (a) the Command requesting the CFG frame or turning data transmission off, and (b) the Configuration carrying the CFG. At the same time, the Interest packet is exploited to carry the IEEE C37.118.2 Command that turns on the data transmission. For convenience, a message with data ID δ is termed the δ-th one throughout the paper.
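To make the TLV nesting concrete, here is a toy encoder following the simplified layout described above (1-octet type, 1-octet length). Real NDN TLV also allows multi-octet types and lengths, and the type numbers below are illustrative, not the real NDN assignments:

```python
def tlv_encode(tlv_type: int, value: bytes) -> bytes:
    """One TLV field: 1-octet type, 1-octet length, then the value."""
    assert 0 <= tlv_type <= 0xFF and len(value) <= 0xFF
    return bytes([tlv_type, len(value)]) + value

# Nesting: an LpPacket-like outer TLV whose value holds two inner tag TLVs,
# standing in for the MPSG ID and path ID of the customized tag.
inner = tlv_encode(0x62, b"mpsg-7") + tlv_encode(0x63, b"rsp-2")
outer = tlv_encode(0x64, inner)
```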

2) CACHING POLICY
We modify the NDN forwarding daemon [24] so that a router caches only Capsule messages whose path ID tag refers to the PP. In other words, the router does not cache Capsules carrying an RSP path ID, nor Request, Retran, Report, or Control messages, since caching them is unnecessary; this makes cache utilization more efficient.
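The policy reduces to a one-line predicate per received message (a sketch; the message-type strings are ours):

```python
def should_cache(msg_type: str, path_is_pp: bool) -> bool:
    """DRTP caching policy: cache only Capsules forwarded along the PP."""
    return msg_type == "Capsule" and path_is_pp
```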

3) COMMUNICATION PATTERN
Using the above messages and caching policy, Fig. 6 shows the DRTP communication pattern, which takes the following steps: (i) The PDC sends a Control to a PMU to request a CFG frame; (ii) the PMU feeds back the CFG via a Control; (iii) the PDC sends an Interest to the PMU to turn on data transmission; (iv) after receiving the Interest, the PMU consecutively sends the synchrophasor results to the PDC at a minimum frequency of f; and (v) the PDC sends a Control to turn off the data transmission. In DRTP, the PDC needs to send only one Interest packet during the data transmission from the PMU to the PDC, which reduces the Interest overhead compared to the existing work [25]. Steps iii to iv are termed the packet delivery process, during which DRTP performs the hop-by-hop retransmission control process to ensure reliability under link disruptions. Next, we detail the two processes and then identify a new distributed collaboration issue arising during the retransmission control process.

B. PACKET DELIVERY PROCESS
Concretely, Fig. 7 shows the packet delivery process, which takes the following steps. First, the consumer sends an Interest to the producer along the PP. Then, after receiving the Interest, the producer delivers a sequence of Capsules, with automatically incremented data IDs, to the consumer along the PP (see the black arrows). Each Capsule encapsulates a sampled packet sent by the PMU. Meanwhile, after receiving a Capsule, each router of the PP first checks whether it has already forwarded that Capsule. If so, the router drops the Capsule; otherwise, it caches the Capsule in its local CS and forwards it to the next hop along the PP. Lastly, the consumer efficiently reorders all received Capsules in sequence according to their data IDs using a priority queue. The queue has a limited size that can be determined according to the maximum span of the out-of-order Capsules received by the consumer. In detail, the consumer de-queues each Capsule in the following manner. If the Capsule's data ID is consecutive to that of the previously received one, the Capsule is directly de-queued. Otherwise, the consumer first waits up to a maximum time for the missing Capsules to arrive and then de-queues them. The consumer then decapsulates each de-queued Capsule into a packet and forwards the packet to the PDC. Fig. 8 presents the steps of the retransmission control carried out by each router of the PP (e.g., the green circle), which are detailed next.
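The consumer-side reorder queue can be sketched as follows. For brevity the maximum-wait timer is omitted, so a Capsule is released only once its ID becomes consecutive; we assume data IDs start at 1 and that retransmission eventually fills every gap:

```python
import heapq

def reorder(arrivals, next_id=1):
    """Release Capsules (by data ID) in order: a min-heap holds out-of-order
    arrivals until the missing IDs show up."""
    heap, delivered = [], []
    for data_id in arrivals:
        heapq.heappush(heap, data_id)
        while heap and heap[0] == next_id:
            delivered.append(heapq.heappop(heap))
            next_id += 1
    return delivered
```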

C. RETRANSMISSION CONTROL PROCESS
(i) The router detects the data IDs of lost Capsules that satisfy one of the following criteria: (a) non-consecutive to those of the received ones; or (b) expected to arrive but exceeding the arrival timeout of the router. If the data IDs are contained in a Report previously received from an upstream router of the PP, the following steps do not proceed, because those Capsules are already being retransmitted by an upstream router. (ii) The router sends the data IDs of the lost Capsules to all its downstream routers in the PP via a Report (see the purple arrows). The Report makes each downstream router skip its own retransmission for those data IDs in its step i; thus, only the first router to detect a lost Capsule retransmits it. In addition, the same Report is sent ζ times to ensure the synchrony of the retransmission state.
(iii) The router sends identical Requests with the IDs of the lost Capsules to all its upstream routers in the PP (see the orange arrows). The Requests are sent along both the upstream link of the PP and all RSPs of the router.
(iv) When receiving a Request, each upstream router first looks in its CS for a copy matching each requested data ID. If the copy is found, the upstream router sends it as a Retran to the router along the reverse of the incoming path (see the green arrows). Otherwise, the upstream router sends an empty Retran to indicate a mismatch during the current retransmission.
(v) The router decapsulates each received non-empty Retran into a Capsule. If the Capsule has not yet been forwarded, the router forwards it to its next hop in the PP (see the related black arrows); otherwise, the Capsule is dropped to prevent duplicated forwarding. If the Retran does not arrive before a timeout, steps iii and iv are repeated at most η times until success. The timeout is the maximum round-trip time over the upstream link of the PP and all RSPs of the router.
Overall, steps iii and iv both utilize the RSPs to enhance upstream-link reliability during retransmission. In addition, we estimate the number of messages sent and received by each router of the PP in delivering a Capsule (MSR) as follows. Steps ii, iii, iv, and v require O(ζ), O(η), O(η), and O(1) messages, respectively. Hence, the MSR complexity is O(2·η + ζ + 1).
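Steps ii to v can be condensed into the following control loop. It is a sketch: DRTP sends the Requests over the upstream PP link and all RSPs simultaneously, whereas this version probes the paths one by one, and `fetch` is a stand-in for the network that returns a cached copy or None (an empty Retran or timeout):

```python
def retransmit(lost_id, upstream_paths, fetch, eta=3, zeta=2):
    """Hop-by-hop retransmission sketch: broadcast the Report zeta times
    (step ii), then retry the Request over every upstream path for at most
    eta rounds (steps iii-v); returns (payload or None, attempts, reports)."""
    reports_sent = zeta                       # step ii: zeta identical Reports
    for attempt in range(1, eta + 1):         # step v: at most eta retries
        for path in upstream_paths:           # step iii: PP upstream link + RSPs
            payload = fetch(path, lost_id)    # step iv: upstream CS lookup
            if payload is not None:
                return payload, attempt, reports_sent
    return None, eta, reports_sent
```

Per the MSR estimate above, the loop issues O(η) Requests, receives O(η) Retrans, and emits O(ζ) Reports, matching O(2·η + ζ + 1).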

D. DISTRIBUTED COLLABORATION ISSUE
Recall that step i above needs to detect a Capsule in the arrival-timeout state, which is non-trivial: the timeout must be computed so as to address the distributed collaboration issue when multiple links in the PP have failed. The issue can increase the EEFR through possible cache mismatching during retransmission. It is illustrated by the simple example in Fig. 9, where r_{d_1} sends a Capsule to r_{d_b} along a PP from r_{d_1} to r_{d_b}. At first, r_{d_{k-1}} detects a lost Capsule (see the red cross) by waiting for its own arrival timeout for the Capsule. Then, r_{d_{k-1}} sends a corresponding Request for the Capsule to its upstream router r_{d_{k-2}}. After receiving the Request, r_{d_{k-2}} retransmits a matched Capsule to r_{d_{k-1}}. However, the retransmitted Capsule is lost on the link (r_{d_{k-1}}, r_{d_k}) (see the other red cross). Next, r_{d_k} also waits for a timeout, which can be too short, and then sends a Request to r_{d_{k-1}}. When this Request arrives, r_{d_{k-1}} has not yet received the corresponding Retran and hence cannot find a copy in its local CS, i.e., cache mismatching, so no copy is sent to r_{d_k}. Finally, r_{d_b} cannot receive the retransmitted copy, which degrades the EEFR. Furthermore, the example extends to the general case where a Capsule is lost on multiple links of the PP, by combining any two such links that are adjacent to each other, e.g., (r_{d_{k-2}}, r_{d_{k-1}}) and (r_{d_{k-1}}, r_{d_k}).
Therefore, an obvious solution to the above distributed collaboration issue is to find a suitable arrival timeout at each r_{d_k} of the PP. With such a timeout, the Request is sent by r_{d_k} to r_{d_{k-1}} only after the requested payload, possibly carried by a Capsule or a nonempty Retran, has arrived at r_{d_{k-1}}. In other words, each r_{d_k} must wait for its upstream r_{d_{k-1}} to complete its retransmission in a recursive manner, which we term the recursiveness. Such a Request then causes no cache mismatch at r_{d_{k-1}}, which reduces the EEFR.

IV. COMPUTATION ON ARRIVAL TIMEOUTS
To address the distributed collaboration issue of DRTP described in Section III-D, we make two assumptions about the retransmission control process: (i) each router of the PP has successfully received at least one preceding Capsule; otherwise, the router stops its successive retransmissions, since it has no reference time to trigger the arrival timeout; and (ii) at least one router in the PP can successfully retransmit the lost Capsule within η attempts. Based on these assumptions, we identify and then compute the corresponding arrival timeouts under the different retransmission situations as follows.

A. IDENTIFYING ARRIVAL TIMEOUTS
We analyze the expected Capsule and its arrival timeout in step i of the retransmission control process (Section III-C) as follows. The retransmission of the δ-th lost Capsule ends in either a success or a failure situation. In the success situation, the router next expects the δ_next-th Capsule, i.e., the nearest not-yet-retransmitted Capsule after the previously received one that is not contained in any previously received Report; this avoids duplicate retransmissions. Otherwise, the router expects the δ-th Capsule again, lost due to a failed retransmission.
Furthermore, the arrival timeouts under the two situations, with respect to the most recently received δ-th DRTP message, are specified as follows: (a) the maximum Capsule inter-arrival time (CIAT) for the δ_next-th Capsule, and (b) the maximum Capsule delivery time (CDT) for the new δ-th Capsule re-sent by an upstream router. The received DRTP message above can be identified by exploiting its receiving exclusiveness during retransmission: each router of the PP can receive at most one of the Capsule, non-empty Retran, and empty Retran with data ID δ, and at most one of the Report, non-empty Retran, and empty Retran containing data ID δ. This is because, when the δ-th Capsule or the δ-th Report is received in step i, no Requests are sent in step iii, and hence no nonempty or empty Retran will be received. With this exclusiveness, Table 1 summarizes the correspondence between the received messages and the CIAT or CDT to use under the different conditions.
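Table 1 is not reproduced here, but the dispatch it encodes can be sketched from the text: success-situation messages select a CIAT, failure-situation messages a CDT. The fourth row (empty Retran → CDT) is our inference from the receiving-exclusiveness discussion, since only conditions #1 to #3 are spelled out in this section:

```python
def arrival_timeout_kind(received: str) -> str:
    """Which arrival timeout governs the next expected Capsule, given the most
    recently received delta-th DRTP message (sketch of the Table 1 dispatch)."""
    kinds = {
        "Capsule":        "CIAT#1",  # success: expect the delta_next-th Capsule
        "NonemptyRetran": "CIAT#2",  # success: this router retransmitted
        "Report":         "CDT#3",   # failure: an upstream retransmission was lost
        "EmptyRetran":    "CDT#4",   # failure: upstream cache mismatch (inferred)
    }
    return kinds[received]
```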

B. CIATs FOR SUCCESS SITUATION
Specifically, in Table 1, the two conditions for computing the CIATs under the success situation are elaborated as follows.

1) CONDITION #1
This happens when the δ-th Capsule is received, with or without the δ-th Report. Fig. 10 depicts the condition for routers r_{d_k}, r_{d_{k+1}}, and r_{d_{k+2}}, which must keep the collaboration recursiveness as follows. Suppose that the δ_next-th Capsule is lost on the link from r_{d_{k-1}} to r_{d_k}. Then r_{d_k} detects the loss after CIAT_1 and issues the Request up to η times. Meanwhile, the CIAT_1 of its downstream router r_{d_{k+1}} is set large enough to ensure that its Request arrives at r_{d_k} no earlier than the δ_next-th Retran is fed back to r_{d_k}. Similarly, the CIAT_1 of r_{d_{k+2}} must be set large enough with respect to r_{d_{k+1}}. The CIAT_1 for each router r_{d_k} in the PP is the time difference between the expected Capsule ξ_next and the previously received Capsule ξ, denoted as t_k(ξ_next) − t_k(ξ). The difference satisfies the inequality given in Eq. (1), as proven in Th. 1.
Theorem 1: The time difference t_k(ξ_next) − t_k(ξ) satisfies the inequality given in Eq. (1), where θ[k] is given in Eq. (2) and t_k(ξ) is the arrival time of ξ at r_{d_k}.
Proof: Eq. (3) states that t_k(ξ_next) is no greater than the sum of the following terms: (a) t_{k-1}(ξ_next), (b) the maximum time to forward ξ_next from r_{d_{k-1}} to r_{d_k} over the PP, denoted τ^PP_all(d_{k-1}, d_k), and (c) θ[k], computed through Eq. (2) as η times the request timeout at r_{d_k}, where the request timeout is the sum of all delays of the upstream link of the PP and its RSPs. Meanwhile, Eq. (4) states that t_k(ξ) is no less than the sum of (a) t_{k-1}(ξ) and (b) the total delay from r_{d_{k-1}} to r_{d_k} over the PP minus its queueing delay. Thus, by subtracting Eq. (4) from Eq. (3) on both sides, we obtain the CIAT bound between ξ and ξ_next at r_{d_k} in Eq. (1). This completes the proof.
According to Th. 1, CIAT_1 is computed according to Eq. (5). This is because the CIAT at the first router is exactly the PMU traffic period multiplied by the difference between δ_next and δ, i.e., t_1(δ_next) − t_1(δ) = (δ_next − δ)/f. In addition, no retransmission can be performed between the successfully delivered δ-th Capsule and the not-yet-arrived δ_next-th Capsule, which yields θ[m] = 0 for such hops. Furthermore, by iteratively extending t_{k-1}(ξ) down to t_1(ξ) on the right-hand side of Eq. (1), CIAT_1 is obtained as in Eq. (5). Thus, the recursiveness is maintained.
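Eqs. (1) to (5) appeared as display images in the original and did not survive extraction. A reconstruction consistent with the proof of Th. 1 and the discussion above would read as follows; this is our inference from the prose, not the original typesetting, and in particular the exact form of the request timeout in Eq. (2) is only a plausible reading:

```latex
% t_k(\xi): arrival time of Capsule \xi at router r_{d_k}; f: PMU data frequency.
\begin{align}
t_k(\xi_{\mathrm{next}}) - t_k(\xi)
  &\le t_{k-1}(\xi_{\mathrm{next}}) - t_{k-1}(\xi)
     + \theta[k] + \tau^{\mathrm{PP}}_{\mathrm{queue}}(d_{k-1}, d_k) \tag{1} \\
\theta[k] &= \eta \cdot \Big( \tau^{\mathrm{PP}}_{\mathrm{all}}(d_{k-1}, d_k)
     + \textstyle\sum_{p \in \mathrm{RSP}(d_k)} \tau^{p}_{\mathrm{all}} \Big) \tag{2} \\
t_k(\xi_{\mathrm{next}}) &\le t_{k-1}(\xi_{\mathrm{next}})
     + \tau^{\mathrm{PP}}_{\mathrm{all}}(d_{k-1}, d_k) + \theta[k] \tag{3} \\
t_k(\xi) &\ge t_{k-1}(\xi) + \tau^{\mathrm{PP}}_{\mathrm{all}}(d_{k-1}, d_k)
     - \tau^{\mathrm{PP}}_{\mathrm{queue}}(d_{k-1}, d_k) \tag{4} \\
\mathrm{CIAT}_1[k] &= \frac{\delta_{\mathrm{next}} - \delta}{f}
     + \sum_{m=2}^{k} \Big( \theta[m]
     + \tau^{\mathrm{PP}}_{\mathrm{queue}}(d_{m-1}, d_m) \Big) \tag{5}
\end{align}
```

Note that subtracting Eq. (4) from Eq. (3) yields Eq. (1) term by term, as the proof requires.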

2) CONDITION #2
It happens when a router receives only the nonempty Retran, meaning that the retransmission is performed only by that router. The condition is depicted in Fig. 11. Suppose that each router in the PP has received the (δ−1)-th Capsule, and the δ-th to (δ+2)-th Capsules are all lost on the link (r_{d_{k−1}}, r_{d_k}). Thus, after receiving the nonempty Retran, r_{d_k} detects the lost Capsule according to the CIAT computed through Eq. (6). Here, drift for the δ-th Capsule at r_{d_{k−1}} is the time difference between the Capsule arriving from r_{d_{k−2}} and the δ-th Request arriving from r_{d_k}. Note that drift is piggybacked to r_{d_k} through the δ-th Retran. In addition, retran for the δ-th Capsule at r_{d_k} is the time difference between the sending of the δ-th Request and the receiving of the δ-th Retran. Because the time between the δ-th and (δ+1)-th Capsules is exactly CIAT_1 under the recursive collaboration process, CIAT_2 is CIAT_1 minus drift, retran, and one round of the transmission delay between r_{d_k} and r_{d_{k−1}}, as given in Eq. (6). This subtraction minimizes the unnecessary waiting time for the arrival of the retransmission of each lost Capsule when there are consecutive lost Capsules. Hence, the above process satisfies the required recursiveness. As listed in Table 1, the other two conditions, which compute the CDTs under the failure situation, are detailed as follows.
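Before turning to the failure conditions, the Eq. (6) computation above can be sketched as below. This is a minimal sketch under one assumption of ours: "one round of the transmission delay" is read as a round trip on the upstream PP link.

```python
def ciat2(ciat_1, drift, retran, tau_upstream_one_way):
    """Eq. (6) sketch: CIAT_1 minus the piggybacked drift, the local
    Request/Retran turnaround 'retran', and one round trip on the
    upstream PP link, so consecutive losses are detected without
    waiting a full CIAT_1 each time."""
    return ciat_1 - drift - retran - 2.0 * tau_upstream_one_way
```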

1) CONDITION #3
This happens when the router receives only the δ-th Report, which means that all corresponding retransmissions have been performed by its upstream routers. This is depicted in Fig. 12 for the routers r_{d_k} and r_{d_{k+1}}. The two routers have lost the δ-th Capsules that were retransmitted from r_{d_{k−1}} and r_{d_k}, respectively, which means that the retransmissions fail. Hence, the two routers decide on the lost δ-th Capsule according to the corresponding CDTs relative to their δ-th Reports, which are sent from r_{d_{k−1}} and r_{d_k}, respectively. Thus, based on Th. 1, the CDT from an upstream router r_{d_s} to r_{d_k} can be computed according to Lem. 1, and the CDT satisfies the recursiveness.

Lemma 1: The CDT for retransmitting the ξ-th lost Capsule from an upstream router r_{d_s} at the current router r_{d_k} of the PP can be computed through Eq. (7).
Proof: According to Eq. (1), we can obtain Eq. (8). In detail, we first extend the right-hand side in an iterative way from t_{k−1} to t_s. Then, under the collaboration recursiveness, the maximum time between the latest Capsule and the next one at r_{d_s} cannot exceed the maximum retransmission time at r_{d_s}, that is, t_s(ξ_next) − t_s(ξ) ≤ θ[s]. Using this inequality, the right-hand side is no greater than τ^PP_queue(d_s, d_k) + Σ_{s≤c≤k} θ[c]. Furthermore, the CDT for ξ_next can be considered as t_k(ξ_next) − t_k(ξ) − θ[k], where θ[k] is subtracted because the time required to perform the retransmission at r_{d_k} needs to be excluded. Hence, the CDT is computed in Eq. (7) by setting t_k(ξ) = 0 and replacing ξ_next with ξ in Eq. (8).
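The Eq. (7) computation in the proof can be sketched as follows; τ^PP_queue is supplied as a function and θ as a mapping, both hypothetical stand-ins for the paper's quantities.

```python
def cdt(k, s, tau_pp_queue, thetas):
    """Eq. (7) sketch: decision time at r_dk for the Capsule retransmitted
    from r_ds, i.e., the accumulated queueing delay from d_s to d_k plus
    the theta budgets of routers s..k, excluding the local theta[k]
    (the retransmission time at r_dk itself)."""
    return tau_pp_queue(s, k) + sum(thetas[c] for c in range(s, k + 1)) - thetas[k]
```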

2) CONDITION #4
It happens when the router has only the empty δ-th Retran. This means that the δ-th Request is sent too early, causing a cache mismatch due to the Report loss. The condition is depicted in Fig. 13 for the routers r_{d_k}, r_{d_{k+1}} and r_{d_{k+2}}, each of which detects the lost δ-th Capsule according to its CDT. The CDT can be computed as CDT_k(d_1) according to Eq. (7), because the router is unaware of which upstream router has retransmitted the lost Capsule. Hence, the upstream router is assumed to be r_{d_1} to avoid a premature request, which maintains the recursiveness. Meanwhile, this CDT also causes a longer retransmission time. Hence, the condition should be avoided as much as possible by transmitting the duplicated Report ζ times.

D. DISCUSSIONS ON DISTRIBUTED COLLABORATIONS
With the assumptions in Section IV, the DRTP can address the distributed collaboration issue, because the CIAT and CDT computations avoid cache mismatching under all possible conditions listed in Table 1. Under these conditions, the DRTP has the following performance properties: (i) Much reduced EEFR, since cache mismatching is avoided. Furthermore, the EEFR can be 0% when the last router in the PP that encounters a lost Capsule can successfully retransmit the Capsule within η attempts. An example is when there is at least one completely reliable RSP for each router of the PP and the Report message is successfully delivered at each hop of the PP. (ii) Bounded delivery time. The maximum end-to-end delivery time for each Capsule (EEDT) satisfies Eq. (9) in Lem. 2 when there is no cache mismatch. Therefore, the DRTP makes the best effort to retransmit the lost Capsules.

Lemma 2: The EEDT for r_{d_b} satisfies Eq. (9) under the recursiveness of the DRTP when there is no cache mismatch.
Proof: According to Th. 1, we can obtain Eq. (10) by replacing ξ_next with ξ and ξ with its predecessor. Furthermore, by adding t_k(ξ) − t_{k−1}(ξ) on both sides of Eq. (10), we can obtain Eq. (11). We can write Eq. (11) iteratively from t_k down to t_2. Next, by adding the two sides of these equations independently and letting k be b, we can obtain Eq. (12). Thus, to prove Eq. (9), we consider the prerequisite that there is no cache mismatch under the collaboration recursiveness. The prerequisite yields that either the success situation or the condition #3 must happen. Specifically, we need to verify that t_b(ξ) − t_1(ξ), i.e., the EEDT of ξ for r_{d_b} (termed EEDT-P), on the right side of Eq. (12), is not greater than the total path delay of the PP, i.e., τ^PP_all(d_1, d_b). For the success situation, the EEDT of ξ is analyzed according to whether ξ has been retransmitted by any upstream router, as follows: (a) If ξ is delivered without retransmission, EEDT-P is no greater than the total path delay of the PP. (b) Otherwise, i.e., ξ is a retransmitted Capsule, the nonempty Retran for ξ must have arrived from an upstream router, assumed to be r_{d_s}, which is under the condition #2 with CIAT_2. Thus, according to Eq. (6), CIAT_2 at an upstream router of r_{d_s} is computed with the subtracted part, which is exactly the time difference between the original arrival of the lost ξ-th Capsule at r_{d_k} and the actual arrival of its retransmitted copy. Therefore, EEDT-P is reduced by the total of these subtracted parts, which contains the time consumed in the retransmissions carried out by all upstream routers of r_{d_s}. As a result, EEDT-P is no greater than the total path delay of the PP. Hence, in the success situation, the EEDT satisfies Eq. (9).
Meanwhile, under the condition #3 of the failure situation, the Capsule delivery time from r_{d_1} to r_{d_s} and the Report sending time from r_{d_s} to r_{d_k} are no more than τ^PP_all(d_1, d_s) and τ^PP_all(d_s, d_k), respectively. Hence, the EEDT is the sum of these two times and the right side of Eq. (7), which is no more than Eq. (9). Therefore, the EEDT has the upper bound given by Eq. (9) for both the success situation and the condition #3, and thus the proof is done.
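The 0% EEFR property noted earlier in this subsection can be illustrated with a back-of-the-envelope loss model. This is our own independence assumption, not the paper's analysis: each hop loses a Capsule independently and recovers it unless all η Request/Retran rounds fail.

```python
def hop_recovery_prob(p_loss_rsp, eta):
    """Probability that a lost Capsule is recovered at one hop within eta
    Request/Retran rounds over an RSP whose per-round loss rate is
    p_loss_rsp (illustrative independence assumption)."""
    return 1.0 - p_loss_rsp ** eta

def eefr(per_hop_recovery, p_loss_pp, hops):
    """Illustrative end-to-end failure rate: each hop independently loses
    a Capsule with prob p_loss_pp and recovers it with per_hop_recovery."""
    p_hop_fail = p_loss_pp * (1.0 - per_hop_recovery)
    return 1.0 - (1.0 - p_hop_fail) ** hops
```

Note that with a completely reliable RSP (p_loss_rsp = 0), per-hop recovery is certain and the modeled EEFR is 0% regardless of the PP loss rate, matching property (i).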

1) FEASIBILITY ANALYSIS
The performance assurance of DRTP discussed above can be affected under heavy link disruptions by the following two factors: (i) The violation of the collaboration recursiveness in any assumption listed above. It can be triggered by the loss of a Report, which can further cause premature Request sending during retransmission. The loss rate of the Report increases along with the loss rate of the PP.
(ii) The failure rate of retransmission at each hop of the PP. This can be impacted by the number of critical links in the PP, where a critical link is one without RSPs available at its exit point. Therefore, to improve the assurance, the structural features of a typical MPSG under a certain LLR can be summarized as follows: (a) The length of the PP should be short enough, so that sending Reports ζ times can be more reliable. Furthermore, ζ can be increased to improve the reliability. (b) The number of critical links should be kept as small as possible.
In comparison, a non-typical MPSG is more likely to suffer an increased EEFR due to either of the two factors mentioned above.
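The benefit of duplicating the Report ζ times in feature (a) can be illustrated under an independent-loss assumption of our own (not the paper's model): a hop's Report delivery succeeds if at least one of the ζ copies survives the link.

```python
def report_delivery_prob(llr, zeta):
    """Probability that at least one of zeta duplicated Reports survives a
    link whose loss rate is llr, assuming independent losses."""
    return 1.0 - llr ** zeta
```

For example, at a 20% LLR, ζ = 3 raises per-hop Report delivery from 80% to 99.2%, which is why a shorter PP combined with duplicated Reports keeps the recursiveness more dependable.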

V. PERFORMANCE EVALUATION
The proposed DRTP is evaluated in simulations to demonstrate its performance under serious link disruptions.

A. SIMULATION ENVIRONMENT
Fig. 14 presents the simulation environment for evaluating the performance of DRTP. The environment is built on the ndnSIM simulator with the embedded NDN forwarding daemon [26], based on which we realize the prototype of DRTP. The entire environment has been released as open source online [33] for repeating the experiments. Concretely, in the simulator, the four components (see the green box) are designed as follows: (a) The DRTP forwarding strategy realizes the mechanism described in Section III. (b) The PMU and PDC simulations are in accordance with the IEEE C37.118.2 standard [7]. (c) The topology loader converts an IEEE bus dataset file to a directed graph stored in the simulator. (d) The DRTP tester provides the main function of executing a simulation according to the experimental parameters given in a configuration file. The file has the suffix ''.ini'' and includes the path to a bus topology file in IEEE or SC format. We execute each simulation by running the tester with the waf program provided by ndnSIM. Concretely, the configuration includes parameters in the following aspects: (a) the capacity of the CS [26] is 10 000 packets, to avoid the cache mismatch caused by cache replacement [24]; (b) η and ζ mentioned in Section III-C both equal 3, to deal with serious link disruptions; and (c) the maximum waiting time for the reordering of Capsules mentioned in Section III-B is 400 ms, which is 20 times the PMU sending period, to ensure the correctness of the reordering. For each link: (a) the transmission rate is 10 Gbps, considering that the link is usually an optical line according to IEC 60794-4-10 [15]; (b) the maximum queueing capacity is 100 packets, i.e.
the maximum queueing delay of each link is 2 µs [27], to avoid packet drops caused by a full queue; and (c) the processing delay is 1 µs, i.e., the maximum packet processing rate is 1 million packets per second (MPPS). 1 MPPS is ordinary for the existing commercial solutions for the grids [34], and can be easily achieved by the existing solutions for the NDN data plane [35], [36]. For example, the hierarchical aligned transition arrays with 15 pipeline stages in FPGA can achieve 125 MPPS with a latency of 0.12 µs [37]. In addition, NameTrie in software can achieve 3.56, 3.72, and 3.25 million name insertions, lookups, and removals per second, respectively [38]. Additionally, the bandwidth and geographical distance of each link are 10 Gbps and 30 km as general settings [15], respectively, so that the link transmission delay is 100 µs.
All the link delay parameters above are used only by DRTP to estimate the total and queueing delays of each link in the network, as mentioned in Section IV. Meanwhile, the PMU sending frequency is typically chosen as 50 Hz according to IEEE C37.118.2 [7].
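Under the settings above, each link's total delay τ_all can be tallied as in this sketch; the speed-of-light propagation model is our assumption, consistent with the 100 µs figure quoted for a 30 km link.

```python
# Illustrative per-link delay budget from the simulation settings.
PROPAGATION_S = 30e3 / 3e8   # 30 km at ~3e8 m/s -> 100 us
MAX_QUEUEING_S = 2e-6        # 100-packet queue bound [27]
PROCESSING_S = 1e-6          # 1 MPPS forwarding

def link_total_delay():
    """tau_all for one link: propagation + max queueing + processing.
    Serialization at 10 Gbps is negligible for PMU-sized packets."""
    return PROPAGATION_S + MAX_QUEUEING_S + PROCESSING_S
```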

3) CONFIGURATION OF SIMULATIONS
The tester allows configuring the simulation time and setting up link fault events with different levels and ranges, each scheduled at a given time. Each simulation records all sent and received messages, as well as events, into logs that are used to evaluate the performance of DRTP.
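A tester configuration of the kind described above might look as follows. This is a hypothetical ''.ini'' sketch: the section and key names are illustrative only and are not taken from the released source.

```ini
; Hypothetical DRTP tester configuration sketch (key names are illustrative).
[simulation]
duration = 33              ; seconds
topology = topo/ieee300.txt

[drtp]
eta = 3                    ; max Request retries per hop
zeta = 3                   ; duplicated Report copies
reorder_timeout_ms = 400   ; max Capsule reordering wait

[faults]
start = 3                  ; seconds into the run
llr = 0.2                  ; link loss ratio
range = pp                 ; pp | mpsg
```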

B. EXPERIMENT DESIGN AND METRICS
Taking into account the two degrading factors in Section IV-D, we design experiments to study the performance of DRTP with the typical and general MPSGs as follows: (i) The typical MPSG. The MPSG is constructed according to Fig. 2, obtained from a real power transmission grid [28], where each node represents a DRTP-enabled router. The MPSG features zero critical links in the PP and a short PP. Hence, we verify that DRTP achieves a 0% EEFR and a bounded EEDT no greater than 15.696 ms under the two assumptions given in Section IV, where the EEDT bound is computed using Eq. (9). Furthermore, the quality of the PMU packet delivery process is tested using two metrics: (a) the out-of-order rate at the consumer (OOR), and (b) the packet reordering time at the consumer (PRT). In addition, the average MSRs are tested to demonstrate the message overhead of the DRTP. In this case, the MPSG is without an overly long PP or critical PP links, which makes it easier for DRTP to keep the collaboration recursiveness under the PP disruption, thereby verifying the performance assurance given in Section IV-D.
(ii) The general MPSGs. We test DRTP with the MPSGs generated using Alg. 1 for two transmission grids modeled from the real world, namely the IEEE 300-bus dataset [28] and the SC 500-bus dataset [30]. In each grid, we randomly choose 150 communication pairs and generate the corresponding MPSG for each pair. The MPSGs are versatile in the length of the PP and the number of critical links in the PP (see Fig. 3). This is used to test the performance of DRTP under the recursiveness violation and retransmission failures in the adverse conditions mentioned in Section IV-D. We test the performance in terms of EEFR, EEDT, and the satisfaction ratio of EEDT (SRE). SRE is the ratio of the number of pairs whose EEDTs are no greater than the EEDT bound computed using Eq. (9) to the total number of pairs.
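The SRE metric can be stated compactly as in this sketch, where the Eq. (9) bound is passed in as a precomputed number and the helper name is our own.

```python
def sre(eedts, bound):
    """Satisfaction ratio of EEDT: the fraction of communication pairs
    whose measured EEDT does not exceed the Eq. (9) bound."""
    return sum(1 for t in eedts if t <= bound) / len(eedts)
```

For instance, with measured EEDTs of 10, 14, 15, and 20 ms against the 15.696 ms bound of the typical MPSG, the SRE is 0.75.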
For each case above, we run the simulation for 33 s, which is sufficient to capture the retransmission dynamics. Meanwhile, all link fault events start at 3 s, when each router in the PP can receive at least one Capsule without retransmission, to ensure the first assumption. Each link fault is simulated at a different level measured by the LLR, which is the probability of randomly discarding the arriving packets on the link. Meanwhile, faults with the same LLR for each link are simulated for a certain time in the two different ranges below: (a) Full disruption of all links of the PP. This is used to test the success situation, since at least one reliable RSP for each PP router ensures a 0% EEFR according to Section IV-D. (b) Complete disruption of all links of the MPSG. This is used to test the success and failure situations for relatively smaller and higher LLRs, respectively. The higher LLRs can ensure the second assumption that triggers the condition #3 or #4. However, too high LLRs will lead to the assumption violation, deteriorating the corresponding EEFRs.
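An LLR fault as described above, where a link independently discards each arriving packet with probability LLR, can be modeled as in this hypothetical sketch (the helper name and seeding interface are ours):

```python
import random

def make_lossy_link(llr, seed=None):
    """Return a delivery function for a link under an LLR fault: each
    arriving packet is discarded independently with probability llr."""
    rng = random.Random(seed)
    def deliver(packet):
        return None if rng.random() < llr else packet
    return deliver
```

At llr = 0 every packet passes, and at llr = 1 the link is fully disrupted, matching the two extreme ranges tested above.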
Accordingly, we compare the performance of the DRTP with that of the existing hop-by-hop control approaches, including MFTP [18], R2T [19] and SDATP [20]. In the tests, they are configured with a maximum of 3 retransmission attempts, the same as DRTP, under the PP disruption. The tests cover both the typical and general MPSGs as follows: (i) For the typical MPSG, DRTP is tested in EEFR as well as OOR and EEDT to reveal the effectiveness and efficiency during retransmission, respectively. In addition, we measure the load capacity per router through the mean number of DRTP messages arriving at the router per second (MAR), because each message can be processed by the router only in a limited time. (ii) For the general MPSGs, DRTP is tested in EEFR, EEDT, and OOR to reveal the feasibility of DRTP in obtaining its advantages in the generated MPSGs with different path-length features.

C. PERFORMANCE OF DRTP UNDER DISRUPTIONS
The performance of DRTP is comprehensively evaluated in the typical MPSG and the general MPSGs as follows.

1) DRTP PERFORMANCE IN TYPICAL MPSG
The performance statistics of the DRTP under the PP disruption are demonstrated in Fig. 15 as follows. Concretely, Fig. 15(a) shows that the EEFR stays at 0%. Meanwhile, Fig. 15(b) demonstrates that the maximum EEDTs are within the range of [0.319, 13.817] ms, all less than the above bound. Meanwhile, the maximum EEDT falls from 12.895 ms at an LLR of 97% to 1.351 ms at 100% LLR, thanks to the residual penalty under the condition #2. Hence, these results confirm the recursiveness. In addition, Fig. 15(c) indicates that the OOR remains at 0%, because the maximum EEDT is less than the PMU sending period.
Furthermore, the performance statistics under the MPSG disruption are depicted in Fig. 16 as follows. Figs. 16(c) and 16(d) depict that the OOR and the maximum PRT remain at 0% and 0 ms when the LLRs are in [0, 19]%, because the corresponding EEDTs are no more than the sending period. They start to deteriorate for LLRs greater than 20%.
Meanwhile, Fig. 17 depicts that the average MSRs for the DRTP under the PP and MPSG disruptions all have constant upper limits. These statistics confirm the correctness of the message complexity analysis given in Section III-C.
Therefore, the DRTP can keep a 0% EEFR and a bounded EEDT under the full PP disruption for [0, 100]% LLRs and under the complete MPSG disruption for [0, 18]% LLRs. Meanwhile, the above results also demonstrate the high-quality packet delivery, with an EEDT no greater than 20 ms and a 0% OOR, as generally demanded in IEEE C37.118.2 [7].

2) DRTP PERFORMANCE IN GENERAL MPSGs
Under the full PP disruption, Figs. 18 and 19 demonstrate the changes in EEDT under different LLRs for the two topologies, respectively, including the mean EEDTs at 20%, 60%, and 100% LLRs. First, Fig. 20(a) depicts that the EEFRs of MFTP, R2T and SDATP all deteriorate with increasing LLR, due to the lack of exploitation of the RSPs during their retransmissions in coping with the corresponding multiplicative increase in PP losses. Meanwhile, the EEFR of SDATP is worse than those of MFTP and R2T, due to the distributed collaboration issue. In comparison, DRTP has the advantage of a much reduced EEFR of 0% during the PP disruptions. Second, Fig. 20(b) depicts the load capacity per router in MAR under different LLRs. When the LLR is no greater than 0%, 6%, and 16%, the MAR of DRTP is greater than those of SDATP, R2T, and MFTP, respectively. Additionally, when the LLR is no less than 17%, the MAR of DRTP increases further, exceeding those of the comparative solutions. This is because DRTP does its best to perform more retransmissions to effectively recover the packet losses in the PP with a 0% EEFR. Moreover, to measure how reliable a retransmission is in Fig. 20, we assign the EEDT of a lost packet on the PDC side an impossibly large value of 10 s. In detail, when the LLR is 0%, 25%, 50%, 75%, and 100%, the maximum EEDTs over all pairs of DRTP are 413 µs, 13.82 ms, 13.82 ms, 13.82 ms and 1.35 ms, respectively. In comparison, at the same LLRs, the EEDTs of (100, 100, 100)%, (68.6, 75.9, 31.13)%, (14.07, 29.93, 6.26)%, (0.67, 37.33, 0.33)% and (0, 0, 0)% of the pairs of MFTP, R2T and SDATP are no greater than the corresponding maximum EEDTs of DRTP, respectively. The above results confirm an obvious advantage of DRTP in significantly low EEFR and EEDT with an acceptable message load capacity.

Fig. 21 demonstrates the performance comparison of DRTP with MFTP, R2T and SDATP in the general MPSGs as follows. Figs. 21(i) and 21(j) show their mean OORs in the two topologies, respectively. In detail, the OORs of MFTP and R2T remain at 0%, while the OORs of SDATP and DRTP at 20%, 60%, and 100% LLRs are (12.4, 0.11, 0)% and (1.08, 9.59, 0)%, respectively. However, MFTP, R2T, and SDATP have significantly higher EEFRs than DRTP, meaning that DRTP spends more effort to recover lost packets.

VI. CONCLUSION AND FUTURE WORK
The proposed DRTP enables resilient hop-by-hop retransmission control that exploits the plentiful RSPs to increase the reliability of the PP. Moreover, the recursiveness of the retransmissions is kept to address the distributed collaboration issue, which has not been addressed by the existing approaches. The performance of DRTP is evaluated for both the typical MPSG with high reliability and the general MPSGs constructed by our algorithm from the IEEE 300-bus and SC 500-bus topologies, all modeled according to real transmission grids. The evaluation is conducted with comparisons to the existing hop-by-hop transport control approaches in terms of EEFR and EEDT. The numerical results show that DRTP can significantly reduce the EEFR with a low EEDT under serious link disruptions in these two kinds of MPSGs. For future work, several directions are worthy of research effort, including: (a) the generation of MPSGs with an EEDT constraint, to further optimize the performance of DRTP, and (b) the extension to other machine-to-machine communications, e.g., the recently studied 5G-based smart grid communications [1].
CHUNMING WU received the Ph.D. degree in computer science from Zhejiang University, in 1995. He is currently a Professor with the College of Computer Science and Technology, Zhejiang University, where he is also the Associate Director of the Research Institute of Computer System Architecture and Network Security, and the Director of the NGNT Laboratory. His research interests include software-defined networking (SDN), SDN and cloud security, proactive network defense, and intelligent cloud networks.