Learning-Based Adaptive Sliding-Window RLNC for High Bandwidth-Delay Product Networks

Sliding-window random linear network coding (RLNC) is well suited to achieving low in-order delivery delay in future-generation networks characterized by lossy links. In high bandwidth-delay product networks, however, integrating RLNC with transmission control protocol (TCP) flow and congestion control poses a significant challenge. In this paper, we propose a reinforcement learning (RL) framework that addresses this issue by decoupling the RLNC sliding window from TCP and dynamically adjusting it to improve goodput, in-order delivery delay, and decoding complexity. The RL agent adjusts the RLNC sliding window autonomously and independently of TCP, which allows RLNC to adapt to varying network conditions without prior knowledge of the network's characteristics. By leveraging the benefits of RLNC and TCP separately, we achieve more efficient utilization of network resources. These improvements are crucial because network coding always involves a trade-off between goodput, delay, and decoding complexity, and minimizing this trade-off is very challenging. Compared to the state-of-the-art, goodput is improved by up to 11%, the in-order delivery delay is reduced by 9%, and decoding complexity is improved by up to 45%.


FIGURE 1.
A network model with an RLNC encoder at the sender device and an RLNC decoder at the receiver device.
packets and sends them to the receiver via the bottleneck link between router 1 (R1) and router 2 (R2). The sender can choose to send only data packets (No NC), coded packets, or a mix of both depending on the user's requirement.
On the receiver side, the RLNC decoder allows for the decoding of the original packets using a subset of received coded packets. After decoding a coded packet, the receiver sends an acknowledgment (ACK) packet to the sender to confirm the receipt of said coded packet. Instead of relying on receiving all the original packets individually, the receiver can use a set of coded packets to reconstruct the original packets. This property of RLNC provides robustness against packet loss and network errors, as any subset of coded packets is potentially sufficient for decoding.
RLNC has several advantages in lossy network environments, such as wireless networks or networks with high congestion. It can improve the reliability of data transmission by reducing the impact of packet loss and increasing the likelihood of successful data recovery. Additionally, RLNC can enhance network efficiency by reducing retransmissions and enabling opportunistic data forwarding.
RLNC can be broadly categorized into block RLNC and sliding-window RLNC. Block RLNC operates on blocks of packets whose size is predetermined and remains constant throughout the transmission. Encoding is performed across the packets of a block: the sender randomly generates coefficients and linearly combines the collected packets using these coefficients, so that each coded packet contains a linear combination of the original packets in the block. Because the receiver cannot decode a block until it has collected enough coded packets from that block, this creates a blocking delay, which is undesirable for applications expecting in-order delivery. Unlike block RLNC, sliding-window RLNC adapts to varying network conditions and packet arrivals. The sender and receiver both initialize a sliding window, whose size represents the number of packets that can be encoded or decoded at a given time.
The sender collects a certain number of original packets based on the current window size. The number of packets collected depends on the window size and can vary as the window slides. The receiver collects the received coded packets within its window size. The number of received packets in the window can vary due to network conditions and potential packet loss. The receiver needs a sufficient number of linearly independent coded packets within the window to successfully decode the original packets. After decoding, both the sender and receiver slide their windows to accommodate new packets. The window slides by a certain number of packets, which can be a fixed amount or dynamically adjusted based on network conditions or feedback from the receiver.
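As an illustration of the encoding and decoding steps described above, the following minimal Python sketch generates random linear combinations of a block and recovers the block by Gaussian elimination. For brevity it assumes the binary field GF(2), where a linear combination reduces to XOR; this is a simplification and not the field size of any particular scheme. The function names `encode` and `decode` are hypothetical.

```python
import random


def encode(packets, rng):
    """Return (coeffs, payload): a random GF(2) combination of equal-length packets."""
    n = len(packets)
    coeffs = [rng.randint(0, 1) for _ in range(n)]
    if not any(coeffs):                      # skip the useless all-zero combination
        coeffs[rng.randrange(n)] = 1
    payload = bytes(len(packets[0]))
    for c, p in zip(coeffs, packets):
        if c:
            payload = bytes(a ^ b for a, b in zip(payload, p))
    return coeffs, payload


def decode(coded, n):
    """Gauss-Jordan elimination over GF(2); returns the n packets, or None if rank < n."""
    rows = [(list(c), bytearray(p)) for c, p in coded]
    for col in range(n):
        pivot = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if pivot is None:
            return None                      # not enough independent combinations yet
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                rows[r] = (
                    [a ^ b for a, b in zip(rows[r][0], rows[col][0])],
                    bytearray(a ^ b for a, b in zip(rows[r][1], rows[col][1])),
                )
    return [bytes(rows[i][1]) for i in range(n)]
```

Calling `decode` with fewer than `n` linearly independent coded packets returns `None`, which mirrors the receiver's need for a sufficient number of independent combinations before the original packets can be recovered.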
Sliding-window RLNC avoids blocking delay by encoding across the not-yet-acknowledged (non-ACKed) packets, which also form the transmission control protocol (TCP) sliding window ($TCP_w$). As new acknowledgments (ACKs) arrive, the sliding window moves forward, and with it the set of data packets that will be encoded from then on. This allows the receiver to decode as soon as it has sufficient encoded packets to reconstruct a data packet. RLNC provides redundancy for reliability, while tuning the TCP sliding window affords better flow and congestion control.
Sliding-window RLNC provides flexibility in terms of adding new data packets for encoding while removing older ones. In technical terms, this is known as closing the window. Earlier algorithms adopted an infinite window, where every encoded packet contains all previously transmitted data packets [5], [6], [7]. These algorithms, however, are not practical because of their excessive memory usage and computational complexity. Later, a finite-window approach was introduced in which the number of data packets included in the encoding process does not exceed a predefined limit, which is generally $TCP_w$, e.g., [8], [9], and [10].
Most finite-window algorithms are also systematic, which means that they place encoded packets among a stream of data packets (as shown in Fig. 1). Fig. 2 shows an example of the systematic finite sliding-window operation. $\{s_1, s_2, \ldots, s_n\}$ represent data packets, and $\{c_1, c_2, c_3, \ldots, c_n\}$ represent encoded packets. The coding rate ($R$) determines the insertion of an encoded packet into the stream after every $k$ data packets; hence,

$$R = \frac{k}{k+1}. \tag{1}$$

In this example (Fig. 2), $R = 3/4$, which means that an encoded packet is inserted after every $k = 3$ data packets. The red cross shows a packet loss during transmission. The dotted and dashed lines show the range of data packets involved in the encoding process of each $c_i$.

In this paper, we highlight the issue of either having a common window for TCP flow control and encoding or having an independent but constant encoding window, i.e., $E_w$. We propose an adaptive learning-based sliding-window RLNC framework called LS-RLNC to solve the decision problem of how $E_w$ should evolve over time to cope with the changing network environment. LS-RLNC utilizes reinforcement learning (RL) to achieve this. The goal is to maximize the overall goodput while keeping the in-order delivery delay as low as possible.
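The systematic placement described above can be sketched in a few lines of Python. The sketch assumes that the coding rate takes the form R = k/(k+1), which matches the R = 3/4, k = 3 example; the function names are hypothetical.

```python
from fractions import Fraction


def coding_rate(k):
    """Coding rate when one encoded packet follows every k data packets."""
    return Fraction(k, k + 1)


def systematic_schedule(n_data, k):
    """Return the transmission order, e.g. s1 s2 s3 c1 s4 ... for k = 3."""
    schedule = []
    coded_index = 0
    for i in range(1, n_data + 1):
        schedule.append(f"s{i}")
        if i % k == 0:                 # every k-th data packet is followed by c_j
            coded_index += 1
            schedule.append(f"c{coded_index}")
    return schedule
```

For example, `systematic_schedule(6, 3)` places `c1` after `s3` and `c2` after `s6`.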
RL is a branch of machine learning that focuses on training an agent to make sequential decisions in an environment to maximize a long-term reward. It involves training an agent by interacting with an environment and getting positive or negative feedback as a reward for the actions taken by the agent.
In the context of network communication, RL can be applied to optimize various aspects of network performance, such as throughput, latency, energy efficiency, and resource allocation, as suggested by the literature [11], [12], [13], [14], [15]. In NC, RL is generally used in decision-making scenarios. For example, [11] uses RL to solve the decision problem of when the sender should transmit a coded packet in a systematic NC scenario in nonterrestrial networks. The ability of RL to learn the ideal action-value function by engaging with the environment, without a priori knowledge, is therefore valuable to NC. This is appealing for our decision problem because modeling it mathematically under changing network conditions is complex. Therefore, we choose RL to dynamically evolve $E_w$.
This paper focuses on sliding-window RLNC in particular and other sliding-window NC schemes in general. LS-RLNC can be modified to accommodate other sliding-window NC variations (e.g., [16], [17], [18], [19]). We highlight our contributions in this paper as follows:
• A practical and adaptive learning-based sliding-window RLNC scheme called LS-RLNC with congestion and delay feedback is proposed. We carefully devise the state space and reward function to maximize goodput and reduce in-order delay.
• LS-RLNC utilizes the explicit congestion notification (ECN) feedback and the receiver's ACK feedback to evolve $E_w$ carefully.
• We have implemented and evaluated LS-RLNC in Mininet and Python 3. LS-RLNC outperforms the state-of-the-art in goodput, in-order delay, and decoding complexity.

II. RELATED WORK
Over the years, several sophisticated schemes have improved the overall performance of the sliding-window approach by carefully designing the placement mechanism of encoded packets [5], [11], [20]. The authors of [21] use an adaptive algorithm for sliding-window RLNC to estimate channel conditions and adjust the retransmission rates to improve the overall performance. Caterpillar RLNC (CRLNC) [22] is another sliding-window RLNC variant that does not rely on feedback and focuses on decoding simplification. CRLNC with feedback (CRLNC-FB) [10] and [23] are continuations of CRLNC that embed selective-repeat automatic repeat request (ARQ) and multihop support into CRLNC, respectively. The authors of [3] propose a lightweight network-coded ARQ protocol for ultra-reliable low latency communication (URLLC). In that work, the authors decouple the RLNC sliding window from TCP and show through simulations that it outperforms selective-repeat ARQ and CRLNC-FB [10].
Compared to retransmission-based schemes, sliding-window RLNC schemes are efficient in minimizing the overall end-to-end in-order delivery delay, which is crucial to most future network delay-sensitive applications [5], [7].
In general, these schemes (except [3]) trade throughput for improved delay by placing multiple encoded packets as redundancy. However, the problem of using $TCP_w$ as the encoding window remains. This is problematic because i) it deeply impacts the encoding process and ii) it impacts the complexity-performance trade-off. The window determines the nature of each encoded packet that is created, i.e., the maximum number of data packets involved in its encoding. In addition, it determines the size of the decoding matrix, i.e., the maximum number of packets required to perform decoding. Further, it impacts the size of the encoding vector of each encoded packet, consequently affecting the encoding quality.

Apart from added complexity and delay, there is another major issue in using a common window for both flow control and encoding: the inability of $TCP_w$ to cope with bursts of errors for encoding. This problem is highlighted in Fig. 3. The data packets $s_5$ and $s_6$ are lost during the transmission. Two linearly independent encoded packets are required to recover the lost data packets. In the case of a common $TCP_w$ (Fig. 3(a)), the decoder has to wait for both $c_2$ and $c_3$ because they include $s_5$ and $s_6$ in their composition. However, when $c_3$ is received, the sliding window has moved forward and $s_1$ and $s_2$ are already deleted (delivered). In this scenario, $c_2$ and $c_3$ are not enough for decoding. This issue can be handled with an independent encoding window ($E_w$), as shown in Fig. 3(b). Here, the size of $E_w$ is 4 instead of 6. In this case, $c_2$ and $c_3$ can be utilized for decoding to obtain the missing data packets because $s_5$ and $s_6$ are still within range and the dependence on previous data packets is limited.

III. LS-RLNC SYSTEM MODEL
We outline the LS-RLNC system model in this section. The system model comprises two modules named the network module and the RL module. Fig. 4 shows the workflow of LS-RLNC and the integration of its modules in relation to the network model (Fig. 1). The details of each module are as follows.

A. NETWORK MODULE
This section explains the theory and network components of LS-RLNC. We propose an independent learning-based $E_w$ that can adapt well to changing network conditions. In general,

$$E_w \le TCP_w.$$

According to this relation, the current sliding-window RLNC schemes can be considered a special case where $E_w = TCP_w$. However, we can also define $E_w$ as

$$E_w = d \times k, \tag{2}$$

where $d$ is defined as the encoding depth, i.e., the number of $k$-packet groups participating in the encoding process, and $k$ represents the number of data packets as indicated in (1). The decoding complexity is $O(M^3)$, where $M$ is the decoding matrix size [3], and with a common window the size of $M$ depends on $TCP_w$. Therefore, limiting $E_w$ to a subset of $TCP_w$ is beneficial because i) it reduces the decoding complexity, consequently improving decoding delay, and ii) it simplifies the encoding process by shortening the encoding vector. However, the size of $E_w$ is given by (2) and is therefore affected by $d$ and $k$. This implies that $E_w$ depends on (1) as well. How $E_w$ should evolve thus becomes a decision problem, which is the central focus of this paper.
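To give a feel for the cubic dependence on the decoding matrix size, the following sketch compares the worst-case decoding cost of a reduced encoding window with that of a common window. It assumes that the encoding window is the product of the encoding depth d and the group size k, as the definitions of d and k suggest, and that the decoding matrix size M is bounded by the window size. All numeric values and function names are hypothetical.

```python
def encoding_window(d, k, tcp_w):
    """Encoding window size for depth d and group size k, bounded by the TCP window."""
    e_w = d * k
    if e_w > tcp_w:
        raise ValueError("the encoding window must not exceed the TCP window")
    return e_w


def relative_decoding_cost(e_w, tcp_w):
    """Ratio of O(M^3) worst-case costs when M is bounded by e_w instead of tcp_w."""
    return (e_w / tcp_w) ** 3
```

With the hypothetical values d = 2, k = 3, and a TCP window of 24 packets, the encoding window holds 6 packets and the worst-case cubic cost drops to 1/64 of that of the common window.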
Consider that data from an application are buffered in the TCP sender queue as $\{s_0, s_1, \ldots, s_n\}$. Each data packet comprises the same number of bits, $K$. Whenever $TCP_w$ allows a sending opportunity, a data or encoded packet is sent according to (1). Suppose a data packet $s_{i_{sq}+1}$ is chosen for transmission; then, $i_{sq}$ is the index of the most recent data packet that was sent. If an encoded packet, e.g., $c_k$, is next in line to be transmitted, then it is created by encoding all data packets present in $E_w$ as

$$c_k = \sum_{x=l_E}^{u_E} g_x^{(k)} s_x, \tag{3}$$

where $l_E$ is the index of the last non-ACKed data packet in $E_w$, the upper limit of $c_k$ is $u_E = i_{sq}$, and $g_x^{(k)}$ are coefficients randomly chosen over a finite field $GF(2^m)$. The values of $l_E$, the index of $c_k$, and $u_E$ are carried in the packet header for identification.
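The creation of an encoded packet over the encoding window can be sketched as follows. The sketch assumes m = 8, i.e., GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (our choice for illustration; the text does not fix a polynomial), and draws nonzero coefficients. The header layout, the returned coefficient dictionary, and the function names are hypothetical.

```python
import random


def gf256_mul(a, b):
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce when the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def encode_window(packets, l_e, u_e, k, rng):
    """Encode packets with indices l_e..u_e (inclusive) into one coded packet.

    `packets` maps a packet index to its payload (equal-length bytes).
    """
    size = len(packets[l_e])
    coeffs = {x: rng.randrange(1, 256) for x in range(l_e, u_e + 1)}
    payload = bytearray(size)
    for x, g in coeffs.items():
        for pos, byte in enumerate(packets[x]):
            payload[pos] ^= gf256_mul(g, byte)   # addition in GF(2^8) is XOR
    header = {"index": k, "l_E": l_e, "u_E": u_e}
    return header, coeffs, bytes(payload)
```

In a real implementation the coefficients would typically be conveyed or regenerated at the receiver; here they are simply returned next to the header so that the combination can be inspected.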
At the receiver end, arriving packets (data/encoded) are buffered. At the application level, data packets are delivered in order. Let $i_{od}$ represent the index of the most recent in-order delivered data packet. $s_{i_{od}+1}$ is delivered to the application when it becomes available; otherwise, the lost data packet is retrieved from the available encoded packets by initiating the decoding process. During this time, arriving data packets are held in the buffer until the lost data packet is recovered and delivered to the application. During this period, more data packets can potentially be lost, e.g., in the case of a burst error or congestion control kicking in. Equations (4) and (5) form the limits of the decoding window ($D_w$), which is given by (6). The decoding is done using on-the-fly Gaussian elimination [24]. When the number of encoded packets in $D_w$ equals the number of lost data packets, the decoder will most likely succeed given a large enough $GF(2^m)$. The application then receives the recovered data packets, and the index of the most recent in-order data packet is updated as $i_{od} = u_D$, where $u_D$ is the upper limit of $D_w$.
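The remark that the decoder will most likely succeed for a large enough field can be quantified under a simplifying assumption: if the coefficients that multiply the n lost packets in n received encoded packets are independent and uniform over GF(q), the resulting n x n matrix is invertible with probability (1 - 1/q)(1 - 1/q^2)...(1 - 1/q^n). The sketch below evaluates this product; the function name is hypothetical.

```python
def full_rank_probability(n, q):
    """Probability that a uniformly random n x n matrix over GF(q) is invertible."""
    probability = 1.0
    for i in range(1, n + 1):
        probability *= 1.0 - q ** (-i)
    return probability
```

For four lost packets, the binary field gives a success probability of roughly 0.31, whereas GF(2^8) gives more than 0.99, which is why a larger field makes decoding succeed as soon as enough encoded packets have arrived.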
$TCP_w$ is directly affected by incipient congestion (IC) because it controls the sender's rate. We utilize this information for intelligent decision-making about $E_w$. We adopt enhanced explicit congestion notification (EECN) [25] to obtain an early insight into the network condition. The EECN feedback mechanism is given in Algorithm 1 [26]. The EECN mechanism dynamically adjusts the ECN markings on packets, providing more precise feedback to the sender. EECN informs the sender of incipient congestion without waiting for the receiver's feedback. This is done by exploiting programmable network devices that send statistics directly to the sender using software-defined networking (SDN).
From [20], we obtain the expected decoding delay for all in-flight packets in $TCP_w$ as given in (7), where $f$ is the fraction of encoded packets in $TCP_w$ and $p_e$ is the packet loss rate. We denote by $T^E_t$ the expected decoding delay for all in-flight packets in $E_w$. Because we have decoupled $E_w$ from $TCP_w$, we can redefine $f$ as the fraction of encoded packets in $E_w$. This is valid because the encoding is performed according to (2) instead of $TCP_w$. Note that both $f$ and $p_e$ can be estimated. We estimate $f$ as

$$f_t = \frac{\text{number of encoded packets in } E_w}{\text{number of all packets in } E_w}. \tag{8}$$

This intrinsically allows LS-RLNC to have a lower in-order delay compared to typical sliding-window RLNC schemes. Similarly, we can estimate $p_e$. From this discussion, we can write the expected decoding delay as in (9).

Finally, as a safety net, we also implement a retransmission policy that is triggered only when $E_w = W$ and the decoding process fails to recover the lost packet within a certain time period $T$.
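The two estimates mentioned above can be sketched as simple ratios. The estimate of f follows (8) directly; the text does not state how the loss rate is estimated, so the ratio of unacknowledged to sent packets used below is only an assumption, and both function names are hypothetical.

```python
def encoded_fraction(window):
    """Estimate f_t as in (8): encoded packets in the window over all packets in it.

    `window` is a sequence of packet kinds, each either "data" or "coded".
    """
    if not window:
        return 0.0
    return sum(1 for kind in window if kind == "coded") / len(window)


def loss_rate(sent, acked):
    """Assumed estimator of p_e: share of sent packets that were never acknowledged."""
    if sent == 0:
        return 0.0
    return (sent - acked) / sent
```

For a window holding three data packets and one encoded packet, `encoded_fraction` returns 0.25, consistent with a coding rate of 3/4.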

B. REINFORCEMENT LEARNING MODULE
In this section, we explain the use of RL in LS-RLNC. The integration of the RL module of LS-RLNC with the network module is shown in Fig. 4. The goal is to evolve $E_w$ such that goodput is maximized and in-order delay is kept as low as possible. To achieve this, we first record some network information. The sender keeps track of the timestamps of each data and encoded packet, the total number of each type of packet sent up to each timestamp, and the packets comprising $E_w$. The sender also stores information from ACK packets, namely, the ID of the last received packet, the value of $i_{od}$, $E_w$, and the number of acknowledged data and encoded packets. We now formulate the RL design for LS-RLNC, which comprises an action space, a state space, and a reward function.

1) ACTION SPACE
Our action space is rather simple: the agent's actions determine how $E_w$ evolves.

2) STATE SPACE
We design the state space of LS-RLNC as given in (10), where $D_w^t$ is $D_w$, $E_w^t$ is $E_w$, and $IC_t$ is the level of incipient congestion at time $t$. Eq. (10) gives us the necessary information to estimate the end-to-end in-order delay. $IC$ is zero if there is no congestion and nonzero when congestion is anticipated; hence, the effect of queuing delay is captured. Further, $D_w$ and $T^E$ give us the overall decoding delay. Additionally, note that these variables are linked to goodput as well. Recall from Section III that data are delivered in order to the application when $s_{i_{od}+1}$ is received or when a coded packet completes the decoding process. In both cases, goodput is realized when $D_w$ is 0. Consequently, the RL agent can track the goodput and in-order delay using ACK feedback and the rest of the stored records at the sender. From the above discussion, we infer that $D_w = 0$ being true after a decoding process marks the end of an episode of LS-RLNC; any state $S_t$ with zero $D_w$ is a terminating state. After each episode, the increase in the number of in-order delivered data packets ($i_{od}$) since the start of the episode, divided by the episode duration, yields the realized goodput. If decoding is performed and/or the value of $i_{od}$ does not increase, a negative reward relative to $T^E$ and $D_w$ is given.

3) REWARD FUNCTION
The formulation of the action space and the state space leads us to formulate our reward function as given in (11), where $a$, $b$, and $c$ are tunable nonzero values. Note that a negative reward is given when $D_w$ is zero but $i_{od}$ is not increased. This is because it would be considered a wasted transmission (encoded packet) with no goodput.

We adopted Q-learning in this work because it uses a simple value iteration update method, which is feasible for online learning. Q-learning updates the action-value function as

$$Q(S_t, a_t) \leftarrow Q(S_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, a_t) \right], \tag{12}$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r_{t+1}$ is the immediate reward. Further, we adopted the $\epsilon$-greedy method to ensure exploration during the initial phase of learning. With $\epsilon$-greedy, a random action $a_t$ is taken with probability $\epsilon$, and a greedy action is taken with probability $1 - \epsilon$. The RL-related parameters, given in Table 1, were selected following common RL practices and exhaustive trial-and-error experimentation.
No function approximation was used in this work because $D_w$, $E_w$, and $IC$ are discrete and do not exceed their limits. Therefore, using tabular methods for RL is feasible because the state space is not large.
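A tabular Q-learning agent with epsilon-greedy exploration, as described above, can be sketched as follows. The three-action set, the state tuple (D_w, E_w, IC), the default hyperparameter values, and the class name `QAgent` are assumptions made for illustration; they are not the values of Table 1 or the action space used in the evaluation.

```python
import random
from collections import defaultdict

ACTIONS = ("shrink", "keep", "grow")    # hypothetical ways of adjusting the window


class QAgent:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=None):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)             # (state, action) -> value
        self.rng = random.Random(seed)

    def act(self, state):
        if self.rng.random() < self.epsilon:    # explore with probability epsilon
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])   # otherwise exploit

    def update(self, state, action, reward, next_state, terminal=False):
        # A terminal state (zero decoding window) has no future value.
        best_next = 0.0 if terminal else max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

Because the state components are small discrete integers, a dictionary keyed by (state, action) is sufficient and no function approximation is needed.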

IV. SIMULATION SETUP
In this section, we discuss the simulation setup used to evaluate our proposed scheme, LS-RLNC. We evaluate the performance of LS-RLNC by comparing it with rapidARQ [3] and traditional sliding-window RLNC (sliding RLNC). Sliding RLNC uses the same coding rate $R$ obtained from (1) and encodes packets across the entire $TCP_w$ instead of $E_w$. In contrast, rapidARQ uses a constant $E_w$ and relies on the selection of $d$ for better performance. We decide the best value of $d$ by exhaustive simulations.
We evaluated LS-RLNC in Mininet and Python 3. Mininet is a Python-based network emulation tool that creates a realistic virtual network, running real kernel, switch, and application code, on a single machine. Because it is Python-based, the RL and EECN modules, also written in Python, are called by the Mininet script as Python functions. We used a dumbbell topology similar to Fig. 1, where an application at the left leaf communicates with an application at the right leaf. The application transmits at a rate of 20 Mbps, while the bottleneck link's bandwidth is set to 10 Mbps to observe congestion. Congestion is generated by randomly injecting non-ECN traffic into the network. The edge router implements ECN-enabled queuing, where packets are marked when a certain queue threshold is reached. In case of overflow, the queue is flushed.
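The queuing behavior at the edge router can be pictured with the following simplified model, which is independent of Mininet. It assumes a single fixed marking threshold, marks only ECN-capable packets, and drops the arriving packet together with the flushed queue on overflow; these details, the dictionary-based packet representation, and the class name `EcnQueue` are assumptions rather than a description of the actual router configuration.

```python
from collections import deque


class EcnQueue:
    """Threshold-marking FIFO queue that is flushed on overflow."""

    def __init__(self, capacity, mark_threshold):
        self.capacity = capacity
        self.mark_threshold = mark_threshold
        self.buffer = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Add a packet (a dict); return False if the arrival caused an overflow."""
        if len(self.buffer) >= self.capacity:
            # Overflow: flush the queue; the arriving packet is dropped as well.
            self.dropped += len(self.buffer) + 1
            self.buffer.clear()
            return False
        if len(self.buffer) >= self.mark_threshold and packet.get("ect", False):
            packet["ce"] = True      # mark Congestion Experienced on ECN-capable traffic
        self.buffer.append(packet)
        return True

    def dequeue(self):
        return self.buffer.popleft() if self.buffer else None
```

Non-ECN background traffic (packets without the `ect` flag) is never marked in this model and can therefore only be dropped, which is how it drives the queue into congestion.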
We trained our agent in the aforementioned environment by running 4000 simulations. In each simulation, the application ran until 4000 data packets were delivered in order, with $p_e$ chosen uniformly at random from the given range. We categorized our simulations into four scenarios. All simulation scenarios follow the simulation parameters given in Table 1. Scenario 1 deals with different propagation delays, and its results are shown in Fig. 5. Scenario 2 looks at the effects of dynamic $p_e$ on the cumulative goodput, and its results are given in Fig. 6. Scenario 3 analyzes the behavior of LS-RLNC under a bursty $p_e$ model, and its results are given in Fig. 7. Scenario 4 is devised to observe the decoding complexity of LS-RLNC in terms of decoding window size, as well as the average goodput; its results are shown in Figs. 8 and 9.

Fig. 5 shows the performance of all tested schemes in terms of delay with uniform $p_e$. We tested the schemes for propagation delays of 50 ms, 100 ms, 200 ms, and 300 ms, one per subplot. The cumulative moving average (MA) of the in-order delivery delay incurred by each data packet is plotted. The MA is calculated over a window of 100 packets. Both LS-RLNC and rapidARQ showed lower in-order delay compared to sliding RLNC. However, LS-RLNC showed better performance than both, even though we selected $d$ for rapidARQ through exhaustive simulations. This is because the other schemes did not consider $IC$, $T^E$, and $D_w$. Early insight into the network congestion state helps reduce packet loss at the cost of a lower sending rate (explained later). The separation of $E_w$ from $TCP_w$ provides better protection to data packets and reduces the decoding delay. Recall from Sec. III how $T^E$ and $D_w$ affect the in-order delay. We reduced $T^E$ by limiting $f$, which relies on $E_w$. This led to a shorter $D_w$ in general (as we show later), resulting in a lower in-order delivery delay.
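The smoothing applied to the per-packet delays can be sketched as a moving average over the most recent 100 packets. How the first samples, for which fewer than 100 packets are available, are treated is not stated; the sketch below averages over whatever is available, which is an assumption, and the function name is hypothetical.

```python
from collections import deque


def moving_average(delays, window=100):
    """Moving average of per-packet delays over the last `window` packets."""
    averages = []
    recent = deque(maxlen=window)
    total = 0.0
    for delay in delays:
        if len(recent) == window:
            total -= recent[0]        # the oldest sample is about to be evicted
        recent.append(delay)
        total += delay
        averages.append(total / len(recent))
    return averages
```

Keeping a running total makes each update constant-time, which matters little for 4000 packets but keeps the computation cheap if it is performed online at the sender.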
We show the packet drop rate in each scenario as well. Packet drops occur mainly for two reasons: i) $p_e$ and ii) $IC$. LS-RLNC and rapidARQ showed good resilience to $IC$ compared to sliding RLNC. However, LS-RLNC showed even lower packet drops than rapidARQ. This is because i) LS-RLNC got an early indication of $IC$ (due to the use of the EECN [25] framework) and ii) our agent dynamically changed $E_w$ to provide better protection to recent data packets.

Fig. 6 shows the in-order cumulative goodput of each scheme. Cumulative goodput is measured as the amount of in-order data delivered divided by the elapsed time. $p_e$ was dynamically and randomly chosen from 1% to 15%. All the schemes experienced similar loss patterns and attempted to deliver 4000 data packets. The propagation delay in this case was 100 ms. The overall cumulative goodput of all schemes was similar. This is mainly because all schemes used a similar coding rate $R$ given in (1). We could still observe some differences: rapidARQ showed slightly better goodput and completed the transmission a little earlier. This is because LS-RLNC provides better resilience to $IC$ at the cost of a lower sending rate, which impacts the overall goodput. We tested the cumulative goodput for other propagation delays and found similar results. However, this holds only when a uniform $p_e$ model is used; as we will see later, LS-RLNC shows improved goodput under bursty error models.
In Fig. 7, we highlight the effects of a bursty $p_e$ model on each scheme. We adopted an on-off model similar to that of [3] for this purpose. In this model, there is an "on" period and an "off" period. During these periods, $p_e$ is given by

$$p_e^{on} > p_e \tag{14}$$

and

$$p_e^{off} = 0.$$
During an on-off cycle, the average $p_e$ is similar to that of a uniform channel. The burstiness of the channel is controlled by $T_{on}$ and $T_{off}$, the time periods for $p_e^{on}$ and $p_e^{off}$, respectively. As the level of burstiness increases, the average goodput is reduced. However, compared to the other schemes, LS-RLNC shows better goodput. We also tested the burstiness in relation to the packet drop rate and found similar results.

Fig. 8 shows the decoding complexity in terms of the average decoding window size. We chose a different $R$ for each $p_e$ in this case to meet the channel capacity ($1 - p_e$). We also chose the value of $d$ that generated the maximum average goodput through exhaustive simulations. LS-RLNC shows better decoding complexity by keeping a small decoding window. This is because the agent only allows $E_w$ to expand if it yields a better reward $r$. LS-RLNC has a lower $D_w$ mainly due to the separation of $E_w$ and $TCP_w$. $E_w$ generates a lower $T^E$ compared to $TCP_w$ because $f$ is reduced, which consequently lowers the decoding complexity. LS-RLNC reduces the decoding complexity by an average of 35% compared to rapidARQ and by an average of 78.5% compared to sliding RLNC.

Fig. 9 shows the average goodput observed in relation to the decoding complexity (Fig. 8) for each $p_e$ in scenario 4. rapidARQ has slightly higher goodput when $p_e$ is low, but as $p_e$ increases, LS-RLNC shows better goodput. This is the trade-off of LS-RLNC, where it improves the decoding complexity at the expense of goodput. However, as $p_e$ increases, this trade-off becomes minimal and LS-RLNC gives better decoding complexity as well as average goodput.
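The on-off loss model used for the bursty scenario can be sketched as follows. The sketch assumes that time is measured in packet slots and that the loss probability of the "on" period is chosen so that the cycle average equals the uniform loss rate exactly; the text only states that the two are similar. The function names are hypothetical.

```python
import random


def on_period_loss(p_e, t_on, t_off):
    """Loss probability of the on period that keeps the cycle-average loss at p_e."""
    p_on = p_e * (t_on + t_off) / t_on
    if p_on > 1.0:
        raise ValueError("average loss rate not reachable with this on/off split")
    return p_on


def simulate_losses(n_packets, p_e, t_on, t_off, rng):
    """Return one boolean per packet slot (True means the packet is lost)."""
    p_on = on_period_loss(p_e, t_on, t_off)
    cycle = t_on + t_off
    return [
        (slot % cycle) < t_on and rng.random() < p_on   # losses occur only while "on"
        for slot in range(n_packets)
    ]
```

Shortening the on period relative to the off period concentrates the same average number of losses into denser bursts, which is the regime in which an encoding window that adapts to recent losses is most useful.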

V. CONCLUSION
LS-RLNC is an adaptive sliding-window RLNC scheme for high bandwidth-delay product networks that uses reinforcement learning. LS-RLNC decouples the encoding window ($E_w$) from the TCP sliding window ($TCP_w$) and uses network and receiver feedback to optimize the value of $E_w$. LS-RLNC shows that a carefully designed RL scheme that dynamically evolves $E_w$ can achieve high goodput with low in-order delivery delay and reduced decoding complexity. Further, we show through simulations that LS-RLNC has better overall performance than state-of-the-art sliding RLNC schemes. LS-RLNC improves the goodput by 6-10%. In-order delivery delay is reduced by up to 11%, and decoding complexity is reduced by 28-45%. These improvements in performance are crucial to the utility of RLNC because in network coding there is always a trade-off between goodput, delay, and decoding complexity. The results show that LS-RLNC minimizes this trade-off effectively. The results also verify that in scenarios with a bursty error model, LS-RLNC shows better resilience and improved goodput compared to state-of-the-art schemes while maintaining low in-order delivery delay.