ELISE: A Reinforcement Learning Framework to Optimize the Slotframe Size of the TSCH Protocol in IoT Networks

The Internet of Things is shaping the next generation of cyber–physical systems to improve the future industry for smart cities. It has created novel and essential applications that require specific network performance to enhance the quality of services. Since network performance requirements are application-oriented, it is of paramount importance to provide tailored solutions that seamlessly manage the network resources and orchestrate the network to satisfy user requirements. In this article, we propose ELISE, a reinforcement learning (RL) framework to optimize the slotframe size of the time slotted channel hopping protocol in IIoT networks while considering the user requirements. We primarily address the problem of designing a framework that self-adapts to the optimal slotframe length that best suits the user's requirements. The framework takes care of all functionalities involved in the correct functioning of the network, while the RL agent instructs the framework with a set of actions to determine the optimal slotframe size each time the user requirements change. We evaluate the performance of ELISE through extensive analysis based on simulations and experimental evaluations on a testbed to demonstrate the efficiency of the proposed approach in adapting network resources at runtime to satisfy user requirements.

Cutting-edge technologies have created novel and essential applications for industrial operations such as smart cities, intelligent energy management, transportation, homes, and waste management.
At the hardware level, such applications are realized through electronic devices embedded with intelligent computing, communication systems, sensors, and actuators. The architecture of these devices, along with the need to make them flexible, cost-effective, and able to be embedded into even the smallest things, has led to the development of Wireless Sensor Network (WSN) technology. WSNs have been deployed in a wide variety of industrial applications [1]-[3]. The overall architecture, characteristics, and applications impose design constraints on their size and cost, resulting in strict resource limitations, including computation capabilities, energy, memory, and communication bandwidth. The resources of WSNs come at a significant cost; therefore, they need to be managed intelligently to ensure optimal performance for the longest possible period of time.
One of the most popular performance metrics for WSNs is energy consumption. A pioneering approach in reducing energy consumption is presented in [4], which proposes a cluster-based routing approach to extend the network lifetime. Building on this work, various research studies [5] propose new algorithms and architectures to find the best topology and routing strategies that minimize energy expenditure when transmitting packets from source to destination.
Alternatively, other approaches focus on minimizing the energy consumption of wireless sensor nodes by regularly turning off their radio chipsets [6]. However, the implementation of such sleeping periods requires strict scheduling algorithms to establish the transmitting and receiving times between neighboring nodes. Deterministic Media Access Control (MAC) protocols have been introduced to address these limitations. Scheduling algorithms orchestrate the transmissions and receptions for all sensor nodes in the WSN. Transmissions are scheduled in a way that allows non-interfering transmissions to occur simultaneously without colliding with each other. Moreover, these protocols take advantage of frequency diversity in their radios to increase network capacity and reduce interference from external devices.
The IEEE 802.15.4-TSCH standard defines the set of functionalities required to run link scheduling for low-rate WSNs [7], [8]. The scheduling algorithm, often referred to as the scheduler, generates cyclic schedules (referred to as slotframes) that determine the physical channel to use for transmission at each time point.
The schedule of the TSCH network directly affects its performance. Redundant links improve link reliability and latency, but also increase power consumption as receiving nodes wake up their radios more frequently. The slotframe size represents the length of the schedule and indicates how often the schedule repeats. A smaller slotframe size improves network reliability and latency as the links repeat more frequently, but at the cost of higher power consumption.
Conversely, a larger slotframe size minimizes network power consumption but compromises network reliability and latency. Therefore, the scheduling algorithm must consider the user requirements to tailor a schedule that meets the application's needs. Additionally, the scheduling should be flexible, adaptable, and reliable to accommodate dynamic changes in the environment and user requirements.

A. Contributions
In this article, we propose an open-source reinforcement learning framework named ELISE (https://github.com/fdojurado/SDWSN-controller.git) to optimize the slotframe size of TSCH networks, considering dynamic changes in user requirements. We address the problem of designing a framework that self-adapts the slotframe size of the TSCH schedule to the optimal length that best suits a set of user requirements. ELISE guides the network through a set of actions to determine the optimal slotframe size whenever user requirements change.
The main contributions of this article are as follows:
1) We develop a novel open-source framework that enables centralized network resource management and run-time reconfiguration of WSNs.
2) We develop a reinforcement learning solution that utilizes the ELISE framework to self-adapt the network's reliability, power efficiency, and delay based on user requirements.
3) We design a reward model based on a multi-objective cost function that facilitates the selection of the best network configuration to meet user requirements.
4) We evaluate the performance of ELISE through extensive analysis using simulations and experimental evaluations on a testbed.
The remainder of this article is organized as follows. Section II provides a technical background on key concepts in the framework and an overview of related research. Section III provides a detailed description of the framework components. Section IV explains the design of the reinforcement learning framework. Section V presents the experimental layout, the approximation model of the environment to expedite the learning process of the reinforcement learning framework, the training process, and the experimental evaluation. Finally, Section VI presents the main conclusions and potential areas for future research.

II. BACKGROUND AND RELATED WORK
In this section, we briefly introduce the network model and the core technologies used throughout this paper: TSCH and Software-Defined Wireless Sensor Networks (SDWSNs). We then discuss research works that have used these technologies to improve network performance.

A. Network model
We model the network as a directed graph G(V, E), where V is the set of nodes, and E is the set of physical links. A node ν_i ∈ V is the sender and/or receiver of network traffic. The number of network nodes is denoted with |V|. A link ϵ_{i,j} ∈ E is a full-duplex link that connects the two nodes ν_i and ν_j. Since the links are bi-directional, the link ϵ_{i,j} is equivalent to ϵ_{j,i}. The number of network links is denoted with |E|.
The network traffic is modeled with the concept of streams (also called flows), where a stream represents a data packet sent from one sender (talker) to one or multiple receivers (listeners). We denote the set of network streams as T. A stream τ_i ∈ T is characterized by the talker and the receiver. In ELISE, we limit the number of listeners for each stream to one, i.e., unicast communication. However, the model can be easily extended to support multicast streams by adding each sender-receiver pair as a stream.
The path for the stream τ_i is determined as an ordered sequence of directed links and denoted with r_i ∈ R, where |r_i| represents the number of links in the path. For example, the stream τ_1 ∈ T sent from the node ν_1 to the node ν_3 has the route r_1 = {ϵ_{1,2}, ϵ_{2,5}, ϵ_{5,3}}. Using the set of routes R, we define the function H : ν_i → ℕ, which takes the node ν_i as the input and returns the node rank, i.e., the number of links originating from the node ν_i, as the output. Fig. 1 presents an example of a TSCH schedule for a six-node network topology.
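The route notation above can be illustrated with a minimal Python sketch. It is not taken from the ELISE code base: the names `Link`, `is_valid_route`, and `rank` are hypothetical, and the sketch assumes that the rank H of a talker equals the hop count |r_i| of its route, which is one possible reading of the definition.

```python
from typing import Dict, List, Tuple

Link = Tuple[int, int]  # directed link (sender id, receiver id); hypothetical type


def is_valid_route(route: List[Link], talker: int, listener: int) -> bool:
    """Check that the links form one connected path from talker to listener."""
    if not route or route[0][0] != talker or route[-1][1] != listener:
        return False
    # Each link must start at the node where the previous link ended.
    return all(prev[1] == nxt[0] for prev, nxt in zip(route, route[1:]))


def rank(node: int, routes: Dict[int, List[Link]]) -> int:
    """Assumed reading of H: hop count |r_i| of the route whose talker is `node`."""
    return len(routes[node])


# Route r_1 from the text: node 1 -> node 2 -> node 5 -> node 3.
r1: List[Link] = [(1, 2), (2, 5), (5, 3)]
routes = {1: r1}
```

Under these assumptions, the route r_1 is a valid three-hop path, so the talker ν_1 has rank 3.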

B. SDWSNs
We model the software layer of the network using concepts from Software-Defined Networking (SDN) and Wireless Sensor Networks (WSNs), a combination that we call SDWSN. This approach has been devised as a potential pathway to solve the management and run-time reconfiguration complexities currently found in state-of-the-art WSNs. This new WSN architecture separates the control functions from the data functions, allowing the logically centralized controller to become reprogrammable and the WSN to be abstracted for applications and network services [9]. SDN separates the network into three planes: application, control, and data. The application plane hosts applications and programs that send information about the network requirements to the SDN controller. The control plane is a logically centralized entity that processes application requirements and sets up the network infrastructure resources to satisfy them. Lastly, the data plane is the network infrastructure with little intelligence that follows orders from the control plane. Readers interested in a thorough background, challenges, and benefits of SDWSNs can refer to [9]-[11].

C. Orchestra
Orchestra [12] is designed to run multiple stacked slotframes (static schedules) that repeat at different periods to ensure they do not interfere evenly. Each slotframe is allocated to a specific network plane that is defined by SDWSN. When multiple slotframes need the communication medium simultaneously, the scheduler selects the slotframe with the highest priority. In its default configuration, Orchestra runs three slotframes: i) Enhanced Beacon (EB), ii) unicast, and iii) default traffic slotframes. The EB slotframe provides a communication link from sensor nodes to their children to set the time source. The unicast slotframe contains links to every neighbor in the WSN. The default slotframe is used for traffic other than EB and unicast packets.

D. TSCH
TSCH is a globally synchronized network where traffic is transmitted based on a static cyclic schedule table called slotframe C that repeats with the period equal to the slotframe size |C| [13], [14]. In a TSCH network, the slotframe (schedule) is divided into equal-length timeslots as shown in Fig. 1. We denote each timeslot with c_j ∈ C and its length with |c| (same length for all timeslots). The timeslot length |c|, typically 10 ms, is long enough for the transmission and acknowledgment of frames.
Accordingly, the slotframe size |C| equals the duration of n timeslots, i.e., |C| = n × |c|. Within a slotframe, timeslots are counted with their subscripts j. Timeslots can also be counted from the moment the network booted, using the Absolute Slot Number (ASN). The ASN serves as a global clock, and it increases at every timeslot.
Each timeslot c_j is divided into a fixed number of cells. Each cell, denoted with c_j^k, is identified by its channel offset k. For channel hopping, a cell is used to find the physical channel on which to transmit. Considering an array of channel frequencies F to hop over, the channel frequency f for the cell c_j^k is calculated in Eq. (1).
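The timing and channel-hopping relations above can be sketched in a few lines of Python. The sketch assumes the standard IEEE 802.15.4 TSCH hopping rule f = F[(ASN + k) mod |F|]; the hopping sequence `F` and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

TIMESLOT_MS = 10  # typical timeslot length |c| quoted in the text


def slotframe_duration_ms(n: int, timeslot_ms: int = TIMESLOT_MS) -> int:
    """Slotframe size |C| = n x |c|, in milliseconds."""
    return n * timeslot_ms


def slot_offset(asn: int, n: int) -> int:
    """Timeslot index j inside the slotframe for a given ASN."""
    return asn % n


def channel(asn: int, k: int, hopping: Sequence[int]) -> int:
    """Assumed standard TSCH rule: f = F[(ASN + k) mod |F|]."""
    return hopping[(asn + k) % len(hopping)]


def channels_of_cell(j: int, k: int, n: int,
                     hopping: Sequence[int], repeats: int) -> List[int]:
    """Channels used by the cell (slot offset j, channel offset k)
    over `repeats` consecutive slotframes of n timeslots."""
    return [channel(j + m * n, k, hopping) for m in range(repeats)]


F = [15, 25, 26, 20]  # hypothetical hopping sequence of four channels
used = channels_of_cell(j=2, k=1, n=5, hopping=F, repeats=4)
```

With n = 5 timeslots and four channels, the same cell lands on a different channel in each of four consecutive slotframes. Under the assumed rule, this holds whenever n is not a multiple of |F|, and the cell cycles through every channel when n and |F| are mutually prime.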
As a result, each cell will select a different physical channel at consecutive slotframes. The scheduler oversees the role of cells in the slotframe. The roles can be shared, dedicated, or empty cells. Shared cells (light blue cells in the example) are contention-based. They are used by multiple transmitters, increasing the probability of interference as they can transmit simultaneously. Reliable transmissions resend frames using a back-off window when no acknowledgment is received. Dedicated cells (cells labeled with a pair of nodes) are contention-free. They are allocated carefully not to cause interference issues with other cells. Retransmissions can occur due to external interference or bad radio link quality [15].
The example in Fig. 1 illustrates a TSCH schedule that has been designed for a six-node WSN. The slotframe size is five timeslots (n = 5). There are six cells with a nonempty role: one shared cell, shown with c_4^2, and five dedicated cells that permit the communication of nodes with their parents. There are also two non-interfering transmissions at the same slot offset (c_2^2, c_2^3).

Ref. | Year | TSCH | SDWSNs | RL | User req. | Major contributions
[12] | 2015 | ✓ | ✗ | ✗ | ✗ | Autonomous scheduler for TSCH without control overhead that does not rely on centralized or distributed entities.
[16] | 2017 | ✗ | ✓ | ✗ | ✓ | A decision-making approach to select the routing protocol that best suits the application requirements to get optimal performance using supervised learning.
[17] | 2020 | ✓ | ✓ | ✗ | ✗ | An SDN-based network architecture that provides support to mobile nodes using TSCH. Mobile nodes have one uplink and one downlink to every node in the network regardless of their position in the WSN.
[18] | 2020 | ✗ | ✓ | ✗ | ✗ | An SDN-based approach to pinpoint mobile nodes in WSNs. The approach features a mobility detector and a k-means cluster algorithm to decouple static from mobile nodes.
[15] | 2020 | ✓ | ✗ | ✗ | ✗ | A low-latency distributed scheduling function to optimize the end-to-end delay and reliability.
[ whether the given article considers user requirements, and major contributions. Research works [17], [18] strive to improve the network performance using SDWSNs. The centralized architecture of SDWSNs allows the control plane to build a global view of the network that permits it to make better decisions, in this case, to provide support for mobile sensor nodes. The research work in [17] uses a static TSCH schedule with redundant links for mobile nodes to improve reliability, whereas [18] uses a k-means clustering approach to separate mobile from static nodes. Articles [19], [20] oversee the IoT network through the centralized controller, which enables the collection of observations to design, implement, and train a learning agent to maximize the accumulative reward of actions taken. The work in [19] focuses on monitoring the SDWSN traffic at granularity levels to mitigate flow-table overflows, while [20] uses a learning agent to find the optimal forwarding paths for the SDWSN. Articles [12], [15] aim to improve the performance of TSCH networks. The objective of [15] is to reduce the packet latency. This is achieved by dividing the slotframe into small chunks. Sensor nodes select the chunk in which to transmit based on their distance to the border router to minimize the latency. The work in [12], introduced above, is an autonomous scheduler with little overhead that provides high reliability. The main objective of [16] is to select, among a set of routing algorithms, the one that best suits the given user requirements. The selection of the routing algorithm is achieved using a supervised learning approach. The research work in [21] presents an RL approach to configure the Carrier-Sense Multiple Access/Collision Avoidance (CSMA/CA) parameters in a multi-hop TSCH network. The framework utilizes neural networks to converge efficiently to configurations that satisfy the Quality of Service (QoS).
Overall, these studies aim to enhance the QoS in network applications. Performance metrics such as energy, delay, and reliability are considered, but there has been limited attention given to real user needs. It is crucial to provide a customized engineering solution that seamlessly manages network resources and orchestrates the network to meet the specific requirements of applications or users. However, as observed from the table, this aspect has not received significant focus. While the work presented in [16] takes user requirements into account in their proposed approach, their supervised learning method encounters challenges in predicting the optimal routing algorithm in dynamic environments like WSNs. Furthermore, they do not consider the MAC layer, which is responsible for the duty cycle of sensor nodes in state-of-the-art WSNs, directly impacting power consumption, delay, and reliability.

III. ELISE FRAMEWORK ARCHITECTURE OVERVIEW
This section presents an overview of the ELISE functional framework and architecture. Next, a detailed description of its components, such as the layered architecture planes, is presented.
A high-level overview of the ELISE framework architectural structure, including its main components and interfaces, is shown in Fig. 2. The architectural structure follows the typical three-tier SDN principles for WSNs. The description of each layer of the architecture from the bottom up is as follows.

A. Data plane
The data plane, also called the infrastructure plane, is built upon the interconnection of multiple Networked Embedded Systems (NES). NES, also named sensor nodes or network nodes, are resource-constrained devices embedded with a processing unit, a memory unit, a communication transceiver, and some form of power supply. They are mainly programmed for a specific task, such as monitoring a physical variable, and are often deployed in harsh environments. In ELISE, we have defined two types of NES: the (regular) sensor node and the sink. All sensor nodes, including the sink, communicate with each other using IEEE 802.15.4 radios; however, the sink is also directly connected to the control plane through a wired interface. All NES are nodes of the network graph, denoted with ν ∈ V, and all radio communication links are denoted with ϵ ∈ E (see Sect. II-A for more information). In addition, we assume that sensor nodes are the talkers of network streams and all streams have the same listener node, i.e., the sink or the controller node.
Overall, the entire network infrastructure runs on a lightweight embedded operating system. Among the available embedded operating systems in the market [9], we have selected Contiki-NG [22] because 1) it is open-source, well documented, and has a large community; 2) it is widely used in the research community; 3) it provides implementations of TSCH and Orchestra; and 4) it can run on both the Cooja network simulator [23] and real hardware. To comply with the SDWSN principles of making the network infrastructure run simple tasks and removing energy-intensive functions from sensor nodes, we have redesigned the protocol stack, from layer three upward, to support the following five functionalities.
1) Data packets: This packet encapsulates the collected data and sends them to the control plane. The packet format is shown in Fig. 3. The cycle sequence and sequence fields are used by the reinforcement learning algorithm to keep track of the number of packets received in the corresponding cycle. Temperature, humidity, and light are physical variables measured by sensors (this can be generalized to variables one, two, and three). The ASN field contains the ASN when the packet was created. This field is useful to calculate the packet latency under specific network configurations such as routes, TSCH schedules, etc.
2) Neighbor discovery (ND): This packet discovers other sensor devices in the sender's transmission range. It also allows discovering neighbors with paths to the controller. This packet contains three fields: rank, Received Signal Strength Indicator (RSSI), and checksum. The rank field is equivalent to H, which is the number of links originating from the talker node (see Sect. II-A). The RSSI field specifies the accumulative RSSI to the controller. This field permits the receiver node to decide which parent to choose between two candidates with equal rank values. Lastly, the checksum field provides error checking of the packet integrity.
3) Neighbor advertisement (NA): This packet lets sensor nodes report their status and their neighbors to the controller, including the average power consumption, rank, and links to neighbors. The format of the packet header is shown in Fig. 4. The payload length field states the number of bytes contained in the packet payload. The sender rank specifies the rank of the sender. The sender power field contains the power consumption of the sender. Cycle sequence and sequence fields fulfill the same function as in the data packet. The CRC field provides error checking of the packet integrity. The payload consists of neighbors' addresses, RSSI, and Link Quality (LQ) values.
4) Network configuration - TSCH schedules: This packet type is a control message to establish the TSCH schedules for the incoming cycle. The header has four fields: payload length, slotframe size |C|, sequence, and CRC. The payload length, sequence, and CRC fields fulfill the same functions mentioned above. The slotframe size field contains the length of the schedules encapsulated in the payload. The format of a TSCH link embedded in the payload of this control packet is shown in Fig. 5. The type field states the type of TSCH link: transmit (Tx) or listen (Rx). The channel and time offset fields specify the coordinates of the given link. The source address field indicates the address of the sensor node that needs to process this link. Lastly, the destination address is used for Tx link types to set the neighbor address.
5) Network configuration - Routes: This packet type is also a control message, used to establish the forwarding paths for the incoming cycle. The packet header consists of payload length, sequence, and CRC fields, which fulfill the same functions mentioned above. The packet payload contains the source, destination, and neighbor addresses to build the forwarding paths.
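As an illustration of how the fields of the data packet (item 1 above) can be used on the controller side, the following minimal Python sketch computes the packet latency from the ASN field and counts the packets received in a cycle. The class and function names, the field types, and the default 10 ms timeslot are assumptions made for this sketch; the actual field widths and encoding are defined by the packet format in Fig. 3.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataPacket:
    """Hypothetical in-memory view of the data packet fields."""
    cycle_seq: int      # cycle sequence field
    seq: int            # sequence field
    temperature: float
    humidity: float
    light: float
    asn: int            # ASN when the packet was created


def latency_ms(packet: DataPacket, asn_at_reception: int,
               timeslot_ms: int = 10) -> int:
    """Latency = elapsed timeslots between creation and reception x |c|."""
    if asn_at_reception < packet.asn:
        raise ValueError("reception ASN precedes creation ASN")
    return (asn_at_reception - packet.asn) * timeslot_ms


def packets_in_cycle(packets: Iterable[DataPacket], cycle_seq: int) -> int:
    """Number of distinct sequence numbers received in the given cycle."""
    return len({p.seq for p in packets if p.cycle_seq == cycle_seq})


pkt_a = DataPacket(cycle_seq=3, seq=7, temperature=21.5,
                   humidity=40.0, light=300.0, asn=1000)
pkt_b = DataPacket(cycle_seq=3, seq=8, temperature=21.6,
                   humidity=40.1, light=298.0, asn=1005)
pkt_c = DataPacket(cycle_seq=4, seq=1, temperature=21.7,
                   humidity=40.2, light=295.0, asn=1500)
```

For example, a packet created at ASN 1000 and received at ASN 1012 has spent 12 timeslots in the network, i.e., 120 ms with 10 ms timeslots.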
Control packets, including TSCH schedule and route packets, are broadcast into the WSN. ELISE takes advantage of the implementation of TSCH and Orchestra in Contiki-NG to devise a novel approach that enables run-time reconfiguration and distribution of TSCH schedules and route packets. ELISE defines four slotframes, inspired by Orchestra, for specific traffic planes: the EB slotframe for the time source, the control traffic slotframe for control packets, the data traffic slotframe for data packets, and the default traffic slotframe for any other type of traffic. The control traffic slotframe is a broadcast slotframe that permits the transmission and reception of new TSCH schedules and route configurations.
For cell c_j^k, the timeslot number j and channel offset k for sensor node ν_i and a control plane slotframe size of |C| are calculated as follows.
These two equations minimize communication interference between sensor nodes with equal rank values that try to broadcast control packets to their children.

B. Control plane
The network intelligence resides in this plane. As seen in Fig. 2, the control plane hosts multiple modules for the correct functioning of the framework. The control plane can run locally on a computer or in the cloud. At its core, it is implemented in Python 3 [24] because Python supports many machine learning libraries, has a large community, and offers an ample library collection. Here, we briefly introduce the core functionalities of each module.
1) Network information and statistics: This module holds all network information collected. It also includes all packets received and simple statistics.
2) Resource manager: This module is in charge of orchestrating all resources in the control plane. Resources in this module include database, serial interface, and network access.
3) Network manager: This module holds key functions to correctly operate the data plane. Functions such as writing to and reading from the network reside here.
4) Route manager: This module hosts all functionalities to build the forwarding paths of the network. It hosts multiple traditional routing algorithms. It is also flexible enough to allow adding new centralized routing protocols.
5) TSCH manager: This module contains the functions to correctly build TSCH schedules. It has been designed so that new TSCH schedulers can easily be created on top of it.
6) Reinforcement learning: The reinforcement learning module is the main intelligent component of the entire control plane. It uses all other modules to collect data, learn from the environment, tune hyper-parameters, and evaluate the trained agent. This module is discussed in detail in the next section.

C. Application plane
The application plane hosts user requirements and programs such as real-time monitoring tools that convey information regarding the status of the network.This plane instructs the control plane on the current user requirements through the northbound API.

IV. REINFORCEMENT LEARNING MODULE
This section takes the reader through the design and implementation of a reinforcement learning approach to optimize the slotframe size of TSCH in SDN-based IoT networks considering the user requirements. Readers interested in the background of RL can refer to [25]; its applications to SDWSNs can be found in [9].

A. Solution approach
To solve this multi-objective problem of self-adapting the slotframe size of the TSCH schedule to the optimal length that best suits a set of user requirements, we use a reinforcement learning agent that hosts neural networks at its core, as shown in Fig. 6. ELISE provides flexibility to evaluate different reinforcement learning algorithms. Algorithms that we consider in this research are Deep Q Network (DQN) [19], Advantage Actor Critic (A2C) [26], and Proximal Policy Optimization (PPO) [27].
The proposed solution follows the typical three-tier principles for SDWSNs, where the control plane layer collects data, orchestrates resources, performs intelligent calculations, and deploys new network configurations into sensor nodes. At the initial state of the data plane, sensor nodes discover their path to the controller using ND packets. They then start sending NA packets to the controller. The controller processes these packets to make future decisions. The reinforcement learning agent predicts the next slotframe size. It then prepares the TSCH schedules and routes using the TSCH and route manager modules. Finally, the control plane deploys new configurations through the network manager module.
ELISE develops a Markov Decision Process (MDP) framework with the architecture depicted in Fig. 6. This framework enables the reinforcement learning algorithm to dynamically select and deploy optimal actions based on the observations to maximize the average accumulative reward. The MDP is represented by a tuple <S, A, R>, where S represents the state space, A represents the action space, and R represents the immediate reward.
• State Space: As previously discussed, there are three performance metrics and three user requirements at the end of each iteration. However, the learning time can be reduced by adding the last scheduled link in the TSCH slotframe (λ) and the current slotframe size (|C|). λ enables the agent to avoid slotframe sizes that are below the last scheduled link in the scheduler, which would otherwise alter the normal behavior of the TSCH network. The state space of the proposed work is defined as follows.
where Ω represents the cost of the SDWSN, and α, β, and γ are the user-defined coefficients for power consumption, delay, and reliability, respectively.
• Action Space: The RL agent aims to find the optimal slotframe size of the data traffic plane given the set of user requirements. At every decision-making point, the agent predicts the next slotframe size given the abovementioned observations. The agent can take multiple consecutive actions (slotframe sizes) before reaching the optimal solution. The number of steps taken to reach the optimal solution depends on the current state of the environment, especially the current slotframe size (|C|) and the user requirements (α, β, γ).
The agent selects the next action among 1) increasing |C|, 2) decreasing |C|, or 3) continuing to use the current |C|. The selection of the slotframe size is restricted to sizes that are mutually prime with the sizes of the other slotframes in the TSCH network. Recall that the TSCH network runs multiple stacked slotframes that repeat at different periods (mutually prime slotframe sizes) to ensure they do not interfere evenly. Therefore, the action space for selecting the next slotframe size in the control plane is defined as follows.
where gcd represents the greatest common divisor, and |C|_EB, |C|_CP, and |C|_DF are the slotframe sizes of the EB, control, and default traffic planes.
• Immediate reward function: The reward function has been designed to select slotframe sizes within a valid range and to ease learning. Whenever the agent selects a slotframe size below λ, it is penalized. Likewise, whenever the agent selects a slotframe size that goes beyond the maximum valid slotframe size (µ), it is also penalized.
The agent learns to select the next action within this valid range while maximizing the accumulative reward.
The agent is positively rewarded if the slotframe size lies within the valid range. The amount rewarded depends on the performance metrics and user requirements. The reward function is expressed as follows.
where |C|_DP is the slotframe size of the data traffic slotframe, and G_max is the maximum penalty for taking an invalid slotframe size. Υ is a constant that ensures the immediate reward always stays positive. Υ is equal to the worst case of Ω; therefore, Υ = 2. It is noteworthy that we changed the signs in (7) to maximize the immediate reward function. We have also defined two terminating conditions for episodes: an episode ends either when the agent selects a slotframe size outside the valid range or when the maximum number of timesteps is reached.
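The action-space constraint and the reward shaping described above can be summarized in a minimal Python sketch. It is an interpretation under stated assumptions and not the ELISE implementation: the function names are hypothetical, the slotframe sizes 397, 31, and 17 used for the EB, control, and default planes are illustrative values only, an invalid size is assumed to yield a reward of -G_max, and a valid size is assumed to yield Υ - Ω.

```python
from math import gcd
from typing import Iterable, List


def valid_slotframe_sizes(lower: int, upper: int,
                          other_sizes: Iterable[int]) -> List[int]:
    """Sizes in [lower, upper] that are mutually prime with every other slotframe size."""
    others = list(other_sizes)
    return [size for size in range(lower, upper + 1)
            if all(gcd(size, other) == 1 for other in others)]


def immediate_reward(size_dp: int, last_link: int, max_size: int,
                     cost: float, g_max: float, upsilon: float = 2.0) -> float:
    """Assumed reward shape: penalty outside [lambda, mu], Upsilon - Omega inside.

    size_dp   -- data traffic slotframe size |C|_DP chosen by the agent
    last_link -- last scheduled link in the slotframe (lambda)
    max_size  -- maximum valid slotframe size (mu)
    cost      -- multi-objective cost Omega of the resulting configuration
    g_max     -- maximum penalty for an invalid slotframe size
    """
    if size_dp < last_link or size_dp > max_size:
        return -g_max
    return upsilon - cost


def episode_done(size_dp: int, last_link: int, max_size: int,
                 step: int, max_steps: int) -> bool:
    """Episodes end on an invalid slotframe size or when the step budget is spent."""
    return size_dp < last_link or size_dp > max_size or step >= max_steps


# Illustrative sizes for the EB, control, and default slotframes (assumed values).
candidates = valid_slotframe_sizes(10, 20, other_sizes=[397, 31, 17])
```

With these illustrative values, every size between 10 and 20 except 17 is a valid candidate, since 17 shares a common divisor with the assumed default slotframe size.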

B. Cost Function
As discussed in previous sections, it is vital to take into consideration the dynamic changes in user requirements to design a tailored engineering solution that self-adapts to these changes. This approach allows the framework to self-reconfigure the infrastructure resources so that they comply with the current state of user requirements while maximizing network performance. This reinforcement learning-based model optimizes the slotframe size of the data plane slotframe introduced in the above sections. Other slotframes are not considered in the optimization because they are control slotframes that are mandatory for the basic operation of the network, and tampering with them can make the network fail.
We now design the overall objective cost function as a combination of several individual objective metrics, capturing the power consumption, stream delay, network reliability, and network dependability, forming a multi-objective cost function. Since our main objective is to maximize the network performance given a set of user requirements, we design a weight-based multi-objective function. Specifically, the weights of this function are the set of user requirements. This multi-objective cost function can be expressed as follows.
where ω_1 captures the cost of scheduling and ω_2 captures the cost of routing, both presented in Sect. IV-C. For the cost of scheduling, α, β, and γ are the user requirements for power consumption, delay, and reliability, respectively (see Sect. IV-C for more information). They are set by the users, and they sum to one. Since we aim at minimizing the cost function, the inverse of the reliability is used, which helps to find the maximum reliability cases. ELISE calculates the performance metrics at the end of every cycle; therefore, the samples within the time interval are considered a constraint (they are strictly greater than zero). This holds for all performance metric samples. Similarly, the slot duration of TSCH networks cannot be less than or equal to zero. It is worth mentioning that the optimization is done over the slotframe size of the data plane slotframe of the TSCH protocol. Therefore, the SDN control plane takes several sequential actions at each network state to find the optimal slotframe size given the user requirements and current network state.
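To make the role of the user weights concrete, the following minimal Python sketch evaluates a weighted scheduling cost. It is an assumption-laden illustration rather than the paper's exact formula: it assumes that all three metrics are normalized to [0, 1], that the "inverse of the reliability" is taken as 1 - R, and that ω_1 = α·P + β·D + γ·(1 - R). The function name `scheduling_cost` is hypothetical.

```python
from math import isclose


def scheduling_cost(power: float, delay: float, reliability: float,
                    alpha: float, beta: float, gamma: float) -> float:
    """Assumed form of omega_1: alpha*P + beta*D + gamma*(1 - R).

    All metrics are assumed to be normalized to [0, 1]; the user weights
    alpha, beta, and gamma must sum to one, as stated in the text.
    """
    if not isclose(alpha + beta + gamma, 1.0, abs_tol=1e-9):
        raise ValueError("user weights must sum to one")
    for name, value in (("power", power), ("delay", delay),
                        ("reliability", reliability)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return alpha * power + beta * delay + gamma * (1.0 - reliability)


# A power-focused user profile evaluated on one hypothetical cycle.
example = scheduling_cost(power=0.4, delay=0.2, reliability=0.9,
                          alpha=0.5, beta=0.3, gamma=0.2)
```

Under these assumptions, the worst case (maximum power and delay, zero reliability) gives a scheduling cost of one regardless of the weights, which is consistent with a bounded overall cost.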

C. Objectives
This section gives the details of the terms of the cost function Ω that captures the network performance. The performance metrics are as follows: power consumption, delay, reliability, and dependability. The first three metrics are used in the cost term $\omega_1$, and the dependability is defined with $\omega_2$.
1) Power consumption: The power consumption is collected from each network node in the WSN. It mainly depends on the energy spent to transmit and receive a packet. In TSCH networks, nodes wake up their radios at specific timeslots to either transmit a packet or listen to the wireless medium, then switch to another state such as low power mode. In ELISE, we assume three states for a network node: i) listening, ii) transmitting, and iii) listening and transmitting (forwarding). We ignore the idle state of the nodes since we aim at minimizing the workload of the network nodes.
The power consumed for transmitting the network stream $\tau_i \in T$ from the talker to the listener, denoted by $P_i$, is calculated from Eq. (8). The consumed power is normalized in the range [0, 1].
where $V_i$ is the power supply voltage of the node $\nu_i$ (in volts), $I^i_{tx}$ is the transmission current consumption of the node $\nu_i$, and $I^i_{rx}$ is the reception current consumption of the node $\nu_i$ (both in mA). The maximum value of the current consumption while transmitting and receiving is captured by $I_{max}$. The average power consumption of the network P is defined as the average of the power consumed for transmitting all network streams. The controller receives the NA packets from network nodes, processes the data in the sender power fields of the packets, and stores them in the database. When a cycle finishes, the controller retrieves the latest power consumption from the database and calculates the network power consumption cost P using Eq. (9).
where P is the average power consumption in the network, $\sigma_P$ is the normal distribution of the node power consumption in the network, and $\theta_1$ is the weight of the node power consumption distribution. A larger $\theta_1$ value drives the search toward a solution with evenly distributed power consumption across the nodes. With $\theta_1$, the user achieves the desired distribution based on the requirements.
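As an illustration of how such a weight can penalize uneven consumption, the sketch below assumes an additive form (the mean plus $\theta_1$ times the spread of the samples) and interprets $\sigma_P$ as a population standard deviation. Both choices, and the name `dispersion_weighted_cost`, are assumptions made for this example only.

```python
from statistics import mean, pstdev


def dispersion_weighted_cost(samples, theta):
    """Mean of the samples plus theta times their spread (assumed form).

    samples: normalized per-node power values, strictly greater than zero.
    theta: weight of the dispersion term; 0 disables it entirely.
    """
    if not samples or any(s <= 0 for s in samples):
        raise ValueError("samples must be non-empty and strictly positive")
    return mean(samples) + theta * pstdev(samples)


# Two made-up networks with the same average power but different spread.
even = dispersion_weighted_cost([0.3, 0.3], theta=1.0)
uneven = dispersion_weighted_cost([0.2, 0.4], theta=1.0)
```

With a positive weight the uneven network costs more although both have the same mean, and with a weight of zero the two costs coincide.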
2) Delay: The packet delay is calculated as the interval from when the packet is generated at the talker to when the packet is received by the listener in the control plane. For a TSCH network, we can estimate the packet delay of the stream $\tau_i$ using the ASN as follows. The packet delay value is normalized in the range of [0, 1].
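A minimal sketch of an ASN-based delay estimate follows. It assumes that the delay is the number of elapsed timeslots multiplied by the slot duration, and it uses a 10 ms slot as an assumed default; the normalization bound `max_delay_s` is a hypothetical parameter, since the bound used by ELISE is not stated here.

```python
SLOT_DURATION_S = 0.010  # assumed timeslot length of 10 ms


def packet_delay_s(asn_tx, asn_rx, slot_duration_s=SLOT_DURATION_S):
    """Delay in seconds between the ASN at generation and at reception."""
    if asn_rx < asn_tx:
        raise ValueError("reception ASN cannot precede generation ASN")
    if slot_duration_s <= 0:
        raise ValueError("slot duration must be strictly positive")
    return (asn_rx - asn_tx) * slot_duration_s


def normalize(value, max_value):
    """Scale a value into [0, 1] given a hypothetical upper bound."""
    return min(value / max_value, 1.0)


# A packet generated at ASN 1000 and received 22 timeslots later.
delay = packet_delay_s(1000, 1022)
normalized_delay = normalize(delay, max_delay_s := 1.0)
```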
The control plane then calculates the average delay of the network D using Eq. (11) at the end of each cycle. The controller retrieves from the database the latest D for all network streams $\tau_i \in T$.
where $\sigma_D$ captures the normal distribution of stream delays. Using $\theta_2$, the user can adjust the weight of the delay distribution in calculating the average delay. A larger weight drives the search toward a solution with evenly distributed stream delays. As with $\theta_1$, the user achieves the desired distribution by setting $\theta_2$.
3) Reliability: To calculate the reliability, we first define the Packet Delivery Ratio (PDR) during a cycle in Eq. (12). The control plane performs the calculation of the PDR as follows.
where $\tau_{rx}$ and $\tau_{tx}$ are the numbers of received and transmitted packets, respectively. The control plane queries the database to obtain the latest PDR values in the network and calculates the network reliability R at the end of every cycle as follows.

$$R = \frac{1}{PDR} \quad \text{subject to } PDR > 0. \tag{13}$$

4) Dependability: We define the dependability metric $\omega_2$ as the cost function for finding the best path for the streams in the network. The dependability metric is based on a simple concept, i.e., the network is more dependable when fewer nodes are involved in the network functioning (stream transmission). Based on this concept, the most dependable network is the one where all streams use the shortest path for transmission.
To this end, we define the power dependability $P_{DEP}$ of the node $\nu_i \in V$ in Eq. (14). The power dependability of a node shows how much workload the node is under, i.e., transmitting and receiving data. The more workload a node is under, the less dependable the node is.
The power dependability of the node represents the workload of the node, which is used for determining the most dependable path for a specific stream. Thus, we calculate the dependability of the stream $\tau_j \in T$ as follows.
With the dependability of the streams in the network, we calculate the average dependability of the network $\omega_2$ using Eq. (16).

V. PERFORMANCE EVALUATION
This section tests ELISE's ability to self-adapt the network resources to the configuration that best satisfies the dynamic user requirements. Multiple user requirements are put in place to get insights into the impacts on the performance metrics. We compare the performance of ELISE against Orchestra, which is the TSCH scheduler of choice for IoT networks. Experiments are conducted in both the Contiki Cooja network simulator and a real-world testbed; however, simulation results that have confirmed the same findings as the testbed experiments are omitted due to space constraints.

A. Testbed setup
The experiments are conducted on the premises of the FIT IoT Lab [28]. This testbed, which has six different sites across France, offers facilities with hundreds of wireless sensor nodes that allow the evaluation and experimentation of large-scale WSNs. Supported processor architectures include MSP430, STM32, and Cortex-A8, together with 802.15.4 radio chips running at 800 MHz or 2.4 GHz. The testbed also provides a CLI tool to manage resources and experiments, and it supports a range of embedded operating systems.
We specifically built a 10-sensor-node network (|V| = 10) with a maximum depth of three hops. The topology is shown in Fig. 7. We keep the network small to avoid overcomplicating the experiments and to let us draw conclusions from a proof of concept of the framework. We use the IoT-LAB M3 platform for the nodes and the sink. This platform embeds an ARM Cortex-M3 microcontroller (32-bit CPU at 72 MHz), an ATMEL radio running at 2.4 GHz (designed for the IEEE 802.15.4 standard), and four sensors (light, pressure and temperature, accelerometer, and gyroscope). The network runs the Contiki-NG operating system. As mentioned earlier, we have redesigned the Contiki-NG network stack to support the SDWSN functionalities described in Sections III-A and IV, including support for all Orchestra-based slotframes.

The control plane communicates with the data plane, specifically the sink node, through the Secure Shell Protocol (SSH). The control plane collects data from and transmits commands to the WSN using this protocol. It is written in Python 3.10, uses Stable Baselines 3 [29] as the RL package, and implements all functionalities previously described in Sections III-B and IV. It runs on a remote computer with macOS Big Sur and an eight-core i9 processor at 2.3 GHz.

The experiment parameters used throughout the results are summarized in Table II. There is a trade-off in choosing the iteration window interval. A large window interval filters out noise in the collected data; however, the control plane may not be able to react to dynamic changes in the network. On the other hand, if the interval is small, the number of network reconfigurations increases and so does the noise. This is an open research question in SDWSNs [16]. In our experiments we set this value to 60 control packets (|T| = 60); however, this value can be changed to meet user requirements. Although $\theta_1$ and $\theta_2$ in the objective metrics (see Sect. IV-C) make the desired distribution of power and delay achievable, the evaluation considers a fixed path for streams; we therefore set both $\theta_1$ and $\theta_2$ to zero, which removes the effect of the power and delay distributions from the search. In future work, ELISE will determine the path for streams.
Before we jump into the evaluation of ELISE, we want to discuss in the next section the challenges found while training the reinforcement learning agent.

B. SDWSN model approximation
The training of the learning model involves taking the collected experience to adjust the weights of the deep neural network. The collected experience is a group of state-action pairs, and a new pair is formed at the end of every action taken. The framework selects and deploys an action. It then waits for the cycle to finish to obtain the observations given that action. The time to complete an iteration impacts the total training time of the model. Training the model in the testbed is not feasible, as an iteration can take a couple of minutes to complete depending on the frequency of control packets (NA packets) and the window interval.
Although the Cooja network simulator reduces the iteration time to tens of seconds, this is not enough to train the model in a reasonable amount of time. The total training time can be estimated by iter * ts, where iter is the average time of an iteration and ts is the total number of timesteps. For example, if Cooja takes 30 seconds to complete an iteration, then it can take up to five weeks to train for 100k timesteps. In a real-world deployment, an iteration can take three minutes to complete, depending on the window interval; therefore, it can take up to 3.4 months to train for 50k timesteps. To solve this issue, we mathematically model the TSCH network as a function of the slotframe size for the data plane traffic. This mathematical model allows us to estimate the network cost given a slotframe size. These performance metrics are needed for the immediate reward calculation previously discussed in Section IV-A. This approach reduces the time of an iteration significantly. The approximation model is also useful for hyperparameter tuning, which searches for the best model architecture by creating multiple scenarios with different hyperparameters using pruning strategies.
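The two training-time estimates above follow directly from iter * ts. The short sketch below reproduces the arithmetic; the function name `training_time_days` is chosen for illustration only.

```python
SECONDS_PER_DAY = 86_400


def training_time_days(seconds_per_iteration, timesteps):
    """Total training time (iter * ts) expressed in days."""
    return seconds_per_iteration * timesteps / SECONDS_PER_DAY


# Cooja: 30 s per iteration for 100k timesteps, about 34.7 days (five weeks).
cooja_days = training_time_days(30, 100_000)

# Testbed: 3 min per iteration for 50k timesteps, about 104.2 days (3.4 months).
testbed_days = training_time_days(3 * 60, 50_000)
```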
The main objective of the SDWSN approximation model is to estimate the values of the network average power consumption P, the average network delay D, and the network reliability R when changing the slotframe size |C|. It facilitates the calculation of the immediate reward of an action taken. These values are estimated using the minimum mean square error (MMSE) estimator ($E = \sum_{j=0}^{k} |p(x_j) - y_j|^2$). To obtain them as a function of the slotframe size, we program a simple task in the control plane. It builds and sends TSCH schedules with different slotframe sizes using the TSCH scheduler and network manager of the proposed architecture (see Fig. 2). The controller continuously selects and sends a slotframe size |C| from the set of slotframe sizes that are coprime with the sizes of the other slotframes. We then plot the values for P, D, and R using the 95% confidence interval. We then find the coefficient vectors (ζ) that minimize the squared error using polynomials of degree four, three, and one for P, D, and R, respectively. We use the testbed setup shown in Fig. 7 and the experiment parameters in Table II.

The charts in Fig. 8 show the estimated normalized values of the network average power consumption (P), delay (D), and reliability (R) against the slotframe size (|C|). The P metric decreases exponentially as |C| increases, as shown in Fig. 8a. This is expected: when we increase the slotframe size, we increase the number of unused timeslots in the slotframe, which reduces the average power consumption in sensor nodes. Fig. 8b shows that D increases linearly with |C|. This is also expected since the time between links in consecutive slotframes increases; therefore, packets wait longer in the queue to be transmitted. Fig. 8c shows that R decreases linearly with |C|, but at a smaller rate. Links in TSCH networks are very reliable due to their time and frequency diversity; however, reliability can decay when using a large slotframe size due to the increasing waiting time for retransmissions and congested packet queues. We now use this approximation model to train the DQN, A2C, and PPO algorithms and put them under test in the testbed.
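The polynomial fitting step described above can be sketched in plain Python as follows. This is an illustrative least-squares fit that solves the normal equations directly; it is not necessarily the routine used in ELISE (a library function such as `numpy.polyfit` performs the same kind of fit). The sample values are synthetic and are not measurements from the testbed.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V[j][k] = xs[j]**k.
    lhs = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(lhs[r][col]))
        lhs[col], lhs[pivot] = lhs[pivot], lhs[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, n):
            factor = lhs[row][col] / lhs[col][col]
            for k in range(col, n):
                lhs[row][k] -= factor * lhs[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(lhs[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (rhs[row] - tail) / lhs[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial p(x)."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Synthetic samples: slotframe sizes scaled to [0, 1] and a reliability-like
# metric that decays linearly, fitted with a degree-one polynomial.
sizes = [0.1, 0.3, 0.5, 0.7, 0.9]
reliability = [0.98 - 0.05 * s for s in sizes]
zeta_r = fit_polynomial(sizes, reliability, degree=1)
```

Scaling the slotframe sizes before fitting keeps the normal equations well conditioned, which matters more for the higher-degree fits than for the linear one shown here.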

C. Training
Fig. 9 presents the learning process of the DQN, A2C, and PPO reinforcement learning algorithms running at the RL module of the control plane. It is noteworthy that every episode starts at a random state, which includes a random slotframe size. Training for diverse starting slotframe sizes permits the agent to learn how to solve the problem in the presence of multiple states, e.g., new user requirements. Points in the chart represent the average accumulative reward during the last 1000 iterations. It is clear that the convergence performance of PPO is better than that of DQN and A2C. PPO therefore seems the best candidate to solve the problem; however, we also evaluate the algorithms' performance by taking solely deterministic actions over 100 episodes to decide which algorithm to pick. Table III shows that PPO obtained the greatest average accumulative reward, followed by A2C and DQN. All algorithms perform well in solving the problem; however, PPO stands out among the three. Therefore, we use PPO for our result analysis in the testbed.

We also tune the hyperparameters of the PPO model. Tuning is essential for finding the set of hyperparameters used to build the model. Without hyperparameter tuning, our model may produce sub-optimal solutions, as it fails to minimize the loss function. For hyperparameter tuning, we used Optuna [30], which is a hyperparameter optimization framework for machine learning. We tune the hyperparameters for PPO using a random sampler and median pruner, eight parallel jobs, 1000 trials, and a maximum of 50000 steps.

D. Experimental evaluation
We now evaluate the overall ELISE framework, including the trained agent, in the real-world testbed. We use the testbed setup previously discussed in Section V-A and the trained agent examined in Section V-C. The agent only takes deterministic actions. We designed a single scenario that contains four distinct, equally spaced user requirements: balanced, prioritized delay, prioritized power consumption, and prioritized reliability. This scenario allows us to test the agent's ability to select the best action given the observations and dynamic user requirements. It also enables us to observe the ability of the agent to switch between different slotframe sizes given a change in the user requirements. The evaluation consists of 10 episodes, where each episode lasts for 160 iterations. Each iteration, which is the size of the window interval, takes approximately four minutes to complete. Therefore, each episode runs for approximately 10.6 hours. The initial slotframe size (|C|) is set to 10 for all episodes. We plot the results individually for each performance metric against the slotframe size and the immediate reward against the slotframe size, as shown in Fig. 10. We use the 95% confidence interval for all charts.
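For reference, the scenario can be written down as a small lookup from timestep to the active user requirement, using the weights and zone boundaries reported in the subsections that follow. The names `Requirement`, `SCENARIO`, and `requirement_at` are hypothetical and serve only this illustration.

```python
from typing import NamedTuple


class Requirement(NamedTuple):
    name: str
    alpha: float  # weight of power consumption
    beta: float   # weight of delay
    gamma: float  # weight of reliability


ZONE_LENGTH = 40  # timesteps per user requirement

SCENARIO = [
    Requirement("balanced", 0.4, 0.3, 0.3),
    Requirement("prioritized delay", 0.1, 0.8, 0.1),
    Requirement("prioritized power consumption", 0.8, 0.1, 0.1),
    Requirement("prioritized reliability", 0.1, 0.1, 0.8),
]


def requirement_at(timestep):
    """Return the user requirement active at a given timestep of an episode."""
    if not 0 <= timestep < ZONE_LENGTH * len(SCENARIO):
        raise ValueError("timestep outside the 160-iteration episode")
    return SCENARIO[timestep // ZONE_LENGTH]
```

For example, `requirement_at(80)` returns the prioritized power consumption requirement, which corresponds to the start of Zone 3.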
1) Balanced SDWSN: We consider the case where the network user puts roughly the same priority on all three performance metrics. Therefore, for this specific requirement, we set the user coefficients α, β, and γ to 0.4, 0.3, and 0.3, respectively. The balanced requirement is applied at timestep zero and lasts for 40 timesteps (Zone 1 in Fig. 10). We can see that the trained agent needed three actions to reach the steady-state value |C| of 18 on average and that the RL agent balances all performance metrics equally. The network's average power consumption is roughly 4443 µW (see Fig. 10a), the network's average delay is almost 220 ms (see Fig. 10b), and the network's average reliability is around 0.95 (see Fig. 10c). The network reliability shows a larger dispersion in comparison to the other performance metrics. This can be attributed to the multiple sources of interference present on the testbed. The relatively small number of observations also affects the reliability: missing only one packet reduces the reliability by a few percentage points. The average immediate reward is around 1.13 (see Fig. 10d). The dispersion shown in the network reliability directly affects the dispersion of the reward, but not at the same level as in the prioritized reliability case.
2) Prioritized delay: In this case, ELISE users prioritize the network delay over the network reliability and power efficiency. Therefore, we set the user coefficients α, β, and γ to 0.1, 0.8, and 0.1, respectively. The prioritized delay requirement is applied 40 timesteps after the start of the episode and lasts for 40 timesteps (Zone 2 in Fig. 10). It can be seen that the RL agent reacts immediately after the user requirements change. The agent self-adapts to these changes and moves the slotframe size from 18 to 11 on average in about three actions. Since the delay is prioritized, the network average delay is lower than in the other requirement cases. On the other hand, the network power consumption reaches its maximum across all requirement cases, as the slotframe size is the lowest in the entire episode. The network average delay, power consumption, and reliability are 125 ms (see Fig. 10b), 4540 µW (see Fig. 10a), and 0.95 (see Fig. 10c), respectively. The average immediate reward is 1.13 and has a smaller dispersion than in the previous case, as the contribution of the network reliability to the reward function is smaller, as seen in Fig. 10d.
3) Prioritized power consumption: For this case, the network power efficiency is prioritized over the network delay and reliability. Thus, we set the user coefficients α, β, and γ to 0.8, 0.1, and 0.1, respectively. The requirements for this case are applied 80 timesteps after the start of the episode and also last for 40 timesteps (Zone 3 in Fig. 10). The network switches from a prioritized delay configuration to a prioritized power consumption configuration. The RL agent detects changes in the user requirements (see the immediate reward at timestep 80) and starts increasing the slotframe size immediately. Specifically, it increases the slotframe size from 11 up to 34 on average. It takes around 13 actions to reach the steady-state value. At this point, the network experiences lower power consumption and higher delays overall. The network average delay, power consumption, and reliability are 470 ms (see Fig. 10b), 4367 µW (see Fig. 10a), and 0.95 (see Fig. 10c), respectively. The average immediate reward is 1.14 and shows less dispersion than in the balanced and prioritized reliability cases, as the contribution of network reliability to the reward function is also smaller, as shown in Fig. 10d.
4) Prioritized reliability: In this situation, users prioritize network reliability over network power consumption and delay. Therefore, we selected the weights for this case as α = 0.1, β = 0.1, and γ = 0.8. The requirements for this case are applied at timestep 120 and last for 40 timesteps (Zone 4 in Fig. 10). At this time, the network switches from a prioritized power consumption configuration to a prioritized reliability configuration. The RL agent starts decreasing the slotframe size once it detects a change in the user requirements. The RL agent takes consecutive actions to reduce the slotframe size down to 12 on average. It also takes around 13 actions to reach the steady-state value. At this point, it is clear that the network experiences a low delay (145 ms in Fig. 10b) and high power consumption (4510 µW in Fig. 10a). The network reliability is high, with a steady-state value of 0.95 on average (see Fig. 10c). Overall, the network reliability is high for all user requirement cases, but it also shows a large dispersion. This can be attributed to the frequency of TSCH schedule updates. At every TSCH schedule update, the protocol updates its schedules, and there might be packets in the queue waiting to be transmitted. These packets are dropped due to the new schedule. To increase network reliability, users may want to increase the size of the window interval between cycles, at the expense of a controller that is less sensitive to changes in the environment. Moreover, the dispersion of the network reliability can also be attributed to the highly dense testbed, which has multiple platforms and the ability to run multiple experiments at the same time. The dispersion of the network reliability also affects the immediate reward; however, the RL agent could successfully select the correct actions despite the noise, as seen in Fig. 10d.

5) ELISE and Orchestra: In Fig. 11, we present an experimental comparison between the ELISE framework and Orchestra. We consider three performance metrics as previously discussed: network average power consumption, delay, and reliability. We do not look into the immediate reward metric, as Orchestra is an autonomous TSCH scheduler. This experimental evaluation considers four distinct user requirements: balanced, prioritized delay, prioritized power consumption, and prioritized reliability. The performance evaluation of network average power consumption in steady state is shown in Fig. 11a. Orchestra has the largest average power consumption among all cases, and the prioritized power consumption case has the lowest. The ELISE framework and the RL agent continuously adapt the slotframe size to satisfy the user requirements, in this case minimizing the network power consumption. Orchestra lacks this functionality, which forces it to use the slotframe size fixed at network boot. Even the prioritized delay user requirement presents less power consumption than Orchestra. This is because, in such a user requirement scenario, the power consumption weighting factor (α) still contributes to the immediate reward function. The network average delay comparison is presented in Fig. 11b. Orchestra presents a network average delay similar to that in the prioritized delay scenario, but the delay in ELISE is relatively smaller. In contrast, the prioritized power consumption requirement has the largest network delay due to the increased slotframe size used to reduce power consumption. The network average reliability comparison is shown in Fig. 11c. The network reliability in steady state, across all user requirement cases and Orchestra, is on average 95%. Orchestra shows slightly better network reliability than ELISE. This can be attributed to its autonomous scheduling and the smaller number of control packets transmitted. The effect of ELISE maximizing the network reliability for a prioritized reliability user requirement is not very noticeable in the chart; however, ELISE could maintain relatively high network reliability. Also, the change in network reliability is small for relatively short slotframe sizes, as shown in Fig. 8c; therefore, it is expected that network reliability will be more affected by slotframe sizes greater than 70.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose ELISE, an open-source framework that utilizes deep reinforcement learning to self-adapt network resources and optimize the slotframe size of TSCH for SDN-based IoT networks, considering dynamic user requirements. We provide a detailed description of all components involved in the framework, including the network manager module for resource orchestration, the network information and statistics module for data collection, the TSCH module for schedule processing, the routing manager module for route processing, and the ML module for hosting reinforcement learning functions.
We design a reward model based on a multi-objective function to select the optimal TSCH slotframe size that best matches the current user requirements. To expedite the training process of the reinforcement learning agent, we mathematically model the TSCH network in terms of the slotframe size. We then train and evaluate multiple state-of-the-art reinforcement learning algorithms to solve the problem.
Finally, the trained agent predicts the next valid slotframe size based on collected observations, and the framework orchestrates resources and deploys new network configurations to sensor nodes. We conduct several experiments to evaluate the performance of ELISE, considering a scenario with four distinct user requirements: balanced, prioritized delay, prioritized power consumption, and prioritized reliability. The tests assess ELISE's ability to self-adapt network resources in response to changes in user requirements. Results demonstrate that the proposed framework can detect changes in user requirements and orchestrate network resources effectively, thereby maximizing overall network performance.
This article also highlights the complexities involved in evaluating TSCH networks using reinforcement learning in real-world deployments. The training phase is laborious and time-consuming. Although network simulators reduce the iteration time to tens of seconds, this is still insufficient to train the model within a reasonable timeframe. Thus, an approximation model of the network was employed to accelerate the training process.
During each iteration, a few unnecessary network reconfiguration packets are transmitted and lost, leading to a decrease in overall network reliability. Additionally, a few packets are dropped in the TSCH queue due to schedule updates. We plan to address this issue as a future extension of our work, and we intend to further leverage the framework to develop a reinforcement learning scheduler that can autonomously adapt to user requirements.

Fig. 1. An example of a TSCH schedule for a six-node topology.

Fig. 6. The architecture of the RL agent of ELISE.

Fig. 9. Learning process of DQN, A2C, and PPO for solving the overall objective function.
Fig. 10. (a) Network average power consumption over the number of iterations. (b) Network average delay over the number of iterations. (c) Network average reliability over the number of iterations. (d) Network average immediate reward over the number of iterations.

Fig. 11. Experimental comparison between the ELISE framework and Orchestra.
Table I presents a summary of the latest research works on related topics including TSCH, SDWSNs, and RL. It also provides information related to the year of conception,

TABLE III. EVALUATION OF DQN, A2C, AND PPO ALGORITHMS OVER 100 EPISODES.