Deep Reinforcement Learning-Based Deterministic Routing and Scheduling for Mixed-Criticality Flows

Deterministic networking has recently drawn much attention through research on deterministic flow scheduling. Combined with artificial intelligence (AI) technologies, it is a promising network technology for facilitating automated network configuration in the Industrial Internet of Things (IIoT). However, the stricter requirements of the IIoT, namely deterministic and bounded latency for time-critical applications, pose significant challenges. This article incorporates deep reinforcement learning (DRL) into cycle specified queuing and forwarding and proposes a DRL-based deterministic flow scheduler (Deep-DFS) to solve the deterministic flow routing and scheduling problem. Novel delay-aware network representations, action masking, and a criticality-aware reward function are proposed to make Deep-DFS more scalable and efficient. Simulation experiments are conducted to evaluate the performance of Deep-DFS, and the results show that Deep-DFS can schedule more flows than the benchmark methods (heuristic- and AI-based methods).


I. INTRODUCTION
The Industrial Internet of Things (IIoT) adopted in manufacturing and factory automation is typically implemented by a specialized network for data exchange among sensors, actuators and other production equipment. To facilitate information exchange, industrial networks have evolved over the years and are expected to satisfy the emerging and challenging requirements of new operating contexts [1]. On the one hand, conventional network technologies cannot provide the deterministic and efficient communication that industry needs. To support critical data flows generated by IIoT applications with bounded low latency and low jitter, the IEEE Time-Sensitive Networking (TSN) and the IETF Deterministic Networking (DetNet) working groups were established to study timing guarantees for critical traffic. On the other hand, with ever-increasing IIoT connectivity, network administration still relies on humans to design, configure and manage sophisticated and dynamic industrial scenarios, which is neither efficient nor sustainable. Next-generation network automation based on artificial intelligence (AI) technologies has been proposed to tackle this challenge. Along with the network programmability of 5G networks, the AI-enabled paradigm will carry out intelligent, automated network configuration, optimization and management in the Industry 5.0 era.
Recently, the IETF DetNet working group has been studying deterministic data transmission by incorporating Segment Routing (SR) in Layer 3 to extend TSN technologies for queuing and scheduling, in order to support deterministic bounded latency and jitter for time-critical traffic [2]. In particular, building on the programmability of SR, the working group is currently specifying a Cycle Specified Queuing and Forwarding (CSQF) mechanism [3] to schedule flows in a more flexible and scalable way: the forwarding time slot can be specified for each packet, which increases bandwidth efficiency. In CSQF, multiple queues at the output port open in a round-robin way and transmission cycles repeat periodically at each port. The segment routing identifiers (SIDs) carried in IP packets determine the routing and forwarding of a packet at each hop, that is, the route and the transmission time slots along the selected path for all packets of time-critical flows, so that the end-to-end latency is controlled in a deterministic way. We refer to this as the Deterministic Flow Routing and Scheduling (DFRS) problem in this paper. To this end, a network controller is required to collect the overall network information and decide the proper schedule for the flows.
Different kinds of solutions have been proposed to solve the flow scheduling problem: solver-based methods, e.g., an ILP tool [4], and heuristic-based methods, e.g., list-based methods [5]. Nevertheless, because solver-based methods have high computational complexity and heuristic-based methods are usually handcrafted with certain expertise, scalable and intelligent scheduling approaches are desired. Therefore, the authors in [6] leveraged deep reinforcement learning (DRL) to solve the time-triggered (TT) flow scheduling problem incrementally. However, the agent trained in [6] was only used to make routing decisions; the transmission cycles for time-critical packets were determined by a heuristic-based method, which chooses the earliest time slots along the path for packet forwarding to minimize the end-to-end (E2E) delay of the TT flow. Yet the users of time-sensitive networks only care about the delay bound guarantee, and earlier delivery of any particular packet is not necessary. In addition, always choosing the earliest available time slots to minimize the E2E delay causes network congestion.
In this paper, we investigate the routing and scheduling problem of mixed-criticality DetNet (DN) flows for deterministic performance guarantees and propose a deep reinforcement learning (DRL) based deterministic routing and scheduling approach to solve it. Note that minimizing the E2E delay is not the objective of this paper; the proposed approach schedules the flows so that the resulting E2E delays are near the deadlines without exhausting network resources. In addition, we consider multiple criticalities of the timing requirements of DN flows, i.e., Hard Real-Time (HRT) and Soft Real-Time (SRT), and train the DRL agent to make flow routing and scheduling decisions with the objective of maximizing the number of HRT flows scheduled and the utilities of SRT flows. The contributions of this paper are as follows:
• A DRL-based deterministic flow scheduler (Deep-DFS) is devised based on a Markov decision process (MDP) approach for the flow routing and scheduling problem, and a branching dueling Q-network (BDQN) is introduced into the framework to derive the optimal policy of the MDP model.
• We propose several methods to enhance the efficiency and schedulability of Deep-DFS: 1) we divide a complete flow schedule into multiple simple actions to increase the scheduling scalability; 2) we use a latency-aware network representation method to better extract key information for flow routing and scheduling; 3) we introduce action masking to filter invalid actions and avoid too many negative rewards; 4) we design a criticality-aware reward function to schedule flows with different criticalities.
• Finally, an extensive performance evaluation is carried out with both single-path and multipath scheduling. The results show that, in an incremental scenario, the Deep-DFS scheme can schedule more DN flows than the benchmark methods, and that multipath scheduling performs better than single-path scheduling.
We introduce related work in Section II. The system model and problem formulation are presented in Section III. Section IV details the BDQN-based deterministic flow routing and scheduling method. The evaluation of the proposed method is discussed in Section V. Finally, we draw concluding remarks in Section VI.

II. RELATED WORK
In the context of TSN, most studies in the literature focus on the static scheduling and dynamic reconfiguration of gate openings and closings at output ports to satisfy a certain traffic matrix. In this case, routing information is generally given by the spanning tree protocol operating at layer 2. For 802.1Qbv, the disadvantages of the flow-based Time-Aware Shaper (TAS) are the limited solution space of the gate control list (GCL) synthesis and the long time it takes to solve the GCL synthesis for large-scale networks. To solve this problem, the authors in [7] proposed a stream-based, class-based TAS without per-flow scheduling, which relaxes the assumption that gate openings for multiple ST queues must be mutually exclusive. Furthermore, the authors in [8] proposed a more general flexible window-based scheduling model: besides the above constraint relaxation, windows do not have to be aligned and can be placed at any time slot on the nodes of the network. For network reconfiguration under dynamically changing requirements, the concept of multi-stream gate control for TAS was proposed in [9]. The proposed idea enables runtime reconfiguration of the GCL to avoid a reduction in bandwidth utilization irrespective of the burst size and the number of streams.
When routing can also be decided, e.g., in layer 3, the joint routing and scheduling problem remains a challenge. The authors in [10] present an integer linear programming (ILP) formulation that models the joint routing and scheduling problem for flows of periodic real-time transmissions in converged TSN networks. Network calculus-based flow routing and bandwidth allocation in an IP-over-WDM architecture is also investigated to achieve deterministic data delivery in metro-aggregation networks [11]. The authors in [3] also focus on the joint routing and scheduling problem in large-scale deterministic networks using CSQF to maximize traffic acceptance for network planning and online flow admission. Joint routing and network resource allocation for the deterministic service function chaining (SFC) problem was also investigated in [12], where a novel Deterministic SFC Deployment algorithm (Det-SFCD) and an SFC Adjustment algorithm (Det-SFCA) were proposed to ensure deterministic performance during the SFC lifetime.
To the best of the authors' knowledge, limited work has been done on AI-based methods for deterministic flow routing and scheduling, and the problem of provisioning deterministic latency (rather than minimizing the E2E latency) for mixed-criticality flows has not been solved. Therefore, we use DRL to solve the deterministic flow routing and scheduling problem from this perspective, which differs from the above-mentioned work.

III. SYSTEM MODEL AND PROBLEM FORMULATION
A DetNet system refers to a network composed of DetNet nodes (i.e., routers), whereby the packet forwarding delay inside a node is deterministic and known through a centralized flow scheduling scheme. In a DetNet system, the delay induced by forwarding a packet comprises four parts: (i) propagation delay, which is decided by the physical distance between two nodes; (ii) processing delay, which is related to the procedure of receiving the packets and sending them to the upper layers for the routing and scheduling decision; (iii) transmission delay, which is the time for putting the packet on the physical link; and (iv) queuing delay, which refers to the waiting time in the queue of the output port caused by the accumulation of packets from different input ports at the same output port [13]. If the topology information (i.e., the distance between any two nodes and the bandwidth of each physical link) is given, the propagation, processing and transmission delays can be assumed to be constant. Thus, to make the overall forwarding delay deterministic, the queuing delay should be determined properly in advance, so that the sum of delays along the path (the end-to-end delay) is constant and within the latency requirements.

A. CSQF enabled DetNet system
Initially, Cycle Queuing and Forwarding (CQF [14], i.e., IEEE 802.1Qch) was proposed as a peristaltic shaper that uses two queues per port, which open and close alternately in a cyclic fashion. It divides time into cycles of equal duration $T$. A packet sent from the preceding node in cycle $c$ must be received during the same cycle at the subsequent node and is then transmitted in cycle $c + 1$. Although CQF controls the delay over each hop well (at most two cycles), the mechanism does not scale, since it only works well for small networks and assumes perfect synchronization between nodes.
To improve scalability and flexibility, the Cycle Specified Queuing and Forwarding (CSQF) mechanism [15] has been devised as an emerging standard draft from the IETF DetNet working group as the evolution of the CQF mechanism. CSQF delays packets with more queues and specifies a certain cycle in which to transmit each packet. Inside a CSQF-enabled router, each output port is equipped with $N$ queues, of which $N_D$ ($N_D \le N$) queues are reserved for time-critical traffic, while the remaining non-critical (NC) queues are for best effort (BE) traffic. These $N$ queues transmit packets in a round-robin fashion, that is, during each cycle only one queue is active and emits packets to the physical link, while the other $(N - 1)$ inactive queues are closed and enqueue packets for future transmissions. Note that the number of packets that can be enqueued in each inactive queue depends on the buffer size of the queue, and improper enqueuing will incur packet loss. The $N_D$ time-sensitive queues are dedicated to the time-critical flows by resource reservation. The assignment of packets to specific queues decides their transmission cycle, and a packet can be delayed by at most $(N - 1)$ cycles. This assignment can be determined by a centralized controller in advance, while the BE flows without critical timing requirements are not scheduled in advance by the controller. When the packets of BE flows arrive at a node, they are directly inserted into the $(N - N_D)$ NC queues, whose queuing delay is neither controllable nor deterministic. Note that unlike CQF, CSQF operates at layer 3 [15], as it allows the routing and cycle scheduling of packets to be specified (e.g., with Segment Routing).
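The relation between queue assignment and transmission cycle described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the queue active in cycle `c` is `c % num_queues` and that a packet cannot be enqueued into the queue that is currently transmitting; the function names are hypothetical and not part of the CSQF draft.

```python
def csqf_delay(arrival_cycle, queue, num_queues):
    """Cycles a packet waits when it is enqueued into `queue` on arrival.

    Assumption (illustrative only): the queue active in cycle c is
    c % num_queues, so the chosen queue opens again after this many cycles.
    """
    active_now = arrival_cycle % num_queues
    if queue == active_now:
        raise ValueError("the active queue is transmitting and cannot enqueue")
    return (queue - active_now) % num_queues


def transmission_cycle(arrival_cycle, queue, num_queues):
    """Cycle in which the packet leaves the node."""
    return arrival_cycle + csqf_delay(arrival_cycle, queue, num_queues)


# With N = 4 queues, a packet arriving in cycle 2 can wait 1, 2 or 3 cycles,
# i.e., at most N - 1 cycles, depending on the queue it is assigned to.
delays = sorted(csqf_delay(2, q, 4) for q in range(4) if q != 2 % 4)
print(delays)
```

Under these assumptions, choosing the queue is equivalent to choosing the per-hop offset, which is the quantity the controller decides in the rest of this section.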

B. DetNet and Flow model
The network topology in this paper is modeled as a directed graph $G = (V, E)$, where $V$ is the set of nodes representing DetNet-enabled routers. The nodes are connected by data links represented by the directed edge set $E \subseteq V \times V$. If there exists a physical link between $u, v \in V$, then $(u, v), (v, u) \in E$. Each edge $e_i = (u, v) \in E$ induces a delay $d_{e_i}$, which comprises its propagation delay, transmission delay and processing delay. Time is partitioned into cycles of equal duration $T$, and $\mathcal{T}$ denotes the set of cycles. One cycle is the minimal scheduling unit into which packets can be inserted and defines the granularity of the latency calculation.
A DetNet (DN) flow is defined as periodic unicast traffic from a source node to a destination node. We denote the set of DN flows to be scheduled within the network as $F$. Since flows have different periods $prd_k$, we define an overall scheduling cycle, referred to as the hypercycle, so that all network behaviors are the same in each hypercycle. The hypercycle $prd_s$ of all flows is calculated as the least common multiple of the periods of all flows. Hence, we discuss the flow scheduling problem within one hypercycle. In practice, the starting times of the hypercycles at different nodes may not be synchronized, and there exists an offset between two nodes due to clock drift, which can be measured and is known to the controller. For simplicity and without loss of generality, we assume there is no offset, so that all hypercycles are aligned across the network. Furthermore, considering criticality in terms of latency requirements, the DN flows can be further classified into hard real-time (HRT) flows, which have strict deadlines, and soft real-time (SRT) flows, whose QoS is degraded by a deadline violation. Best effort flows, which have no timing requirements, are also considered in this paper as background flows. Both HRT and SRT flows have delay bounds $D^{\max}_k$ and $D^{\min}_k$. However, the HRT delay bound is hard: violating it may result in catastrophic consequences. The scheduling policy must guarantee that all HRT flows in the network are transmitted within their delay bounds. The SRT deadline is soft, that is, the performance of the SRT flow degrades if the delay bounds are missed. Similar to [16], [17], a positive utility function $U_k(t)$ is introduced to evaluate the performance of SRT flows, where $t$ is the actual experienced E2E delay. If the packets of an SRT flow arrive within the delay bounds, the utility stays at a predefined positive (maximal) value. Once the actual E2E delay goes beyond the delay bounds, the utility decreases to zero, as specified by the definition of $U_k(t)$.
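As a small illustration of the hypercycle definition, the following sketch computes the least common multiple of a set of flow periods (in cycles) and the number of packets each flow sends per hypercycle. The function names and the example periods are hypothetical.

```python
from functools import reduce
from math import gcd


def hypercycle(periods):
    """Least common multiple of all flow periods (in cycles)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def packets_per_hypercycle(periods):
    """Number of packets each period contributes within one hypercycle."""
    h = hypercycle(periods)
    return {p: h // p for p in periods}


periods = [2, 3, 4]  # illustrative periods, not taken from the paper
print(hypercycle(periods))
print(packets_per_hypercycle(periods))
```

Because the schedule repeats every hypercycle, the controller only needs to reserve cycles for this many packets per flow.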

C. Deterministic Flow Routing and Scheduling (DFRS): an example
To ensure that no collision or congestion can happen, the controller needs to decide, for each packet, where and when it will be transmitted in each node, i.e., if a packet is sent in the first available cycle or delayed by one or more additional cycles before transmission.
In Fig. 1, we show an example of how a packet is propagated from node A to node D through nodes B and C. We assume that 1) the link delays $d_{A,B}$, $d_{B,C}$ and $d_{C,D}$ are one cycle, two cycles and one cycle, respectively; 2) the period of the flow of interest (FOI) is 2 cycles. Once the packets of the FOI are sent from A, they are received at B in the next cycle (e.g., packet 1 is sent in cycle 1 at node A and received in cycle 2 at node B), since the link delay between nodes A and B is one cycle. Upon receiving packet 1 in cycle 2, the controller can decide to forward packet 1 in the next cycle (cycle 3). However, considering the high traffic load in cycle 3 at node B, it is better to delay the packet forwarding by 2 cycles (i.e., the CSQF offset), so that packet 1 is forwarded in cycle 4. It is then received in cycle 6 due to the two-cycle delay between nodes B and C. The E2E delay of a packet is calculated as the number of cycles it spends along the path. For example, the E2E delay of packet 1 is 7 cycles (cycle 8 − cycle 1). The controller must, on the one hand, ensure that the E2E delays of the packets are within the delay bounds of the flow and, on the other hand, avoid network congestion in certain cycles or on certain edges.
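The arithmetic of this example can be reproduced with a short sketch. The link delays and the offset of 2 cycles at node B come from the text; the offset of 1 cycle at node C is an assumption chosen so that the packet arrives in cycle 8, and the function name is hypothetical.

```python
def forwarding_timeline(send_cycle, link_delays, offsets):
    """Return the transmission cycle on every edge and the arrival cycle.

    link_delays[i] is the delay of the i-th edge of the path; offsets[i] is
    the number of cycles the packet waits at the node reached over edge i
    before it is forwarded on edge i + 1.
    """
    tx_cycles = [send_cycle]
    t = send_cycle
    for delay, offset in zip(link_delays[:-1], offsets):
        t = t + delay + offset  # arrive after `delay`, then wait `offset`
        tx_cycles.append(t)
    arrival = t + link_delays[-1]
    return tx_cycles, arrival


# Path A -> B -> C -> D with link delays 1, 2, 1 (from the example);
# offset 2 at B (from the example) and offset 1 at C (assumed).
tx, arrival = forwarding_timeline(1, [1, 2, 1], [2, 1])
print(tx, arrival, arrival - 1)  # transmission cycles, arrival, E2E delay
```

With these values the packet is transmitted in cycles 1, 4 and 7 and arrives in cycle 8, which gives the 7-cycle E2E delay stated above.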

D. DFRS: Multipath Case
For more efficient flow scheduling and better load balancing, Multipath TCP (MPTCP) technology [18], which allows multiple paths in a single TCP connection by spreading traffic across multiple parallel sub-flows, has shown great advantages in emerging scenarios where heavy traffic needs to be transmitted. Different from the single-path flow scheduling in Fig. 1, flow splitting and multipath routing are considered in this scenario, where a single communication path is not sufficient to transmit the DN flow and multiple communication paths are needed. As shown in Fig. 2, we assume that a TCP connection $f_1$ is upcoming between Host 1 and Host 2, whose maximum and minimum deadlines are 5 and 3 cycles, respectively. If the scheduler fails to find a valid schedule along a single path, for example because there is not enough bandwidth for flow $f_1$ within cycles 4-6 at node B, multipath scheduling is applied to this flow by splitting it evenly into multiple sub-flows. For instance, there are two candidate routing paths $\{(A, B)\}$ and $\{(A, C), (C, B)\}$, which can be used for two sub-flows $f_{1.1}$ and $f_{1.2}$. Eventually, the end-to-end delays of these two sub-flows are 4 and 5 cycles, respectively, and the packets of the sub-flows are reassembled at Host 2 without violating the deterministic delay requirement of flow $f_1$.

E. DFRS modelling
For a flow $f_k$ to be scheduled, the controller needs to determine a unique feasible scheduled path $P$, i.e., a sequence of edges $(e_1, e_2, \ldots, e_n)$ in which edge $e_1$ starts from $src_k$, edge $e_n$ ends at $dst_k$, and $e_i$ and $e_{i+1}$ are adjacent edges. The path must satisfy the constraints

$$e_1.\mathrm{src} = src_k, \tag{2}$$

$$e_i.\mathrm{dst} = e_{i+1}.\mathrm{src}. \tag{3}$$

However, determining the transmission cycle of every packet of the target flow individually at each edge $e_i$ is impractical due to the high scheduling complexity. We therefore define an integer variable $o_{k,e_i}$ to represent the offset at edge $e_i$ for all packets of flow $f_k$, that is, all packets of flow $f_k$ arriving at edge $e_i$ are delayed by $o_{k,e_i}$ cycles. If the first packet of flow $f_k$ leaves the source node $src_k$ in cycle $t^k_1$, it arrives at the next node in cycle $t^k_1 + d_{e_1}$ and is transmitted again in cycle $t^k_2 = t^k_1 + d_{e_1} + o_{k,e_2}$. Therefore, the cycle determination for flow $f_k$ can be represented by an integer sequence $(t^k_1, t^k_2, \ldots, t^k_n)$, where $t^k_i \in \mathbb{Z}^+$ is the index of the cycle in which the first packet is to be forwarded on the corresponding edge $e_i$. The transmission cycles of the remaining packets follow directly as $t^k_i + l \cdot prd_k$, $l \in \{0, 1, \ldots, n\}$.
Thus, the schedule $S_k$ of a DN flow $f_k$ from source node $src_k$ to destination node $dst_k$ can be denoted as

$$S_k = \{(e^k_1, t^k_1), (e^k_2, t^k_2), \ldots, (e^k_n, t^k_n)\}.$$

A schedule $S_k$ is valid for flow $f_k$ if the following two conditions hold.
(1) E2E latency constraint: The E2E delay of all packets of HRT flow $f_k$ must lie between the minimum and maximum end-to-end delay bounds:

$$\text{C1:} \quad D^{\min}_k \le t^k_n + d_{e_n} - t^k_1 \le D^{\max}_k.$$

(2) Cycle capacity constraint: If the packets of flow $f_k$ are to be transmitted on edge $e_i$ in cycle $t$, denoted by $x^t_{k,e_i}$, the bandwidth $bw_k$ is reserved in the corresponding cycles. Since the bandwidth capacity of each cycle is shared among the scheduled flows, the traffic load on any edge $e_i$ during any cycle $t$ must not exceed its capacity $cap$, which is the cycle duration $T$ multiplied by the link data rate $G$. This condition is ensured by the constraint

$$\text{C2:} \quad \sum_{f_k \in F} x^t_{k,e_i} \cdot bw_k \le cap, \quad \forall e_i \in E,\ \forall t \in \mathcal{T}.$$
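The two validity conditions, together with the routing and timing requirements, can be checked with a sketch like the one below. It is only an illustration under stated assumptions: edges are `(u, v)` tuples, `load` maps `(edge, cycle)` to the bandwidth already reserved, packet cycles wrap around modulo the hypercycle, and the E2E delay is taken as the arrival cycle at the destination minus the first transmission cycle. All names are hypothetical.

```python
def packet_cycles(first_cycle, period, hypercycle):
    """Cycles used by every packet of a flow within one hypercycle."""
    return [(first_cycle + l * period) % hypercycle
            for l in range(hypercycle // period)]


def is_valid_schedule(schedule, link_delay, period, bw, d_min, d_max,
                      load, cap, hypercycle):
    """Check a schedule [(edge, first-packet cycle), ...] for one flow."""
    # Routing: consecutive edges must connect head to tail.
    for (edge, _), (nxt, _) in zip(schedule, schedule[1:]):
        if edge[1] != nxt[0]:
            return False
    # Timing: forward only after the packet has crossed the previous link.
    for (edge, t), (_, t_next) in zip(schedule, schedule[1:]):
        if t_next <= t + link_delay[edge]:
            return False
    # E2E latency constraint: delay must lie within [d_min, d_max].
    (_, t_first), (last_edge, t_last) = schedule[0], schedule[-1]
    e2e = t_last + link_delay[last_edge] - t_first
    if not d_min <= e2e <= d_max:
        return False
    # Cycle capacity constraint: reserved load plus bw must fit in cap.
    for edge, t in schedule:
        for c in packet_cycles(t, period, hypercycle):
            if load.get((edge, c), 0) + bw > cap:
                return False
    return True


delays = {("A", "B"): 1, ("B", "C"): 2, ("C", "D"): 1}
sched = [(("A", "B"), 1), (("B", "C"), 4), (("C", "D"), 7)]
print(is_valid_schedule(sched, delays, 2, 1, 5, 8, {}, 2, 8))
```

The example reuses the path of Fig. 1; the bandwidth, capacity and delay bounds are made-up values for illustration.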

F. Problem formulation
The flow scheduling problem can be formulated as follows: given the network topology and the DN flows, find valid schedules (route and cycle allocation) for all flows so that all HRT flows are scheduled and the utility of the SRT flows is maximized. We define the variable $H_k$ to indicate whether the HRT flow $f_k$ is successfully scheduled: $H_k = 1$ when HRT flow $f_k$ is successfully scheduled, and $H_k = 0$ otherwise. The Deterministic Flow Routing and Scheduling (DFRS) problem then maximizes the number of scheduled HRT flows and the utilities of the SRT flows, subject to the routing, latency and capacity constraints above.

G. Markov Decision Process Based Model
The learning process of Reinforcement learning (RL) is usually modeled as a Markov Decision Process (MDP), with the state space S, the action space A, and the reward R devised as follows.
1) State space: A system state $s_t$ represents all the information about the network that the RL agent can observe and use to generate a schedule $S_k$ for a flow $f_k$. We build the network state $s_t$ by extracting network features from three aspects: 1) topology information, 2) flow information, and 3) network load information. $s_t$ can be divided into $s_t = \{s_{t,e_1}, s_{t,e_2}, \ldots, s_{t,e_i}, e_i \in E\}$, and each $s_{t,e_i}$ consists of:
• Whether the edge $e_i$ is adjacent to the edge of the former action or to the source node $src_k$.
• The distance between this edge and the destination node $dst_k$.
• The difference between the current selected cycle and $d^{\min}_k$ (i.e., the minimum delay requirement minus the delay already spent).
• The difference between the current selected cycle and $d^{\max}_k$.
• The traffic load of this edge, expressed as the ratio of the number of available cycles to the total number of cycles. Generally, choosing an edge with lower network load maintains load balancing and avoids bottleneck links.
• The list of cycle loads (in percent).
The state $s_t$ is updated each time the agent selects a sub-action.
2) Action space: Deep-DFS improves scalability by dividing the schedule of a flow into a set of sub-actions. That is, the schedule $S_k = \{(e^k_1, t^k_1), (e^k_2, t^k_2), \ldots, (e^k_n, t^k_n)\}$ of flow $f_k$, where $a^k_i = (e^k_i, t^k_i)$, is derived from a sequence of edges and cycles. Specifically, each $a^k_i = (e^k_i, t^k_i)$ acts as a sub-action, $e^k_i$ denotes an edge that the flow needs to go through, and $t^k_i$ represents the cycle in which packets are transmitted on $e^k_i$. A valid path should satisfy the following conditions: 1) the edges in the sub-actions should connect head to tail; 2) the constraint $t^k_{i+1} > t^k_i + d_{e_i}$ should hold, where $d_{e_i}$ is the link delay of $e_i$, to ensure valid timing. Combined with constraints (4) and (5), a valid schedule for flow $f_k$ should meet these four constraints.
3) Reward Function: After receiving a sub-action $a^k_i$, the agent obtains a reward value $R(a^k_i)$ from the environment based on the effect of this sub-action.
The reward of a sub-action $a^k_i$ comprises two parts: 1) how much congestion it brings to the network; 2) whether this sub-action finalizes a complete and valid schedule of a flow.
We define the link utilization rate $U_{e_i}$ as the number of cycles that are not available for flows divided by the number of all cycles on link $e_i$:

$$U_{e_i} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} Q^t_{e_i},$$

where $Q^t_{e_i}$ indicates whether cycle $t$ on $e_i$ is fully occupied, and $|\mathcal{T}|$ and $|E|$ are the total number of cycles considered in one hypercycle and the number of edges in the topology, respectively. Furthermore, the global bandwidth utilization ratio $U_s$ over all cycles and edges in the network is defined as

$$U_s = \frac{1}{|E|\,|\mathcal{T}|} \sum_{e_i \in E} \sum_{t \in \mathcal{T}} Q^t_{e_i}.$$

Besides the impact of a selected sub-action on the link utilization rate, the cycle utilization rate $I^t_{e_i}$ should also be considered; it is defined from $P^t_{e_i}$, the traffic load in cycle $t$ of edge $e_i$, and the overall cycle utilization rate of edge $e_i$ is denoted by $I_{e_i}$. To make the training converge quickly, we use $\alpha$, $\beta$, $\eta$ to weight each part so that the reward value lies within $(-1, 1)$, which gives the reward of the sub-action $a_i$. Note that if the usage $U_{e_i}$ is larger than the global usage $U_s$ after mapping the flow to edge $e_i$, the sub-action $a_i$ receives a negative reward. The same applies to the cycle usages $I_{e_i}$ and $I^t_{e_i}$. The second part takes effect only if the sub-action $a^k_i$ is the last edge of a valid path for a flow. If the agent finalizes the scheduling of a flow, $H_k$ or $U_k$ is calculated according to the flow type.
After selecting the last action $a^k_n$ of a valid route, we check whether the flow is scheduled successfully (i.e., whether the latency, capacity and routing requirements are all satisfied), and the rewards of the sub-actions $R(a^k_1), R(a^k_2), \ldots, R(a^k_n)$ are then updated by adding an extra reward for the second part in a decayed way. Intuitively, earlier sub-actions have a larger exploration space and thus have less impact on a valid schedule, while later sub-actions are more significant for constituting a valid schedule.
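The structure of the reward can be sketched as follows. This is not the paper's exact reward: the forms of the cycle utilization terms (load divided by capacity, and its mean over the edge), the linear combination with weights `alpha` and `beta`, and the geometric decay of the terminal bonus are all assumptions made for illustration, guided only by the sign conventions described above. `load[edge]` is assumed to hold the per-cycle traffic load after the flow has been mapped.

```python
def utilization_terms(load, cap, edge, cycle):
    """Return (U_e, U_s, I_e^t, I_e) for the chosen edge and cycle."""
    num_cycles = len(load[edge])
    # Q_e^t: a cycle counts as unavailable when it is fully occupied.
    full = {e: [p >= cap for p in cycles] for e, cycles in load.items()}
    u_edge = sum(full[edge]) / num_cycles
    u_global = sum(sum(f) for f in full.values()) / (len(load) * num_cycles)
    i_cycle = load[edge][cycle] / cap              # assumed form of I_e^t
    i_edge = sum(load[edge]) / (cap * num_cycles)  # assumed form of I_e
    return u_edge, u_global, i_cycle, i_edge


def congestion_reward(load, cap, edge, cycle, alpha=0.5, beta=0.5):
    """First reward part: negative when the choice is busier than average."""
    u_edge, u_global, i_cycle, i_edge = utilization_terms(load, cap, edge, cycle)
    return alpha * (u_global - u_edge) + beta * (i_edge - i_cycle)


def add_terminal_bonus(rewards, bonus, eta=1.0, decay=0.9):
    """Second reward part: spread the bonus so later sub-actions get more."""
    n = len(rewards)
    return [r + eta * bonus * decay ** (n - 1 - i) for i, r in enumerate(rewards)]


load = {"e1": [2, 0], "e2": [0, 0]}  # illustrative per-cycle loads
print(congestion_reward(load, cap=2, edge="e1", cycle=0))
print(add_terminal_bonus([0.0, 0.0, 0.0], bonus=1.0, decay=0.5))
```

In this sketch, `bonus` would be $H_k$ for an HRT flow or $U_k$ for an SRT flow once the last sub-action completes a valid schedule.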

H. Optimization formulation
This paper aims to obtain the optimal deterministic flow routing and scheduling policy, denoted by $\pi^*$, that maximizes the long-term reward of mapping flows onto the network. The DFRS problem can be transformed into an optimization problem that maximizes the expected discounted future reward, where $R(a^k_i)$ is the reshaped reward under the policy $\pi$ from (12), and the discount factor $\gamma$ indicates how much more valuable current rewards are than later rewards.
To solve the optimization problem in Eq. (13), we propose a branching dueling Q-network based deterministic flow routing and scheduling algorithm, which evaluates each action dimension (i.e., edge and cycle selection) separately and also uses action masking to improve the training efficiency.

IV. BRANCHING DUELING Q-NETWORK BASED DETERMINISTIC FLOW ROUTING AND SCHEDULING
To facilitate the learning process for an MDP problem, deep Q-network (DQN) based methods are widely leveraged in the literature. For example, Dueling DQN is proposed to eliminate overestimation in the learning process and improve the performance of the double deep Q-network (Double DQN) algorithm [19], [20]. By applying a primary neural network $Q_{net}$ as a nonlinear function approximator to select an action, and using a target neural network $Q_{target}$ to estimate the target Q-value of the taken action, Double DQN stabilizes the training process of the RL agent. Dueling DQN further improves Double DQN by using two separate neural networks to estimate the state value and the action advantage, which are then aggregated at the output layer. By doing this, Dueling DQN produces more robust estimates of the state values, which leads to significant improvements in the convergence rate and the stability of the learning process.
However, in order to solve the DFRS problem defined in this paper using the Dueling DQN method, several challenges remain to be tackled:
• Unlike simple scheduling tasks where the action space has a single dimension, the action space of the DFRS problem has two dimensions, i.e., edge selection and cycle selection, which are chosen independently of each other. However, these two dimensions are synergistic when scheduling DN flows for deterministic performance: the edge selection should consider the cycle utilization of the network, and the cycle selection in turn depends on the selected edge/path, which brings a huge challenge to deterministic flow scheduling;
• To avoid a large number of invalid actions when scheduling a DN flow and to improve learning efficiency, the size of the action space, which is proportional to the number of edges in the topology and the number of cycles considered in a hypercycle, can be further reduced;
• Specialized rewards for flows with different criticalities should be designed so that the DN flows are scheduled with different priorities.
Therefore, as shown in Fig. 3, three techniques are proposed in this section to respond to the challenges mentioned above. Generally, two neural networks are employed: the primary network for selecting an action, and the target network for estimating the target Q-value of the taken action. The parameters $\theta$ of the target network are updated from the primary network every fixed number of iterations. The experience replay technique is also adopted to stabilize the learning process. In addition, to improve the learning efficiency and accuracy, we carry out the following:
• We combine action branching with the dueling deep Q-network (Dueling DQN) [21], which is referred to as a branching dueling Q-network (BDQN), to solve the action selection with multiple dimensions;
• We use action masking to block a large number of invalid actions. A delay-aware masking method is designed to reduce the size of the action space while ensuring enough exploration space;
• We design a criticality-aware reward function to assign different priorities to DN flows of different types, e.g., high priority for HRT flows.

A. Branching Dueling Q-Network
The action branching methods proposed in [22] improve the conventional deep Q-network to solve MDPs with a multidimensional discrete action space. The main idea is to evaluate each individual action dimension $a^{k,d}_i \in \mathcal{A}_d$, $d \in \{1, \ldots, M\}$, e.g., edge and cycle selection in the DFRS problem, while keeping a common state-value estimator across the dimensions. Each dimension has a certain degree of autonomy. The structure of BDQN is illustrated in Fig. 3(a): based on Dueling DQN, the BDQN further splits the advantage branch into two advantage branches while keeping a shared representation of the input state. Specifically, the advantage branches correspond to the two dimensions of the action in this study, i.e., $a^{k,1}_i = e^k_i$ and $a^{k,2}_i = t^k_i$, and each dimension has $n_d$ sub-actions. For example, $n_d$ of the edge dimension is $|E|$. As shown in Fig. 3(a), a network state $s_t$ is input into a shared neural network (yellow block), which computes a latent representation that is used to evaluate the state value $V(s_t)$ (blue block) and the factorized (state-dependent) action advantages $A_d(s_t, a^{k,d}_i)$ (green block) on the subsequent independent branches. The dimensions of the action advantages and the state value are aligned, i.e., $\max\{n_1, \ldots, n_M\}$. The state value $V(s_t)$ and the factorized advantages are then combined to output the Q-values for each action dimension. Note that, in order to keep only the Q-values of the valid sub-actions of each action dimension, action masking is applied here to accelerate the learning process. Finally, the sub-actions with the maximal Q-values in each action dimension are selected to generate a joint action tuple.
The advantage of each dimension, i.e., $A_d(s_t, a_i^{k,d})$, is trained with the common state value $V(s_t)$, and the Q-value of each dimension $Q_d(s_t, a_i^{k,d})$ is then obtained by aggregating the value branch and the corresponding advantage branch.
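A minimal sketch of such an aggregation, assuming the mean-subtraction form commonly used with action branching (this exact form is an assumption here, not something the text above states). The function names and the numbers are hypothetical.

```python
from typing import List


def branch_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Q_d(s, a) = V(s) + (A_d(s, a) - mean over a' of A_d(s, a')) for one branch d."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]


def all_branch_q_values(state_value: float,
                        branch_advantages: List[List[float]]) -> List[List[float]]:
    """Apply the same shared state value to every advantage branch."""
    return [branch_q_values(state_value, adv) for adv in branch_advantages]


# Hypothetical example: 3 candidate edges and 4 candidate cycles.
v = 1.0
edge_adv = [0.5, -0.5, 0.0]
cycle_adv = [2.0, 0.0, 0.0, -2.0]
q_edge, q_cycle = all_branch_q_values(v, [edge_adv, cycle_adv])
```

Subtracting the per-branch mean keeps the value and advantage streams identifiable: the Q-values of each branch average to the shared state value.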
Then, the action selection follows the ϵ-greedy policy, that is, the agent selects a random action with probability ϵ and, with probability (1 − ϵ), selects the subaction with the maximal Q-value in each dimension. The TD target of BDQN is similar to that in dueling DQN to avoid maximization bias, but it is derived by averaging across all dimensions of the action. We use the mean squared error (MSE) to update the parameter θ by gradient descent.
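The selection rule and the averaged target can be sketched as follows. This is an illustration under assumptions: the target is taken as the reward plus the discounted mean, over branches, of the target-network Q-value at the primary network's greedy subaction; the `done` flag and all function names are hypothetical.

```python
import random
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def epsilon_greedy(branch_q: List[List[float]], epsilon: float,
                   rng: random.Random) -> List[int]:
    """One subaction per branch: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return [rng.randrange(len(q)) for q in branch_q]
    return [argmax(q) for q in branch_q]


def td_target(reward: float, gamma: float,
              next_q_primary: List[List[float]],
              next_q_target: List[List[float]],
              done: bool = False) -> float:
    """y = r + gamma * mean_d Q_target_d(s', argmax_a Q_primary_d(s', a))."""
    if done:
        return reward
    picked = [q_t[argmax(q_p)] for q_p, q_t in zip(next_q_primary, next_q_target)]
    return reward + gamma * sum(picked) / len(picked)


def mse_loss(q_taken_per_branch: List[float], target: float) -> float:
    """Mean squared TD error across the action branches."""
    errors = [(target - q) ** 2 for q in q_taken_per_branch]
    return sum(errors) / len(errors)


# Hypothetical Q-values for the edge branch (2 subactions) and cycle branch (3).
next_primary = [[1.0, 3.0], [2.0, 0.0, 1.0]]
next_target = [[0.5, 2.0], [4.0, 0.0, 1.0]]
y = td_target(reward=1.0, gamma=0.5,
              next_q_primary=next_primary, next_q_target=next_target)
greedy = epsilon_greedy(next_primary, epsilon=0.0, rng=random.Random(0))
```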

B. Action Masking
At the early stage of training, the agent explores to find a valid schedule for a flow. Because each sub-action is selected from an action space of size $|E| \times |T|$, most edges and cycles selected at this stage cannot constitute a feasible path, which slows down learning. To avoid the sparse rewards induced by invalid actions during training, action masking is proposed in this section to block obviously invalid actions and thus improve the learning efficiency.
Action masking is applied in two phases of the training process: 1) the action selection phase and 2) the Q-value updating phase. We maintain binary lists $[a_i]$ and $[c_j]$, where $i \in E$ and $j \in T$, to filter the invalid actions each time the agent selects a sub-action. $a_i = 1$ if the $i$th edge is adjacent to the edge selected by the former action; otherwise, $a_i = 0$. When selecting valid cycles along the path, two aspects should be considered: 1) the maximum delay (offset) in each switch and 2) the residual available latency budget. Since the queues in a router transmit packets in a round-robin way, i.e., each queue transmits packets every $N$ cycles, a packet is delayed by at most $(N - 1)$ cycles in a switch. That is, $c_j = 1$ for $j \in (t, t + N - 1)$ should be satisfied, where $t$ is the cycle selected by the former sub-action.
Besides the constraint on the maximum offset in one node, the cycle selection should also satisfy the end-to-end latency requirement. Therefore, each time the agent selects an action, the latency that this flow has consumed should not exceed the maximum latency bound $D_k^{\max}$. That is, $c_j = 1$ for $j \in (t, D_k^{\max})$; otherwise, $c_j = 0$. The lists $[a_i]$ and $[c_j]$ are recalculated and multiplied by the original Q-values each time a sub-action decision is made. This narrows down the scope of action selection: the agent chooses the edge and cycle with the highest Q-value among the valid actions, which produces more positive experience and improves the sample efficiency of training.
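The two masks can be sketched as below. Several points are assumptions made for illustration: edges are directed `(tail, head)` pairs, the lower bound of the cycle window is exclusive and the upper bound inclusive, and hypercycle wrap-around is ignored. Instead of multiplying Q-values by the mask, the sketch restricts the argmax to valid entries, because a zeroed entry would otherwise beat valid entries with negative Q-values. All names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def edge_mask(edges: Sequence[Tuple[int, int]], current_node: int) -> List[int]:
    """a_i = 1 if edge i leaves the node reached by the previously selected edge."""
    return [1 if tail == current_node else 0 for tail, _head in edges]


def cycle_mask(num_cycles: int, prev_cycle: int, n_queues: int,
               deadline: int) -> List[int]:
    """c_j = 1 if j respects both the per-switch offset and the latency budget."""
    upper = min(prev_cycle + n_queues - 1, deadline)
    return [1 if prev_cycle < j <= upper else 0 for j in range(num_cycles)]


def masked_argmax(q_values: Sequence[float], mask: Sequence[int]) -> Optional[int]:
    """Best valid subaction, or None when every subaction is masked out."""
    valid = [i for i, m in enumerate(mask) if m == 1]
    if not valid:
        return None
    return max(valid, key=lambda i: q_values[i])


# Hypothetical 4-edge example; the previous edge ended at node 1.
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
a = edge_mask(edges, current_node=1)
c = cycle_mask(num_cycles=8, prev_cycle=2, n_queues=3, deadline=6)
best_edge = masked_argmax([5.0, -1.0, -0.5, 9.0], a)
```

In the example the unmasked maximum is edge 3, which is not adjacent to node 1; the masked selection returns edge 2 instead.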
Action masking is also applied in the process of Q-value updating. During the learning period, to avoid overestimating the actual Q-value, action selection is based on the Q-value of the policy network Q(s, a) in the TD-error calculation, as in (16). In this section, we modify the

C. Criticality-Aware Reward Function
Although the controller can schedule the real-time flows by assigning the cycles for packet transmission along the path with the CSQF mechanism, it cannot always guarantee that all real-time flows are scheduled successfully because of resource competition. Therefore, Deep-DFS should learn the criticalities of different DN flows, so that all HRT flows are scheduled successfully and the total utility for SRT flows is

We evaluate the performance of Deep-DFS on a ladder network topology introduced in an Ethernet Consist Network [23], which is an international standard for train communication networks. In this paper, the size of the ladder topology is varied from 6 nodes to 10 nodes. Besides the network topology, the DN flows for training and evaluation are both generated as per Table II with probabilities of 40%, 40% and 20% for the HRT, SRT and BE types, respectively. Note that the flows of $F_{BE}$ are considered background traffic, which has no deadlines. For each flow, $src_k$ and $dst_k$ are randomly selected from all nodes in the network. The packet length (in data units, DUs), period (in cycles) and delay bounds (in cycles) are selected randomly from the sets in Table II. In addition, we set the data rate of all physical links to 1 Gbit/s, and the capacity of each cycle is 100 DUs. The hypercycle is 16 according to the periods of all flows.
For the configuration of the BDQN, the parameters follow the common settings for designing neural networks [19], i.e., two fully connected hidden layers are deployed together with the input and output layers. The size of each hidden layer is 128, and the size of the output layer is 2, which represents the indices of the selected edge and cycle. The mini-batch size is 32 and the discount factor γ is set to 0.5. The initial value of ϵ is 1.0, and it decays to its final value of 0.01. The agent is trained and evaluated on the same topology but with different flows for training and evaluation.
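The flow generation described above can be sketched as follows. The 40%/40%/20% type probabilities come from the text; the value sets `PERIODS`, `PACKET_LENGTHS` and `DELAY_BOUNDS` are hypothetical placeholders and not the values of Table II (the periods were chosen only so that their least common multiple is 16). Drawing distinct source and destination nodes is also an assumption.

```python
import random
from dataclasses import dataclass
from functools import reduce
from math import gcd
from typing import List, Optional

FLOW_TYPE_PROBS = {"HRT": 0.4, "SRT": 0.4, "BE": 0.2}  # from the text
PERIODS = [2, 4, 8, 16]        # hypothetical placeholder values (cycles)
PACKET_LENGTHS = [1, 2, 3]     # hypothetical placeholder values (DUs)
DELAY_BOUNDS = [4, 8, 16]      # hypothetical placeholder values (cycles)


@dataclass
class Flow:
    kind: str
    src: int
    dst: int
    length: int
    period: int
    max_delay: Optional[int]   # None for BE flows, which have no deadline


def hypercycle(periods: List[int]) -> int:
    """Least common multiple of all flow periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def generate_flow(num_nodes: int, rng: random.Random) -> Flow:
    """Draw one flow with a random type, endpoints and parameters."""
    kinds = list(FLOW_TYPE_PROBS)
    kind = rng.choices(kinds, weights=[FLOW_TYPE_PROBS[k] for k in kinds])[0]
    src, dst = rng.sample(range(num_nodes), 2)  # two distinct nodes
    max_delay = None if kind == "BE" else rng.choice(DELAY_BOUNDS)
    return Flow(kind, src, dst, rng.choice(PACKET_LENGTHS),
                rng.choice(PERIODS), max_delay)


rng = random.Random(0)
flows = [generate_flow(10, rng) for _ in range(100)]
```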
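For reference, the stated hyperparameters can be collected in one configuration object. The linear ϵ schedule and `decay_steps` are assumptions, since the text gives only the initial and final values of ϵ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BDQNConfig:
    hidden_layers: int = 2         # fully connected hidden layers
    hidden_size: int = 128
    batch_size: int = 32
    gamma: float = 0.5             # discount factor
    epsilon_start: float = 1.0
    epsilon_end: float = 0.01
    decay_steps: int = 10_000      # hypothetical: not given in the text


def epsilon_at(step: int, cfg: BDQNConfig) -> float:
    """Linearly interpolate epsilon from its start to its end value (assumed schedule)."""
    frac = min(max(step, 0), cfg.decay_steps) / cfg.decay_steps
    return cfg.epsilon_start + frac * (cfg.epsilon_end - cfg.epsilon_start)


cfg = BDQNConfig()
```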

B. Baseline Methods Compared
To evaluate the effectiveness of Deep-DFS, we compare it with the following baseline methods.
1) DRLS: The DRL-based TTEthernet scheduler [6] outputs edge actions step by step to constitute a complete flow schedule, but uses a heuristic method to select the earliest available cycles with low degree.
2) HLS: The heuristic list scheduler (HLS) [5] selects a time slot which leads to the minimum end-to-end delay on the shortest routing path between the source node and the destination node of a flow. If this minimum delay exceeds the deadline of $f_k$, the scheduler fails to schedule this flow.
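The HLS idea (shortest path, then the earliest feasible cycle on each hop) can be illustrated with a simplified sketch. It is not the scheduler of [5]: it ignores periodic repetition within the hypercycle and the per-switch offset limit, counts the deadline in cycles from cycle 0, and uses a hypothetical 6-node ladder adjacency.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]


def shortest_path(adj: Dict[int, List[int]], src: int, dst: int) -> Optional[List[int]]:
    """Breadth-first search; returns the node sequence or None if unreachable."""
    parent: Dict[int, Optional[int]] = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path: List[int] = []
            node: Optional[int] = u
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def hls_schedule(adj: Dict[int, List[int]], load: Dict[Tuple[int, int, int], int],
                 src: int, dst: int, size: int, deadline: int,
                 capacity: int = 100) -> Optional[List[Tuple[Edge, int]]]:
    """Assign the earliest cycle with spare capacity on every hop of the shortest path."""
    path = shortest_path(adj, src, dst)
    if path is None:
        return None
    schedule: List[Tuple[Edge, int]] = []
    cycle = 0
    for u, v in zip(path, path[1:]):
        cycle += 1  # a hop must use a later cycle than the previous hop
        while cycle <= deadline and load.get((u, v, cycle), 0) + size > capacity:
            cycle += 1
        if cycle > deadline:
            return None  # minimum delay exceeds the deadline
        schedule.append(((u, v), cycle))
    for (u, v), c in schedule:  # reserve capacity only for a complete schedule
        load[(u, v, c)] = load.get((u, v, c), 0) + size
    return schedule


# Hypothetical 6-node ladder: rails 0-1-2 and 3-4-5, rungs 0-3, 1-4, 2-5.
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
load: Dict[Tuple[int, int, int], int] = {}
first = hls_schedule(adj, load, 0, 5, size=60, deadline=4)
second = hls_schedule(adj, load, 0, 5, size=60, deadline=4)
third = hls_schedule(adj, load, 0, 5, size=60, deadline=4)
```

Because every flow is pushed onto the same shortest path and the earliest cycles, the third identical flow in this example is rejected even though the alternative path through nodes 3 and 4 is still idle.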

C. Incremental Scheduling Scenario
In this scenario, we randomly generate flows one by one and insert them into the network incrementally. If the agent fails to schedule an HRT flow, it stops; we then compare the maximum number of successfully scheduled HRT flows and the utility of the SRT flows of each solution.
As shown in Fig. 4(a), Deep-DFS schedules more HRT flows than the other two methods in general: 14.8% more on average than DRLS and 32.1% more on average than HLS in ladder networks. Since HLS always selects the first available time slot to transmit the packets on the shortest path, it yields the minimum latency for all flows regardless of the flow criticalities and their delay bounds. Therefore, HLS saturates the cycles and edges quickly and increases the probability that flows are blocked by fully occupied cycles. DRLS takes into account the cycle usage on the edge and avoids selecting cycles with high degree in order to save more bandwidth for flows with different periods. However, DRLS still tries to select the first available cycle for packet forwarding and minimizes the E2E delay of the flow, which conflicts with the long-term objective of maximizing the number of flows scheduled in the system. In contrast, Deep-DFS redesigns the network state representation and reward function to make delay-aware decisions on edge and cycle selection, so that the HRT flows are prioritized and no bandwidth resources are wasted on minimizing the E2E delay. When the ladder topology becomes larger, Deep-DFS schedules more flows in the 10-node ladder topology (39.1% more than HLS on average) than in the 6-node one (29.3% more than HLS on average). This is because Deep-DFS has more exploration space in a larger topology, while HLS only selects the shortest paths to route the flows regardless of topology size.
We also evaluate the average utility of SRT flows against the percentage of BE traffic. We vary the probability of generating BE traffic from 0.2 to 0.36, while HRT and SRT flows are generated with equal probability. As we can see in Fig. 4(b), the average utility decreases as more BE traffic is inserted into the network. With increasing BE traffic, the network bottleneck is reached earlier, and the utility of SRT flows in the HLS case decreases slightly faster than with the other two methods due to the selection of shortest paths, as discussed above. The link and cycle usages are also shown in Fig. 4(c) and Fig. 5(a). The results show that although the HLS method leads to more cycles with high traffic load (≥60%), the link usage induced by HLS is lower than that of Deep-DFS, which is counterintuitive. This is because minimizing the E2E delay with HLS exhausts the resources in the earlier cycles, so HLS also stops scheduling flows earlier than Deep-DFS. That makes the overall link usage of HLS lower than that of Deep-DFS by 21.3% on average in the ladder topology. We also find that the link usage decreases slightly with more nodes in the ladder topology for Deep-DFS, as it prefers to choose a longer route to balance the link load and avoid the bottleneck.
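The evaluation loop of this scenario can be sketched as below. The `try_schedule` callback and the per-flow utility it returns are hypothetical stand-ins for any of the compared schedulers; flows are plain dictionaries only for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_incremental(flows: List[Dict],
                    try_schedule: Callable[[Dict], Optional[float]]) -> Tuple[int, float]:
    """Insert flows one by one and stop at the first HRT flow that cannot be scheduled.

    try_schedule returns None on failure, otherwise the utility of the flow.
    Returns (number of HRT flows scheduled, total utility of scheduled SRT flows).
    """
    hrt_scheduled = 0
    srt_utility = 0.0
    for flow in flows:
        result = try_schedule(flow)
        if flow["kind"] == "HRT":
            if result is None:
                break  # the run ends with the first unschedulable HRT flow
            hrt_scheduled += 1
        elif flow["kind"] == "SRT" and result is not None:
            srt_utility += result
    return hrt_scheduled, srt_utility


# Hypothetical stub scheduler: success is pre-decided by the "ok" field.
def stub(flow: Dict) -> Optional[float]:
    return flow.get("utility", 1.0) if flow["ok"] else None


demo_flows = [
    {"kind": "HRT", "ok": True},
    {"kind": "SRT", "ok": True, "utility": 0.5},
    {"kind": "BE", "ok": True},
    {"kind": "HRT", "ok": True},
    {"kind": "SRT", "ok": False},
    {"kind": "HRT", "ok": False},
    {"kind": "HRT", "ok": True},
]
result = run_incremental(demo_flows, stub)
```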
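The high-load statistic of Fig. 5(a) can be computed along these lines. This is a sketch assuming the per-cycle load of each edge is available as a list of used DUs; the data layout and the function name are hypothetical.

```python
from typing import Dict, List


def high_load_fraction(load: Dict[str, List[int]], capacity: int = 100,
                       threshold_pct: int = 60) -> float:
    """Fraction of (edge, cycle) slots whose load is at least threshold_pct of capacity."""
    total = sum(len(cycles) for cycles in load.values())
    if total == 0:
        return 0.0
    # Integer comparison avoids floating-point rounding at the threshold.
    high = sum(1 for cycles in load.values() for used in cycles
               if used * 100 >= threshold_pct * capacity)
    return high / total


# Hypothetical loads (in DUs) on two edges over four cycles each.
demo_load = {"e0": [100, 60, 59, 0], "e1": [0, 0, 80, 10]}
fraction = high_load_fraction(demo_load)
```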

D. Multipath Scheduling Scenario
As shown in Fig. 6, to implement multipath scheduling with a DRL solution, a flow splitting module is needed in the DRL agent. The flow splitting and tagging function and the flow recovery function should also be deployed to map the multipath selection. Upon receiving the request information of flow $f_k$ from the host, the DRL agent calculates the action for this flow. If the agent fails to find a valid schedule for flow $f_k$, the flow splitting module modifies the request information by dividing the size of the flow by q, e.g., into halves for q = 2, while keeping the other requirements unchanged. Then the q sub-flows are fed into the action selection module again. If they are scheduled successfully, in other words, if the q sub-flows are scheduled on different paths while the delay requirements of all sub-flows are satisfied, the corresponding flow splitting information is tagged on the packets of this flow, and they are reassembled at the destination host. Once the agent fails to schedule one of the q sub-flows, it stops.
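The split-and-retry logic can be sketched as follows, assuming flow sizes are whole DUs and that the remainder is spread over the first sub-flows. The `try_schedule` callback is hypothetical; the sketch does not enforce that sub-flows take different paths, and releasing resources reserved for earlier sub-flows on failure is left out.

```python
from typing import Callable, Dict, List, Optional


def split_flow(size_du: int, q: int = 2) -> Optional[List[int]]:
    """Split a flow into q sub-flows of whole DUs; None if it cannot be split."""
    if size_du < q:  # each sub-flow needs at least 1 DU
        return None
    base, extra = divmod(size_du, q)
    return [base + 1 if i < extra else base for i in range(q)]


def schedule_with_split(flow: Dict, try_schedule: Callable[[Dict], Optional[str]],
                        q: int = 2) -> Optional[List[str]]:
    """Try a single-path schedule first, then fall back to q sub-flows."""
    single = try_schedule(flow)
    if single is not None:
        return [single]
    sizes = split_flow(flow["size"], q)
    if sizes is None:
        return None
    schedules: List[str] = []
    for size in sizes:
        sub_flow = {**flow, "size": size}  # other requirements stay unchanged
        result = try_schedule(sub_flow)
        if result is None:
            return None  # one failed sub-flow fails the whole flow
        schedules.append(result)
    return schedules


# Hypothetical stub: only flows of at most 2 DUs fit on a path.
def stub(flow: Dict) -> Optional[str]:
    return f"ok-{flow['size']}" if flow["size"] <= 2 else None
```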
We set the topology with 10 nodes and generate flows with an average packet size (in DUs) from 1.5 to 3.5. We also assume that q = 2 and that 1 DU is the minimal transmission unit, which cannot be split any further. From Fig. 5(b), we can observe that the number of scheduled HRT flows decreases as the average packet size increases, since larger packets consume more bandwidth resources. However, compared with single-path scheduling, multipath scheduling performs better in finding valid schedules for HRT flows. In addition, the number of scheduled flows decreases more slowly with multipath scheduling than with single-path scheduling, because the HRT flows have a higher chance of being scheduled after they are split into multiple sub-flows. We also evaluate the performance of multipath scheduling in terms of the jitter of the SRT flows, as shown in Fig. 5(c).
As there is no hard deadline for the SRT flows, the jitter of the SRT flows is usually higher than that of the HRT flows, because once an HRT flow $f_k$ is scheduled, its delay jitter is bounded by $(D_k^{\max} - D_k^{\min})$. We find that multipath scheduling also outperforms single-path scheduling in terms of jitter. Furthermore, the jitter of the SRT flows with multipath scheduling is kept within three cycles on average, while the jitter with single-path scheduling is much larger, due to the more severe competition for resources when only one transmission path is available.

VI. CONCLUSION
In this study, we proposed a deep reinforcement learning based deterministic flow scheduler (Deep-DFS) to solve the scheduling problem of DetNet (DN) flows with multiple criticalities. We leverage Deep-DFS to determine the routing and cycle selection for DN flows with the Cycle Specified Queuing and Forwarding (CSQF) mechanism, where 1) the timeline is divided into multiple cycles of equal duration and 2) the controller can specify the cycles for packet forwarding so that the end-to-end delay of DN flows can be controlled effectively. To enable Deep-DFS to schedule flows with multiple criticalities, several techniques were proposed to increase scalability and performance. Compared with the other AI-based and heuristic-based methods, Deep-DFS increases the number of scheduled flows by 14.8% and 32.1%, respectively. However, it is worth noting that the proposed centralized scheduling approach is only suitable for scenarios with small network scales, e.g., within a factory network. If the network scale is large, the round-trip time of the signaling between the source node of a flow and the controller also becomes too large to be acceptable for a time-sensitive flow. Furthermore, runtime reconfiguration is also a challenge for a centralized controller. Large-scale deterministic flow scheduling requires a different solution, e.g., distributed learning and distributed scheduling approaches, which can facilitate the learning process and network configuration locally. Therefore, how to design a distributed learning architecture and efficiently train distributed learning agents for flow scheduling with segment routing will be considered in our future work.
where
• $src_k$ and $dst_k$ represent the source and destination nodes of flow $f_k$.
• $prd_k$ denotes the period of flow $f_k$, which means the source node sends the packets every $prd_k$ cycles.
• $bw_k$ is the total traffic that the source node emits in one cycle.
• The delay experienced by packets should be larger than a minimum delay bound $D_k^{\min}$ and smaller than a maximum delay bound $D_k^{\max}$, so that the jitter does not exceed $(D_k^{\max} - D_k^{\min})$.
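The flow attributes listed above map naturally onto a small record type. This is a sketch: the field names mirror the symbols in the text, while the class name, the integer units and the inclusive treatment of the delay bounds are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DNFlow:
    src: int     # source node src_k
    dst: int     # destination node dst_k
    prd: int     # period prd_k, in cycles
    bw: int      # traffic emitted by the source in one cycle, in DUs
    d_min: int   # minimum delay bound D_k^min, in cycles
    d_max: int   # maximum delay bound D_k^max, in cycles

    def max_jitter(self) -> int:
        """Upper bound on the jitter implied by the two delay bounds."""
        return self.d_max - self.d_min

    def delay_ok(self, delay: int) -> bool:
        """Whether a delay lies within the bounds (treated as inclusive here)."""
        return self.d_min <= delay <= self.d_max


flow = DNFlow(src=0, dst=5, prd=4, bw=3, d_min=2, d_max=6)
```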

Fig. 4: (a) Number of HRT flows scheduled; (b) Utility of SRT flows; (c) Link Usage with different nodes in ladder topology.

Fig. 5: (a) Percentage of cycles with traffic load over 60%; (b) Number of HRT flows scheduled under single and multi-path scheduling; (c) Average jitter of SRT flows under single and multi-path scheduling.

TABLE I: Notation and variables

Notation | Description
$t_i^k$ | Index of the cycle of edge $e_i$ on which the packets of flow $k$ are transmitted
$a_i^k$ | The $i$th sub-action for flow $k$
$R(a_i^k)$ | Reward of the $i$th sub-action for flow $k$
$U_{e_i}$ | Link utilization rate of $e_i$
$U_s$ | Global link usage of the topology
$Q_{e_i}^t$ | Whether cycle $t$ on $e_i$ is fully occupied
$I_{e_i}^t$ | Cycle utilization rate of $t$ on $e_i$
$I_{e_i}$ | Overall cycle utilization rate of edge $e_i$
$P_{e_i}^t$ | Traffic load in cycle $t$ of edge $e_i$
$n_d$ | Number of sub-actions of the $d$th dimension
$A_d(s_t, a_i^k)$ | Advantage of the $d$th dimension
$V(s_t)$ | State value

TABLE II: Flow types with different criticalities

By conducting simulation experiments of Deep-DFS, we demonstrate the effectiveness of Deep-DFS in maximizing the number of scheduled flows under different network scales as well as flow distributions by comparing it with two benchmark methods.