Reinforcement Learning-Based Power-Saving Algorithm for Video Traffics Considering Network Delay Jitter

Energy efficiency and the scheduling of the power-saving mode have recently been studied as key technologies for reducing the energy consumption of device-to-device (D2D) communication. Wi-Fi Direct (P2P), one of the key protocols for D2D communication, defines an on-off power-saving mechanism called the notice of absence (NoA) power-saving mode, which can be applied to multimedia video traffic. This mechanism lets the user transmit or receive real-time video frames during an awake interval, within which the video frame rate requirement must be met. When the user can transmit a whole video frame before the end of the required inter-frame interval, the device can switch to the sleep mode to save power. However, the remaining challenge of the NoA method is that the length of the awake/sleep interval is fixed even when the traffic load varies. Therefore, in this study, we propose a reinforcement learning-based power-saving (RLPS) method that enhances the NoA power-saving mode in Wi-Fi Direct by taking multimedia video transmission and network delay jitter into consideration. The proposed RLPS method enables a Wi-Fi Direct device to dynamically estimate, in real time, the length of the awake interval needed to transmit the next video frame. In addition, because of network delay jitter, a Wi-Fi Direct device may wake up before the video frame arrives, so the client device has to wait to receive the frame. To tackle this challenge, the proposed RLPS method enables the device to predict the start time of the awake interval in order to reduce the waiting time for the upcoming video frame. Results show that the proposed RLPS method outperforms the existing NoA power-saving mode in terms of the outage probability, energy consumption, and transmission delay of Wi-Fi Direct devices.

INDEX TERMS Wi-Fi direct, opportunistic power saving, NoA, reinforcement learning, network delay jitter.

I. INTRODUCTION
With the drastic increase of mobile data communication and the emergence of smart devices, it has become an urgent problem to improve system capacity and strengthen the quality of service (QoS) of users. D2D communication is a key technology for upcoming communication systems, which is designed to solve this problem by increasing the system capacity, reducing the transmission delay, and improving the overall spectrum efficiency [1]. However, improving energy efficiency is crucial for D2D communication because a D2D user typically uses handheld equipment with an energy-limited battery.

The associate editor coordinating the review of this manuscript and approving it for publication was Aasia Khanum.

it has the ability to predict future scenarios, adapt to changing network environments, and discover patterns that a human can miss [9], [10]. Particularly, ML enables a device to learn from its experience without human intervention. Reinforcement learning (RL) is a branch of ML that enables an agent to deal with a dynamic problem using a trial-and-error strategy. With this strategy, the agent can try a possible solution, get a reward from its environment, and transition to a new state. By repeatedly trying all possible solutions in all states, the agent can learn from its experience and select the best solution according to the optimal decision-making policy. Hence, RL has shown great potential for solving non-convex and control problems.

With the key features of RL above, various works focusing on the power-saving problem in wireless communication have been conducted. [11] proposed an ALOHA and RL-based medium access control (MAC) protocol with Informed Receiving for wireless sensor networks.
This method enables a transmitter to inform its receiver about its future slot selection so that it can turn off its radio in other slots to reduce energy consumption. [12] used RL for opponent modeling and proposed a cooperative communication protocol based on the received signal strength indicator and node energy consumption in a competitive context. [13] proposed an RL-based MAC protocol for wireless sensor networks and employed an RL frame to schedule the sleep and active periods of a node to minimize energy consumption. The authors of [14] considered a sensor node capable of operating in three modes: 1) transmission, 2) listening, and 3) sleep. In that study, the authors proposed an RL algorithm to select an optimal action based on the decision-making policy, where the actions correspond to those three operation modes. In addition, the RL method is employed to adjust the length of the sleep and awake intervals in order to improve energy efficiency while guaranteeing efficient packet transmission. In [15], P. Verma et al. applied the RL method to handle the lifetime problem in wireless sensor networks, where the battery of a sensor node has limited energy and cannot be recharged. In that study, the RL algorithm enables each node to select an optimal activity by itself, where the activities consist of sleep, awake, and the adjustable sleep/awake interval time, to preserve energy efficiency while ensuring effective packet transmission. The authors of [16] demonstrated that the fixed length of the active/sleep period in the IEEE 802.15.4 standard is not suitable for topology changes in a dynamic sensor network.
Therefore, the authors presented an RL-based method to find the optimal duty cycle for the purpose of enhancing the energy efficiency of the IEEE 802.15.4 standard.

Since the existing NoA method provides a fixed length of the sleep/awake interval, which makes it unsuitable for applications such as video streaming, our study proposes the reinforcement learning-based power saving (RLPS) scheme to adjust the length of the awake interval according to the varying frame size of the video stream. In addition, although a video frame is sent at the scheduled time, it may arrive late at the destination because of network delay jitter. The proposed RLPS method enables the receiver

VOLUME 10, 2022

Direct devices discover each other by performing a conventional Wi-Fi scan to negotiate which device will be selected as a group owner (GO). Then, the other devices act as clients, and they are referred to as group members. After Wi-Fi Direct devices discover each other, a power-saving mode is implemented for data transmission.

There are two power-saving mechanisms in Wi-Fi Direct, i.e., the OPS and NoA modes, as shown in Figure 1. In the OPS mode, the GO periodically sends a beacon message to schedule the transmission time. The beacon message includes the start time of the CTWindow, the length and number of CTWindows in a beacon interval, and the power management (PM) bit indicator. The PM bit is used to notify that the client has data to transmit to the GO. The PM bit is set to 0 when the client has data to transmit and to 1 at the end of transmission. The GO and the clients simultaneously wake up to transmit data during CTWindows. When data are completely transmitted during a CTWindow, the client can switch to the sleep mode at the end point of the CTWindow. If the GO does not transmit the whole data, the client extends the CTWindow to receive the remaining data. In the NoA power-saving mode, communication is initiated when the GO sends a beacon message to a client. The beacon message includes scheduling information, such as the start time of the doze mode and the number and length of sleep/awake intervals in a beacon interval. Therefore, the GO and client can simultaneously wake up to exchange data during an awake interval and enter the doze mode to reduce energy consumption. Figure 1 (b) shows that the first and second beacon intervals consist of two and three doze intervals, respectively. Therefore, it should be noted that the NoA power-saving mode provides the flexibility of dynamically adjusting the length of sleep/awake intervals according to traffic load.

A video codec with a multi-layered group of pictures (GoP) structure, such as MPEG-2 video, supports various frame sizes. Usually, video frames in a multi-layered GoP structure are categorized into three types: Intra (I), Predictive (P), and Bi-directional (B).
Video frames are encoded in a sequence referred to as a GoP to reduce their spatial and temporal redundancies [17], [18]. The GoP structure is generally described as MxNy, where x is the number of frames in the P-P or I-P interval and y is the total number of frames in the GoP. Fig. 2 shows the M3N9 GoP structure, which is encoded as IBB-PBB-PBB. In the GoP structure, an I-frame is encoded without referring to any other frames. A P-frame is encoded with reference to the previous P-frame and I-frame. A B-frame is encoded with reference to the previous I-frame and next P-frame, or to the previous P-frame and next P-frame. Therefore, frame loss can occur in the following three manners: 1) The loss of an I-frame results in the loss of all frames in the GoP. 2) If a P-frame is lost, the following P-frame and B-frame are lost. 3) If a B-frame is lost, no other frame in the GoP is lost. This inter-dependency among frames determines the frame priority in the order of I-, P-, and B-frames.
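The MxNy notation and the three loss rules above can be illustrated with a short Python sketch. The function names `gop_pattern` and `lost_frames` are hypothetical and not part of the paper. The sketch assumes that y is a multiple of x and reads rule 2 as "every later frame in the same GoP becomes undecodable", which is one possible interpretation of the text.

```python
def gop_pattern(m: int, n: int) -> str:
    """Frame types of one MxNy GoP in display order, e.g. M3N9 -> 'IBBPBBPBB'."""
    if m < 1 or n < 1 or n % m != 0:
        raise ValueError("this sketch assumes n is a positive multiple of m")
    frames = []
    for pos in range(n):
        if pos == 0:
            frames.append("I")   # the GoP always starts with an I-frame
        elif pos % m == 0:
            frames.append("P")   # every m-th frame after the I-frame is a P-frame
        else:
            frames.append("B")   # all remaining positions are B-frames
    return "".join(frames)


def lost_frames(pattern: str, lost_index: int) -> list[int]:
    """Indices that become undecodable when the frame at lost_index is lost."""
    kind = pattern[lost_index]
    if kind == "I":
        # Rule 1: losing the I-frame loses the whole GoP.
        return list(range(len(pattern)))
    if kind == "P":
        # Rule 2 (assumed reading): the lost P-frame and every later frame are lost.
        return list(range(lost_index, len(pattern)))
    # Rule 3: losing a B-frame affects only that frame.
    return [lost_index]
```

For example, `gop_pattern(3, 9)` returns `"IBBPBBPBB"`, which matches the IBB-PBB-PBB sequence of Fig. 2, and `lost_frames("IBBPBBPBB", 1)` returns `[1]` because a lost B-frame does not affect other frames.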

The inter-dependency among I-, P-, and B-frames has been utilized to increase the transmission efficiency of the power-saving mode in wireless networks. In [19], [20], the authors sug-

Since it is impossible to set the length of an awake interval to exactly fit the size of each frame due to variable frame sizes, [19] proposed an algorithm that uses the video frame size distribution. In this work, the length of the awake interval is adjusted dynamically according to the probability density function (pdf) of the size of I-, P-, and B-frames, and the transmission failure rate is controlled by a target value, which is decided by the mean and the standard deviation of the pdfs. In addition, the frame transmission strategy based on frame priority in [19] provides a method to deal with the remaining fraction of a frame: any remaining fraction of an I-frame (P-frame) that is not delivered during an awake interval is concatenated with the immediately following B-frame and transmitted along with that B-frame during the next awake interval. This method can reduce the outage probability of the I-frame (P-frame). However, the operational procedure of the algorithm in [19] assumes that the probabilistic properties of I-, P-, and B-frames are already known and fixed, which is not a realistic scenario. To tackle this issue, [20] proposed an Expectation Maximization (EM)-based power-saving method. In this work, the EM algorithm is employed to update the scale and shape parameters of the pdf of the video frame size, and the weights of all the gamma mixture components, whenever a frame is transmitted. Based on this estimated pdf, the awake interval for each frame is determined by using the target probability.

Although the above two algorithms show enhanced performance compared to the NoA power-saving method, they do not consider phenomena that are specific to wireless network environments. That is, the above algorithms assume that frames arrive at the end device consecutively at equal time intervals. However, even though a video codec periodically sends a frame at a scheduled time, the frame may arrive either early or late at the destination, which results in network delay jitter. Motivated by this fact, in this study, we propose an RL-based dynamic power-saving (RLPS) method to enhance the performance of the NoA power-saving mode. The proposed algorithm uses RL to dynamically adjust the length of awake intervals according to the video traffic type and predicts the start point of each awake interval to reduce the effect of network delay jitter.

Let $T_a[k,l]$ and $T_s[k,l]$ denote the start and end points of the awake interval allotted to the l-th frame of the k-th GoP, respectively. The length of this awake interval can be defined as $L[k,l] = T_s[k,l] - T_a[k,l]$. (1)

In real-time video traffic, a video frame may not arrive at the

the l-th frame of the (k + 1)-th GoP can be updated by

where |·| is the absolute value, α (0 < α < 1) is the learning

where $T_s[k,l]$ and $L[k,l]$ denote the actual served time of the l-th frame in the k-th GoP and the length of the l-th frame in the k-th GoP, respectively, and β is a scaling factor used to scale the end point of the awake interval. Our proposed RL-based power-saving method enables the Wi-Fi Direct device to update the end time of the awake interval in order to find the optimal $T_s^*[k,l]$. The optimal end time of the awake interval is obtained when the absolute error converges to zero. Thus, the optimal $T_s^*[k,l]$ is given by

Since the end time of the awake interval for receiving a video frame varies according to the length of the video frame, the predicted end time may converge to the average served time $E[T_s[k,l]]$ when the scaling factor β is set to zero. In that case, the outage probability, i.e., the probability that a video frame cannot be wholly transmitted, would be high. To avoid this issue, we introduce the variation of the video frame length weighted by the scaling factor, $\beta L[k,l]$, to scale up the optimal end time of the awake interval. The outage probability approaches zero, which means that almost all video frames can be wholly transmitted, as the value of β increases. Therefore, in practice, the scaling factor should be chosen based on the requirements of the system. In addition, j may be set to 1, 2, or 3 to decrease, maintain, or increase the end point of the awake interval, respectively.
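The trade-off controlled by β can be illustrated with a small Python sketch. This is not the paper's update rule, whose equations are not reproduced here. It assumes, purely for illustration, that the awake interval ends at the mean served time plus β times the population standard deviation of the served times, and it counts a frame as an outage when its served time exceeds that end point. The names `awake_end` and `outage_fraction` are hypothetical.

```python
from statistics import mean, pstdev


def awake_end(served_times, beta):
    """Assumed end point: mean served time plus beta times the spread of served times."""
    return mean(served_times) + beta * pstdev(served_times)


def outage_fraction(served_times, beta):
    """Fraction of frames whose served time exceeds the assumed awake end point."""
    end = awake_end(served_times, beta)
    missed = sum(1 for t in served_times if t > end)
    return missed / len(served_times)
```

With β = 0 the end point sits at the average served time, so every frame that needs longer than average is cut off. Increasing β moves the end point later and lowers the outage fraction, at the cost of a longer awake interval and therefore higher energy consumption.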

Here, we employ the RL algorithm to select the optimal values of i and j, which are denoted by $i^*$ and $j^*$, respectively. In RL, the problem to be solved is described as a Markov Decision Process (MDP). The basic idea of an MDP is that the agent in the current state interacts with its environment by taking an action according to the policy. As a result, the agent receives a reward and transitions to the next state. From the definitions in [22] and [23], an MDP is defined as a tuple $\langle S, A, R, P \rangle$, where S and A are the finite sets of states and actions, respectively, R is the reward function, and P is the transition probability of moving from the current state $S_{mn}[k,l] \in S$ to the next state $S_{m'n'}[k+1,l] \in S$ when using the policy P π (S m n [k [k, l] . The goal of an MDP is to find an optimal policy $\pi^*$ that maximizes the reward function R. The detailed description of the MDP model in our work is given as follows.

• Agent: An agent corresponds to a client, which interacts with a GO to adjust the start and end points of the awake interval, and the length of awake intervals.

is defined as

and increase the served time to receive the l-th frame in the next (k + 1)-th GoP. The method to reduce the arrival time and increase the served time is given in (2) and (3), where i and j are set to 1 and 3, respectively. In summary, the Wi-Fi Direct device in the current state has to select the optimal action from the action space, i.e., the action that maximizes the reward. In our design, the maximum reward is equal to zero ($\max_{i,j} R_{ij}[k,l] = R^*_{ij}[k,l] = 0$), and it is achieved when $F_a^*[k,l] = 0$ and $F_s^*[k,l] = 0$. That is, the agent tries to minimize the error between the estimated and actual arrival times, compensated for the variation of the network delay jitter, and simultaneously to minimize the error between the estimated and actual served times, compensated for the varying service time.

In our study, we calculate the Q-value function as a function of the current reward and the maximum Q-value function of the previous step. In a practical network, when the Wi-Fi Direct device transmits video frames in real time, it can store the previous information, including the start/end time of the awake interval, the length of the awake interval, the state, action, reward, and Q-value function. Then, the Q-value function can be updated according to that previous information. From [21] and [23], the one-step Q-value of the current action is defined as

where γ is the discount factor and $\max_{A_{ij}[k-1,l] \in A} Q_{ij}[k-1,l]$ is the maximum Q-value of the previous action. From [23], the ε-greedy policy is used to select the greedy action, $A_{i^*j^*}[k,l]$, with probability $\pi(A_{ij}[k,l] \mid S_{i^*j^*}[k,l])$, which is defined as in (7), shown at the bottom of the next page.
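The Q-value update and the ε-greedy selection described above can be sketched in Python as follows. This is a minimal illustration, not the paper's Algorithm 1. It assumes that the one-step Q-value is the current reward plus the discounted maximum Q-value of the previous step, that the action space is every pair (i, j) with i and j in {1, 2, 3}, and that ε-greedy follows the textbook form in which a uniformly random action is explored with probability ε. The names `ACTIONS`, `q_update`, and `epsilon_greedy` are hypothetical.

```python
import random

# Assumed action space: all (i, j) pairs with i, j in {1, 2, 3}, i.e. 9 actions.
ACTIONS = [(i, j) for i in (1, 2, 3) for j in (1, 2, 3)]


def q_update(reward, prev_q, gamma):
    """Assumed one-step rule: current reward plus discounted best previous Q-value."""
    return reward + gamma * max(prev_q.values())


def epsilon_greedy(q, epsilon, rng=random):
    """Explore a uniformly random action with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    # Greedy choice: the action with the largest stored Q-value.
    return max(ACTIONS, key=lambda a: q[a])
```

Under this textbook form, the greedy action is chosen with probability 1 − ε + ε/9 and every other action with probability ε/9. Because the rewards in this design are never positive, a discount factor below one keeps the accumulated Q-value bounded instead of letting it drift toward negative infinity.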

In (7), v = 9 represents the total number of actions. The proposed RLPS algorithm is summarized in Algorithm 1.

As discussed in Section II-B, the loss of an I-frame results in the loss of all frames in the GoP, and the loss of a P-frame results in the loss of the following P-frame and B-frame, but the loss of a B-frame does not influence other frames in the GoP. To reduce the outage probability, the remaining fraction of an I-frame (P-frame) is concatenated with the following

next awake interval. In addition, the scaling factor (β) given

This outage probability is expressed as

where $N_I$ is the total number of I-frames. The loss probability of the P-frame is defined as

where $P_P$ is the probability that the P-frame is fully transmitted. According to the inter-dependency between video frames in the GoP structure and the priority-based frame transmission strategy, a current P-frame is completely transmitted when the previous I-frame and P-frame are successfully transmitted and the length of the current P-frame is shorter than the following B-frame interval. The successful transmission probability of the P-frame is defined as

where $N_P = \left(\frac{n}{m} - 1\right) N_I$ is the total number of P-frames and $L[k, rm+2]$ is the length of the awake interval allotted to the combined residual P-frame and B-frame.

The outage probability of B-frames is defined as the ratio of the number of lost B-frames to the total number of B-frames, which is given by

denotes the number of lost B-frames, $N_B^{Suc}$ represents the number of B-frames that are successfully transmitted, and $N_B$ is the total number of B-frames. The successful transmission of B-frames occurs in three cases. First, a B-frame is successfully transmitted when the length of the combined residual I-frame and B-frame is shorter than the length of the awake interval allotted to these combined frames, which is given by (14). Second, a B-frame is fully transmitted if the I-frame and previous P-frame are successfully transmitted and the length of the combined residual P-frame and B-frame is shorter than that of the awake interval allotted to these frames. In this case, the number of B-frames that are successfully transmitted is given by (15). Finally, a B-frame that is not combined with the residual I-frame (or P-frame) is successfully transmitted when the length of this B-frame is shorter than its awake interval and the previous I-frame and P-frame are fully transmitted. Thus, the number of B-frames that are successfully transmitted is calculated as in (16), as shown at the bottom of the next page. Overall, the total number of B-frames that can be fully transmitted is defined as

Therefore, the outage probability of B-frames can be derived using (13)-(17).

given in the section III-C. The average energy consumption is defined as the sum of the energy consumed during the awake and sleep intervals plus an additional energy, which is consumed to switch from the sleep to the awake mode.

The power-saving method enables the Wi-Fi Direct device to power off its circuitry to save energy. Thus, the energy consumption varies according to the length of the sleep/awake interval. The energy consumption metric is significant because it quantifies how much energy the device consumes to receive one video stream in real time. Most importantly, it allows us to verify that the proposed algorithm reduces the energy consumption of the device compared to the traditional NoA power-saving method. The method to obtain the energy consumption is described as follows. The average energy consumption is defined as the sum of the energy consumed during the awake and sleep intervals, and the additional energy used to switch from the sleep to the awake mode. The average energy consumption during an awake interval is defined as

The average energy consumption of frame transmission is defined as

where $E_{switch}$ is the total energy required to switch from the sleep mode to the awake mode.
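The verbal definition above (awake energy plus sleep energy plus switching energy) can be sketched in Python as follows. This is only an illustration of that sum, not the paper's exact expressions. The `PowerProfile` fields, the `frame_energy` function, and the numeric values in the usage note are hypothetical, and the sketch assumes constant power draw within each mode.

```python
from dataclasses import dataclass


@dataclass
class PowerProfile:
    awake_w: float   # assumed constant power while awake, in watts
    sleep_w: float   # assumed constant power while asleep, in watts
    switch_j: float  # assumed energy per sleep-to-awake transition, in joules


def frame_energy(profile, awake_s, interval_s, wakeups=1):
    """Energy for one inter-frame interval: awake part, sleep part, and switching."""
    if not 0.0 <= awake_s <= interval_s:
        raise ValueError("awake time must lie within the inter-frame interval")
    sleep_s = interval_s - awake_s
    return (profile.awake_w * awake_s
            + profile.sleep_w * sleep_s
            + profile.switch_j * wakeups)
```

For instance, at 24 fps the inter-frame interval is 1/24 s, so `frame_energy(PowerProfile(1.0, 0.1, 0.002), 0.01, 1 / 24)` evaluates the energy of one frame period with a 10 ms awake interval under these made-up power values. Lengthening the awake interval raises the result whenever the awake power exceeds the sleep power.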

For performance evaluation, we decoded the movie titled 'Jurassic World (2015)' using the Elecard StreamEye Studio software, which is a video quality test software for the analysis of stream structures and the inspection of code parameters [24]. The GoP structure for this video is encoded as M3N30. The standard of this video is MPEG-2, which requires a frame rate of 24 fps to support a resolution of 1920 × 1080 [25]. It is noted that the frame rate of 24 fps is

In addition, we aim to scale down the previous Q-value to avoid the problem of divergence to negative infinity.

All parameters used in the simulation runs are summarized in Table 1.

The increase in energy consumption is caused by scaling up the length of the awake interval, which reduces the outage probability because most of the video frames can be wholly transmitted during the current awake interval. Furthermore, scaling up the awake interval length also decreases the average delay because fewer residual frames have to wait to be transmitted during the next awake interval. Fig. 4 shows the average delay and outage probability of a frame for the movies titled 'Amazing Mary Gifted' and 'Jurassic World', assuming the same simulation parameters. Since 'Jurassic World' has more active scenes than 'Amazing Mary Gifted', the average frame size of 'Jurassic World' is larger than that of 'Amazing Mary Gifted', as shown in Table 2. As a result, Fig. 4 shows that the average delay and outage probability of 'Amazing Mary Gifted' are lower than those of 'Jurassic World'.

Figure 5 shows the comparison between the outage probabilities of I-, P-, and B-frames as a function of the scaling factor (β). Since the loss of an I-frame results in the loss of all frames in the GoP, the outage probability of I-frames is lower than that of P-frames and B-frames. The loss of B-frames is caused by the loss of P-frames; hence, the outage probability of B-frames is higher than that of P-frames. The proposed RLPS method uses a coefficient to scale the length of awake intervals, which increases with the value of the coefficient. Hence, the outage probability of frame transmission decreases as the scaling factor increases, as shown in Figure 5a. Figure 5b shows the average transmission delay as a function of the scaling factor.
As the scaling factor increases, the transmission delay of a frame decreases because the length of awake intervals increases. Figure 5c shows the average energy consumption as a function of the scaling factor. The energy consumption increases as the scaling factor increases because the energy consumption during awake intervals is considerably higher than that during sleep intervals.

Fig. 8 shows the comparison results between the proposed RL and NoA power-saving methods in terms of the average delay of a frame and the frame rate. In this simulation, we assume that the frame rate of the video varies from 16 to 32 frames per second. Here, it is noted that when the number of frame transmissions per second increases, the length of an inter-frame interval decreases, which reduces the time delay for transmitting the residual I-frame and P-frame. Therefore, the average delay of a frame decreases as the number of frame transmissions per second increases, as shown in Fig. 8. Most importantly, the simulation result verifies that the performance of the proposed RL method is better than that of the existing NoA method.

Fig. 9 shows the comparison result of the proposed RL and existing NoA methods in terms of the average delay jitter and delay factor λ. In our study, we assume that the actual frame arrival time may shift forward with a UDP-jitter varying from 3.8 to 4.4 ms [29]. Since the existing NoA power-saving method fixes the start time of the awake interval, the average jitter delay of this method is equal to 4.0993 ms. The delay factor λ here is used to compensate for the random delay jitter. Thus, the delay jitter varies according to the value of λ.
Our proposed method aims to reduce the start time of the awake interval when we increase the value of the delay factor to ensure that the device can wake up before the video frame arrives at the destination. Thus, the delay jitter decreases as the delay factor increases. According to the result illustrated in Fig. 9, we can verify that the delay jitter of the proposed method is less than that of the existing NoA method.

Fig. 10 shows the comparison results between the proposed RL, NoA, and EM methods in terms of the average delay and energy consumption of a frame. The results verify that the performance of the proposed algorithm is better than that of

intervals scheduled for the video frame transmissions in the same class are equal to each other. Second, the study in [20] only focuses on the method to regulate the length of the awake interval without considering the network delay jitter.