A Deep Reinforcement Learning-Based Offloading Scheme for Multi-Access Edge Computing-Supported eXtended Reality Systems

In recent years, eXtended Reality (XR) applications have been widely employed in various scenarios, e.g., health care, education, manufacturing, etc. Such applications are now easily accessible via mobile phones, tablets, or wearable devices. However, these devices normally suffer from constraints in terms of battery capacity and processing power, limiting the range of supported applications or lowering the Quality of Experience. One effective way to address these issues is to offload the computation tasks to edge servers deployed at the network edge, e.g., at base stations or WiFi access points. This communication paradigm, named Multi-access Edge Computing (MEC), was proposed to overcome the long latency caused by the long propagation distances of the traditional cloud computing approach. XR devices, which are limited in computation resources and energy, can then benefit from offloading computation-intensive tasks to MEC servers. However, as XR applications are composed of multiple tasks with a variety of requirements in terms of latency and energy consumption, it is important to decide whether each task should be offloaded to a MEC server or not. This paper proposes a Deep Reinforcement Learning-based offloading scheme for XR devices (DRLXR). The proposed scheme is used to train and derive close-to-optimal offloading decisions while optimizing a utility function that considers both energy consumption and execution delay at the XR devices. The simulation results show how our proposed scheme outperforms its counterparts in terms of total execution latency and energy consumption.


I. INTRODUCTION
The eXtended Reality (XR) applications benefit from the latest developments in 5G and beyond network communications. XR can be defined as the combination of virtual 3D objects with real-world content [1], consumed via smart devices such as handheld smartphones or head-mounted glasses. Depending on the balance between the amount of virtual content and reality, XR is denoted as Augmented Reality (AR), Mixed Reality (MR), or Virtual Reality (VR). However, regardless of the labeling, there is an exponential increase in XR applications in various scenarios, including health care [2], tourism [3], education, and manufacturing. Fig. 1 illustrates a generic XR system, with the following essential components:
- Input sensors acquire information via various types of built-in or companion sensors, such as gyroscopes, location sensors, cameras, etc.
- Processing modules are responsible for processing the collected data, either locally or via offloading to a cloud, fog or edge server, depending on the required computational complexity and the available processing power.
- Outputs refer to post-processing actions that involve the XR content display, including streaming of high-definition video content [5], activating actuators and interacting with external devices. This stage uses head-mounted displays (HMD) [6], handheld displays [7] and/or devices such as haptic gloves, olfaction dispensers [8], etc.

Despite the fast pace of hardware design and development, the mobile devices used for XR applications are still limited in terms of resources in comparison with desktops or servers. The cost of high mobility and reduced size is paid in terms of battery capacity and processing power. On the other hand, due to the complex algorithms used, mostly in relation to video content processing, XR applications require high computational resources. An effective way to cope with the challenge of supporting immersive XR applications on resource-limited mobile devices is to offload the computation via the network to resource-rich devices, such as cloud or edge servers.
Cloud computing has been a successful computing paradigm. Its intrinsic idea is the centralization of computing, storage and network management in the cloud, providing support via data centers, backbone networks and cellular core networks [9], [10]. In order to execute computation in the cloud, the mobile devices and servers are required to operate offloading frameworks, such as MAUI [11] or ThinkAir [12]. However, recently, the functionality of cloud computing has been increasingly moved towards the network edges, closer to user devices [13]. By harvesting the idle computation power and storage space distributed at the network edges, sufficient support is made available for user mobile devices to perform the computation-intensive and latency-critical tasks of XR applications. This principle is behind the Multi-Access Edge Computing (MEC) [14] paradigm, in which mobile devices can communicate with and get support from MEC servers via multiple wireless communication technologies such as LTE, 5G, WiFi, or a combination of them [15]. The general architecture of a MEC system is illustrated in Fig. 2.
In a MEC-enhanced cloud computing context, the challenge remains to decide which XR processing-related tasks are to be offloaded and where, in order to best balance the XR application requirements, on one hand, and the efficient use of device, MEC and cloud computational, storage and network resources, on the other hand. This is not trivial, and diverse solutions have been proposed using heuristic or complex optimization approaches [16].
This paper proposes a Deep Reinforcement Learning-based offloading scheme for XR applications (DRLXR) that distributes the computation between device, MEC and cloud in order to best balance the XR application performance and energy efficiency under given networked system resource constraints.
The contributions of this paper are as follows:
- A three-layer architecture for XR systems is proposed, and the focus is put on the energy-efficient computation offloading problem of minimizing the overall power consumption while satisfying the stringent delay constraints of XR applications.
- The problem is formulated using the Markov Decision Process (MDP) framework and the close-to-optimal offloading decision making is derived via a Deep Reinforcement Learning (DRL) technique. The XR applications are decomposed into small tasks that are represented using graph theory.
- Finally, the proposed DRLXR solution is evaluated using the Network Simulator NS-3 and the OpenAI Gym library, and is benchmarked against other novel offloading schemes.

The rest of this paper is organized as follows: Section II surveys novel offloading schemes found in the research literature. The technical background of Deep Reinforcement Learning (DRL) is discussed in Section III. Section IV provides details about our proposed solution, including the system architecture, problem formulation and the DRL-based offloading algorithm. We evaluate the proposed scheme in a simulation environment and discuss the results in Section V. Finally, the paper is concluded in Section VI.

II. RELATED WORKS
This section discusses some state-of-the-art offloading schemes proposed in the research literature. According to their type of offloading, four main groups of such schemes are considered: i) binary offloading, ii) partial offloading, iii) stochastic model-based and iv) deep learning-based offloading schemes.

A. Binary Offloading
Kumar et al. [17] provided guidelines for making offloading decisions with the aim of minimizing both computation latency and energy consumption for mobile devices in a traditional cloud computing fashion. The key factors considered for offloading include the CPU speed at the mobile devices and cloud servers, the data size, and a fixed rate of the wireless communication links. However, the assumptions made in this paper are not realistic: the channel gain of wireless communication is time-varying, and the CPU power consumption increases in proportion to the CPU cycle frequency. Therefore, adaptive offloading schemes are necessary to overcome such limitations.
The authors of [18] and [19] employed an optimization framework to formulate the offloading decision with the aim of minimizing energy consumption. In [18], the researchers considered multimedia applications, which require tasks to be completed within a deadline with a given probability τ. The offloading decisions are made according to which computation mode (local computing or offloading) incurs less energy consumption. Internet of Things (IoT) systems where sensor nodes are powered using wireless power transfer (WPT) technology are considered in [19]. Alongside reducing the energy consumption, the optimization proposed in [19] also aims to maximize the computation rate of all network nodes.
In reality, mobile applications normally consist of multiple procedures/functions/components, like the components of the XR system illustrated in Fig. 1. In this case, offloading the whole program or performing the execution completely locally, as suggested by binary offloading, is not suitable.

B. Partial Offloading
Partial offloading of tasks refers to the decomposition of one application into two parts: one offloaded to edge servers and the other executed locally at the mobile device. Kao et al. [20] modeled the dependency between different procedures/components of an application using a Directed Acyclic Graph (DAG). Next, the balance between energy consumption and delay is formulated via an optimization equation. Saleem et al. [21] studied the problem of minimizing latency under a local energy constraint, taking into account the limited energy availability at the user, which has a high impact on the data segmentation decision. Despite their manifold benefits, such partial offloading schemes are not examined under time-varying radio communication channels, where poor channel conditions and scarce bandwidth may affect the offloading latency. In such cases, multiuser cooperative edge computing can be considered a promising solution, where proximal devices collaborate with each other to scale up the services. An approach that combines MEC and Device-To-Device (D2D) communications is proposed in [22]. Based on monitoring the interference on the radio communication link, a device can decide to offload task execution to the edge server, to another nearby device, or execute it locally. The authors of [23] proposed a joint solution based on Mixed-Integer Nonlinear Programming (MINLP) that considers the multi-task partial computation offloading and network flow scheduling problems in multi-hop network environments. The output of the proposed optimization problem is a partial offloading ratio.

C. Stochastic Task Model-Based Offloading
Hong and Kim [24], Zhang et al. [25], Zheng et al. [26], and Ren et al. [27] proposed solutions that consider stochastic task models characterized by random task arrivals. In [24], the problem of minimizing the long-term execution cost was solved by jointly optimizing computation latency and energy consumption. The proposed scheme employed a semi-MDP framework to control the local CPU frequency, the modulation scheme and the data rates. Zhang et al. [25] proposed an optimization-based offloading scheme for unmanned aerial vehicle (UAV) systems that aims to minimize the energy consumption subject to constraints on the number of offloaded computational tasks. These tasks were assumed to arrive in a stochastic manner and to be independent and identically distributed (i.i.d.). In [26], the problem of stochastic computation offloading is formulated using the MDP framework and solved via a Q-learning algorithm. A joint solution that combines channel allocation and resource management for making offloading decisions (JCRM), with the aim of maximizing network utility, was proposed in [27]. JCRM leverages the Lyapunov optimization technique to make optimal offloading decisions.

D. Reinforcement Learning-Based Offloading
Since there is limited training data and novel applications appear continually, supervised learning is difficult to use for feature learning. Although unsupervised learning is promising for exploiting the features of network traffic, it is challenging to achieve real-time processing [28]. On the other hand, the reinforcement learning paradigm can be used without having access to a pre-existing training data set, as training can be achieved via direct interaction between the learning agent and the surrounding environment.
Li et al. [29], Hu et al. [30], and Ning et al. [31] made use of reinforcement learning and/or combined it with deep learning to propose diverse offloading schemes for MEC-enhanced Internet of Vehicles (IoV) systems. In [29], Li et al. proposed an online reinforcement learning method that uses feedback and traffic patterns to balance traffic loads. In order to achieve highly efficient traffic management, a joint communication, caching and computing problem was investigated in [30]. The authors of [31] proposed an offloading scheme that addresses the trade-off between energy consumption and delay for IoV systems. An RL-based solution was then employed to derive the offloading strategy for the IoV nodes.
Min et al. [32] considered IoT nodes powered via energy harvesting. The proposed scheme allows IoT devices to select the edge server and the offloading rate based on the current battery level and the previously monitored radio transmission rate. DRL was employed to improve the offloading performance in a highly complex state space.
Wang et al. [33] transformed the original joint computation offloading and content caching problem into a convex problem and then solved it in a distributed and efficient way. Hao et al. [34] considered an offloading problem that takes into account the computing and storage capacity constraints of mobile devices when optimizing the long-term latency. The scheme was formulated using DRL and the proposed solution showed noticeable results in terms of convergence time and latency reduction.
Wang et al. [35] proposed a Meta Reinforcement Learning-based scheme (MRLCO) to provide offloading decisions for User Equipment (UE). Mobile applications are modeled as Directed Acyclic Graphs (DAGs). The authors employ Meta Reinforcement Learning (MRL) to find close-to-optimal offloading decisions for UEs with the aim of reducing latency. UE applications are decomposed into multiple sub-tasks, and for each sub-task it is decided whether it is processed locally or offloaded to a virtual machine at the MEC server. MRLCO outperforms the other baseline algorithms in terms of average latency. The main disadvantage of MRLCO is that it considers neither UE mobility nor energy consumption.
Despite pursuing different avenues, most of the existing works do not consider a holistic approach that takes into account the complexity of the latest applications, such as XR ones. These applications comprise many small tasks and their performance is influenced jointly by network conditions and energy consumption. This gap is bridged in this article.

III. TECHNICAL BACKGROUND
This section briefly discusses the background related to the Markov Decision Process (MDP) and Deep Reinforcement Learning (DRL), the techniques used in the proposed solution.

A. Deep Reinforcement Learning
DRL is a machine learning research area that combines Deep Neural Networks and Reinforcement Learning (RL). Deep learning enables RL to scale to problems that were previously intractable, i.e., environments with high-dimensional state and large action spaces. Some successful applications of DRL include video games, robotics, etc.
In general, DRL can be formulated as a Markov Decision Process (MDP) framework using a tuple $\langle S, A, P, R, \gamma \rangle$, where:
- $S$ is a finite set of states;
- $A$ is a finite set of actions;
- $P$ is a state transition probability matrix;
- $R$ is a reward function;
- $\gamma \in [0, 1]$ is a discount factor.

MDP uses the notion of total expected return $G_t$, i.e., the total discounted reward from time-step $t$, defined as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

A policy $\pi$ in an MDP is a distribution over actions given states, $\pi(a|s) = P[A_t = a \mid S_t = s]$. The goal of the MDP is to derive an optimal policy $\pi^*(a|s)$, i.e., a distribution of actions in the corresponding states, that maximizes the total discounted cumulative reward.
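As a quick illustration of the return defined above, the following sketch computes $G_t$ for every time step of a finite episode; the reward values are arbitrary examples.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t for every t by folding the rewards backwards."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G   # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(G)
    return returns[::-1]    # returns[t] == G_t

print(discounted_return([1.0, 0.0, 2.0]))  # approximately [2.62, 1.8, 2.0]
```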
In general, there are two main approaches to solving RL problems: value function-based and policy search-based methods.
1) Value function methods are based on estimating the value (or expected return) of being in a given state. The state-value function $v_\pi(s)$ is the expected return when starting from state $s$ and following policy $\pi$:

$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$

The optimal policy, denoted as $\pi^*$, has a corresponding state-value function $v^*(s)$, defined as:

$$v^*(s) = \max_\pi v_\pi(s)$$

If $v^*(s)$ is known, the optimal policy can be derived by choosing, among all actions available in state $s_t$, the action $a$ that maximizes $E_{s_{t+1} \sim P(s_{t+1}|s_t, a)}[v^*(s_{t+1})]$. In an RL environment, as the state transition probability matrix $P$ is not available, another function, the state-action value function $q_\pi(s, a)$, is constructed as follows:

$$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$$

The best policy, given $q_\pi(s, a)$, can be found by choosing $a$ greedily in every state: $\arg\max_a q_\pi(s, a)$. Under this policy, the value $v_\pi(s)$ can be derived by maximizing $q_\pi(s, a)$: $v_\pi(s) = \max_a q_\pi(s, a)$.

2) Policy search methods do not maintain a value function model, but directly search for an optimal policy $\pi^*$. In general, a parameterized policy $\pi_\theta$ is chosen, whose parameters $\theta$ are updated to maximize the expected return $E[R|\theta]$ using either gradient-based or gradient-free optimization [36]. Gradient-free methods find the best policy via heuristic search across a predefined class of models. For gradient-based learning, the gradient can be estimated [37].

In order to combine the advantages of the value function and policy search methods, a hybrid solution that employs both, named Actor-Critic [38], was introduced. The Actor-Critic method combines a value function with an explicit representation of the policy, as shown in Fig. 3. The actor (policy) learns by using feedback from the critic (value function). Actor-Critic methods use the value function as the baseline for policy gradients, so the only fundamental difference between the Actor-Critic method and other baseline methods is that the Actor-Critic method utilizes a learnt value function. The advantages of Actor-Critic methods [38] include: i) they require minimal computation for selecting actions in comparison with the other two approaches; ii) they can learn an explicitly stochastic policy, i.e., the optimal probabilities of selecting various actions. Due to these advantages, the Actor-Critic method is employed as the decision maker for XR device task offloading. This is discussed in detail in the next section.
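To make the Actor-Critic learning loop concrete before it is applied in Section IV, below is a minimal sketch of a single TD(0) Actor-Critic update in TensorFlow (the framework used in the evaluation). The simple feed-forward networks, layer sizes and hyperparameters are illustrative assumptions, not the architecture proposed later in the paper.

```python
import tensorflow as tf

STATE_DIM, N_ACTIONS, GAMMA = 3, 3, 0.99  # illustrative dimensions

# Actor outputs pi(a|s); Critic outputs v(s).
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),
])
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(1),
])
actor_opt = tf.keras.optimizers.Adam(1e-3)
critic_opt = tf.keras.optimizers.Adam(1e-3)

def update(state, action, reward, next_state, done):
    """One TD(0) Actor-Critic step: the critic's TD error drives both nets."""
    state = tf.reshape(tf.convert_to_tensor(state, tf.float32), (1, -1))
    next_state = tf.reshape(tf.convert_to_tensor(next_state, tf.float32), (1, -1))
    with tf.GradientTape() as a_tape, tf.GradientTape() as c_tape:
        v_s = critic(state)[0, 0]
        # Bootstrap target is held fixed (semi-gradient TD).
        v_next = tf.stop_gradient(critic(next_state)[0, 0])
        delta = reward + GAMMA * v_next * (1.0 - float(done)) - v_s
        critic_loss = tf.square(delta)                     # push v(s) to target
        log_prob = tf.math.log(actor(state)[0, action] + 1e-8)
        actor_loss = -log_prob * tf.stop_gradient(delta)   # policy gradient step
    actor_opt.apply_gradients(
        zip(a_tape.gradient(actor_loss, actor.trainable_variables),
            actor.trainable_variables))
    critic_opt.apply_gradients(
        zip(c_tape.gradient(critic_loss, critic.trainable_variables),
            critic.trainable_variables))
```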

IV. PROBLEM FORMULATION
This section discusses the proposed offloading scheme. First, the system architecture is described, then the details of the MDP-based problem formulation are provided. Finally, the DRL-based offloading scheme is introduced in detail. All the abbreviations used in this paper are included in Table I.

A. System Architecture
The general architecture of the MEC-enhanced network system is considered to consist of three levels: core network, edge network, and XR devices, as illustrated in Fig. 4.
The Operations Support System (OSS) and the Multi-Access Edge Orchestrator (MEO) are located at the top, core network level. The OSS block is responsible for receiving requests from customers, deciding on granting these requests, and sending them to the MEO. The MEO maintains an overall view of the MEC-based system, knowing the available resources, services and deployed MEC hosts, and it also monitors the topology. The MEO also selects the best hosts on which to deploy an application, considering the available resources, service availability and constraints such as latency.
At the Edge Network level, the major components are the MEC Server and the MEC Platform. The latter is responsible for managing the life cycle of both applications and MEC platforms, informing the MEO if any relevant event happens. Finally, at the bottom level are the XR devices, which run computation-intensive applications, such as deep learning-based object detection, 360° video streaming, etc., and need to offload some tasks to MEC servers.
Next, the block diagram of the MEC server and XR devices is illustrated in Fig. 6.
- At the MEC server: the Data Aggregation block collects the requests of all devices from the Radio Transmission Units in the vicinity and then feeds them into the Traffic Management block. The Traffic Management block manages all the Virtual Machines (VM) and the resources assigned to the corresponding mobile devices' requests. All requests are then processed and the responses are sent back to the XR devices via the Remote Execution Service block. The MEC server also has connections to remote cloud servers, but in the scope of this paper, we ignore the effect of such communications.

1) Multitasking Application Modelling:
In this paper, we assume that an XR device executes a resource-hungry multitasking XR application by offloading some sub-tasks to the MEC server. Such offloading decisions aim to minimize the device's energy consumption, while the predefined stringent completion time requirements of the application are met.
A multitasking application can be decomposed into a set of fine-granularity atomic non-preemptive tasks. We use a Directed Acyclic Graph (DAG) to formulate the dependencies between these tasks. Denote $G = (V, E)$ as the multitasking structure, where $V$ is the set of tasks and $E$ refers to the dependencies. The total number of tasks of the application is $N = |V|$.
Depending on how developers model the applications [39], [40], there are, in general, three types of multitasking DAG: i) Sequential, ii) Parallel, and iii) General dependencies. Due to their simplicity, the Sequential and Parallel models cannot reflect the complexity of the dependencies between the sub-tasks of an XR application. Therefore, in this paper, we consider a general dependency model for XR applications, as illustrated in Fig. 7. Each node from 1 to $N = |V|$ represents a computation task of the application that can be executed locally or offloaded to the MEC server. Normally, for an XR application, the first and last tasks (i.e., 1 and $N$), which receive the I/O data and display the final results on the device screen, respectively, must be executed at the XR device. The XR device decides for each of the remaining tasks (i.e., from 2 to $N - 1$) whether it is offloaded or executed locally. In Fig. 7, the tasks offloaded to the MEC server are highlighted in blue whereas the pink ones refer to tasks executed locally at the XR device.
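To illustrate this task model, the sketch below builds a small general-dependency DAG with the networkx Python library; the five-task topology is a made-up example, not the exact graph of Fig. 7.

```python
import networkx as nx

# Hypothetical 5-task XR application: task 1 (input) and task 5 (display)
# are pinned to the device; tasks 2-4 are candidates for offloading.
G = nx.DiGraph()
G.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)])

assert nx.is_directed_acyclic_graph(G)
N = G.number_of_nodes()                  # N = |V|
offload_candidates = [v for v in G.nodes if v not in (1, N)]

# A task may start only after all of its predecessors have finished,
# so any feasible execution order is a topological order of G.
print(list(nx.topological_sort(G)))      # e.g. [1, 2, 3, 4, 5]
print(offload_candidates)                # [2, 3, 4]
```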
2) Energy Consumption Model: In general, the energy consumption of a mobile device can be decomposed into four parts:
- the energy consumed by the local CPU due to local processing, denoted as $E_{processing}$;
- the energy consumed by the wireless network interface when uploading the source code and data of offloaded tasks to remote servers, denoted as $E_{up}$;
- the energy consumed by the wireless network interface when downloading task execution results from MEC servers, denoted as $E_{down}$;
- the energy consumed by the wireless network interface when it is in idle mode, which is enabled while the mobile device waits for the execution of offloaded tasks, denoted as $E_{idle}$.

Using the model from [40], [41] and following the previous considerations, the energy consumption of task $t$ is derived as follows:

$$E_t = E_{processing} + E_{up} + E_{down} + E_{idle}$$

In case task $t$ is executed locally, we have $E_{up} = E_{down} = 0$. By summing up, the total energy consumption $E$ of an application with $n$ tasks is:

$$E = \sum_{t=1}^{n} E_t$$

3) Completion Time: When the computation is executed locally, it utilizes the computing resources of the mobile device, including CPU, memory, storage, battery capacity, etc. Denoting the CPU cycle frequency as $f_m$, the task input-data size as $L$ (bits) and the computation workload/intensity as $X$ (CPU cycles per bit), the execution latency of local processing for task $t$ is:

$$T_t^{local} = \frac{L X}{f_m}$$

For a task that is offloaded to the MEC server, with $R$ denoting the data rate of the radio link, the time spent on transferring the data is calculated as follows:

$$T_t^{trans} = \frac{L}{R}$$

The completion of an application is achieved when the final task $n = |V|$ is executed. We use $T$ to refer to the processing duration of all application tasks, plus the transmission time to/from the MEC server:

$$T = \sum_{t=1}^{n} \left[ (1 - x_t) T_t^{local} + x_t \left( T_t^{trans} + T_t^{MEC} \right) \right]$$

where $T_t^{MEC}$ is the remote execution latency of task $t$ and $x_t$ denotes the offloading decision at time $t$: $x_t = 1$ refers to offloading the task at time $t$ to a MEC server, and $x_t = 0$ indicates local task execution at the XR device.
In order to meet the strict deadline $\tau_{max}$, the following condition must hold: $T \leq \tau_{max}$. The utility function that takes into account both the energy consumption and the completion time is then derived as follows:

$$U(t) = w_E \tilde{E}(t) + w_T \tilde{T}(t)$$

where $\tilde{E}(t)$ and $\tilde{T}(t)$ are the energy consumption and completion time values after normalization, and $w_E$ and $w_T$ are weighting factors balancing the two objectives.
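To show how these models fit together, here is a minimal numerical sketch of the per-task energy/latency trade-off and the normalized utility above; all device parameters (CPU frequencies, data rate, power draws) and the weighted-sum form are illustrative assumptions, not values from the paper.

```python
# Hypothetical per-task parameters (illustrative values only).
L = 8e6          # task input size [bits]
X = 50.0         # computation intensity [CPU cycles per bit]
f_local = 1.5e9  # device CPU frequency f_m [Hz]
f_mec = 10e9     # MEC server CPU frequency [Hz]
rate = 100e6     # radio link data rate R [bit/s]
P_CPU, P_TX, P_IDLE = 2.0, 1.3, 0.1   # device power draw [W]

def task_cost(offload: bool):
    """Return (energy [J], latency [s]) of one task under either decision."""
    if not offload:                          # local execution: E_up = E_down = 0
        t_local = L * X / f_local            # T_local = L * X / f_m
        return P_CPU * t_local, t_local
    t_up = L / rate                          # T_trans = L / R
    t_exec = L * X / f_mec                   # remote execution on the MEC CPU
    energy = P_TX * t_up + P_IDLE * t_exec   # transmit, then wait in idle mode
    return energy, t_up + t_exec

def utility(E, T, E_max, T_max, w_E=0.5, w_T=0.5):
    """Weighted sum of normalized energy and completion time (lower is better)."""
    return w_E * E / E_max + w_T * T / T_max

E_l, T_l = task_cost(offload=False)
E_o, T_o = task_cost(offload=True)
E_max, T_max = max(E_l, E_o), max(T_l, T_o)
print(f"local:   E={E_l:.3f} J, T={T_l:.3f} s, U={utility(E_l, T_l, E_max, T_max):.2f}")
print(f"offload: E={E_o:.3f} J, T={T_o:.3f} s, U={utility(E_o, T_o, E_max, T_max):.2f}")
```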

C. DRL-Based Offloading Algorithm Design
This section presents the algorithm of the DRL-based offloading scheme for XR devices. First, the problem formulation is described; it employs the Markov Decision Process (MDP) framework, as follows.
1) STATE SPACE: The state space of the agent (located at the XR device) includes all possible observations. Each observation is specified by a tuple $\langle P, E, C \rangle$, where:
- $P = \{0, 1, \ldots, N\}$ denotes the set of application sub-tasks, specified as single-chain applications with $N$ being the number of tasks;
- $E$ denotes the remaining energy of the XR device (expressed as a percentage);
- $C$ refers to the Channel State Information (CSI) monitored in the current state.

2) ACTION SPACE: The action space incorporates the $|A|$ actions available to the agent in a given state. We define the action space with three values, $A = \{0, 1, 2\}$, where 0 and 1 denote local computing and offloading to the MEC server, respectively, and 2 indicates that the device is in the idle/waiting state.
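A minimal sketch of how this MDP interface could be declared with the OpenAI Gym API used in Section V follows; the number of sub-tasks and the CSI bounds are illustrative assumptions, and the transition logic is left as a placeholder.

```python
import gym
from gym import spaces
import numpy as np

N_TASKS = 10   # assumed number of sub-tasks N (illustrative)

class XROffloadEnv(gym.Env):
    """Skeleton environment exposing the <P, E, C> observation tuple."""

    def __init__(self):
        # Observation <P, E, C>: current sub-task index, remaining energy (%),
        # and a scalar CSI reading (e.g., a normalized RSSI value).
        self.observation_space = spaces.Box(
            low=np.array([0.0, 0.0, -1.0], dtype=np.float32),
            high=np.array([N_TASKS, 100.0, 1.0], dtype=np.float32))
        # Actions: 0 = local computing, 1 = offload to MEC, 2 = idle/waiting.
        self.action_space = spaces.Discrete(3)
        self.state = None

    def reset(self):
        self.state = np.array([0.0, 100.0, 0.0], dtype=np.float32)
        return self.state

    def step(self, action):
        # Placeholder transition: a real backend (e.g., NS-3 bridged to Gym)
        # would update task progress, energy and CSI here and compute the
        # reward from the utility function of eq. (11).
        reward = 0.0
        done = bool(self.state[0] >= N_TASKS)
        return self.state, reward, done, {}
```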

3) REWARD FUNCTION:
The reward signal, which provides the feedback of the chosen action in a specific state, is calculated using eq. (11). Fig. 8 illustrates the Long Short-Term Memory (LSTM) Actor-Critic (AC)-based architecture used for solving the MDP. LSTM is a powerful artificial neural network architecture that is widely used in prediction and classification, e.g., on time series data [42]. In this paper, LSTM is used to learn the temporal regularity of the states in terms of RSSI, energy consumption and application sub-task status under device mobility. Details of the LSTM AC-based architecture are described next.
- The Representation network incorporates a fully connected (FC) layer and an LSTM layer. This network is responsible for detecting the temporal correlation of states. The FC layer takes the buffer B as input and then feeds the extracted feature tensor to the LSTM layer. The output of the LSTM layer captures the variation regularity of the states from the last T observation vectors in the buffer. After T updates, the last LSTM cell outputs a complete representation of the environment, $h_t$, which is then used as input for both the Actor and the Critic networks.
- The Actor network comprises one FC layer that takes the output of the representation network and generates actions for the current state via a Softmax function. The output of the Softmax function is a probability distribution $\pi(a_t|s_t)$ over the available actions. The taken action is then sampled following $\pi(a_t|s_t)$.
- The Critic network estimates the value of the current state and incorporates two FC layers. The first FC layer takes $h_t$ from the representation network and extracts value-related features. Then, the second FC layer outputs the estimated state value $V(s_t)$.

Algorithm 1 presents the DRLXR scheme in detail, where $\theta$ and $w$ are the Actor and Critic network parameters, respectively. We use a buffer B of length T to concatenate a series of states to be fed into the LSTM layer. We initialize the buffer via a loop with T iterations that takes a series of states into B. At the beginning of each loop, all states in the buffer B are concatenated and fed into the representation network. The output $h_t$ is then used as input for both the Critic and Actor networks. The action $a_t$ is taken via sampling from the output of the Actor network and the next state is then appended to the buffer B. The output of the Critic network is the estimated value $V(s_t)$. Next, the agent concatenates the data from buffer B again to form another input $s_{t+1}$, and the value $V(s_{t+1})$ is estimated from the output of the Critic network. We calculate the Temporal Difference (TD) error as $\delta = r_t + \gamma V(s_{t+1}) - V(s_t)$. If $\alpha_A$ and $\alpha_C$ are the learning rates of the Actor and Critic networks, respectively, the parameters $\theta$ of the Actor network and $w$ of the Critic network are updated according to:

$$\theta \leftarrow \theta + \alpha_A \delta \nabla_\theta \log \pi(a_t|s_t, \theta) \quad (12)$$

$$w \leftarrow w + \alpha_C \delta \nabla_w v(s_t, w) \quad (13)$$
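The following is a minimal TensorFlow sketch of the three-block network shape described above: an FC + LSTM representation network over the buffer B, a Softmax Actor head, and a two-FC-layer Critic head. The layer widths and the buffer length T are illustrative assumptions, not the exact sizes used in the paper.

```python
import tensorflow as tf

T_BUF, OBS_DIM, N_ACTIONS = 8, 3, 3   # illustrative sizes

# Representation network: an FC feature extractor applied per observation,
# followed by an LSTM that summarizes the last T_BUF states from buffer B
# into the environment representation h_t.
obs_seq = tf.keras.Input(shape=(T_BUF, OBS_DIM))
features = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(32, activation="relu"))(obs_seq)
h_t = tf.keras.layers.LSTM(32)(features)

# Actor head: one FC layer with Softmax producing pi(a_t | s_t).
policy = tf.keras.layers.Dense(N_ACTIONS, activation="softmax")(h_t)

# Critic head: two FC layers producing the state value V(s_t).
value_feat = tf.keras.layers.Dense(32, activation="relu")(h_t)
value = tf.keras.layers.Dense(1)(value_feat)

model = tf.keras.Model(inputs=obs_seq, outputs=[policy, value])
model.summary()
```

The shared representation network means a single forward pass yields both the action probabilities sampled by the Actor and the value estimate used to compute the TD error of eqs. (12)-(13).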

V. PERFORMANCE EVALUATION
This section discusses the validation of our proposed scheme in a simulation environment under different test scenarios.

A. Experimental Setup
We build our testing environment in the Network Simulator NS-3 [43]. Then, we implement the Actor-Critic model on TensorFlow 2.4 and train the agent using the OpenAI Gym [44] framework. The testing computer runs Ubuntu Linux 18.04 LTS and has 32 GB of memory and an Intel Core i7 6th generation processor. In this testing, there is no need to use a GPU for training. Fig. 9 illustrates the network topology employed for testing. We assume that a number of mobile XR devices move around an area at walking speed, under the coverage of several MEC servers. Fig. 10 [45] illustrates the computation components of an XR application. The functionality of the major components is briefly introduced next.
- Video Source fetches video frames from the camera hardware.
- Renderer renders an overlay on the screen.
- Tracker processes the camera frames and estimates the camera position with respect to the world based on a number of visual feature points. The more feature points are used, the more stable the tracking; increased feature points also make tracking the camera more robust during sudden movements.
- ObjectRecognizer tries to recognize known objects in the world and notifies the Renderer of their 3D position when found.

Depending on the latency requirements and the current energy consumption situation, the XR device can decide whether each component is executed locally or offloaded to the MEC server. For example, the Tracker, Mapper and ObjectRecognizer components can be offloaded to the MEC server whereas the Video Source and Renderer computations are executed locally, as illustrated in Fig. 10. Based on the relations between components, we built the DAG-based dependency model illustrated in Fig. 11 and sketched below. We assume that multiple applications run in parallel on an XR device. The simulation setup details are summarized in Table II.
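As a companion to Fig. 11, a possible encoding of this component dependency model is sketched below; the edges are inferred from the component descriptions above, so the exact graph may differ from the one used in the simulations.

```python
import networkx as nx

# Dependency model of the XR components named above (edges inferred from
# the description; the exact topology of Fig. 11 may differ).
app = nx.DiGraph()
app.add_edges_from([
    ("VideoSource", "Tracker"),
    ("Tracker", "Mapper"),
    ("Tracker", "ObjectRecognizer"),
    ("Mapper", "Renderer"),
    ("ObjectRecognizer", "Renderer"),
])

# Per the example split: middle components may go to the MEC server,
# while frame capture and display stay on the device.
offloadable = {"Tracker", "Mapper", "ObjectRecognizer"}
local_only = set(app.nodes) - offloadable
print(local_only)  # {'VideoSource', 'Renderer'}
```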
In order to evaluate and compare our proposed scheme against other algorithms, we use the following metrics:
- Average energy consumption (in Joules) across all devices
- Average total completion time of tasks

We compare our proposed solution, DRLXR, with the following baseline algorithms:
- No-Offloading scheme (NO) [46]: all tasks are handled locally at the devices and all data is received from the network.
- Greedy policy (Greedy): each task is greedily assigned to the XR device or a MEC server based on its estimated completion time.
- Q-Learning method (Q-Learning) [47]: a traditional temporal difference algorithm, which always pursues the largest reward in the next time step. Q-Learning records the rewards at each iteration, so when the system state or action spaces are large, this solution tends to use large amounts of memory.
- Dynamic RL Scheduling (DRLS) [48]: a reinforcement learning-based offloading scheme that combines both D2D and MEC systems.

B. Results Discussion
In all cases, the energy consumption and total completion time of the No-Offloading (NO) scheme are unchanged due to the local execution. We consider this case as the baseline against which the other schemes are compared. Fig. 12 illustrates the average energy consumption for different offloading data sizes. We observe that the energy consumption of XR devices grows with the offloaded data size, due to the energy used for transmitting and receiving data over the radio link. When the offloading data size is small (less than 40 MB), the average energy consumption of all schemes is similar (with slight differences only). At the breaking point of 80 MB, the Greedy method results increase sharply. Although the other schemes perform more stably, DRLXR obtains the best results, with a lower energy consumption of about 150 × 10^6 Joules, in comparison with 160 × 10^6 Joules and 177 × 10^6 Joules for the DRLS and Q-Learning methods, respectively.

Fig. 13 and Fig. 14 present the average energy consumption and the average total completion time for different numbers of MEC servers. With 8 MEC servers, the average energy consumption values of Greedy, Q-Learning and DRLS are around 77 × 10^6 Joules, 75 × 10^6 Joules and 63 × 10^6 Joules, respectively, whereas the result of DRLXR is around 60 × 10^6 Joules. A similar situation also occurs at about 8 MEC servers and above for the average total completion time results. Starting from 130 s, the average total completion time of all schemes decreases and stabilizes at 70 s, 68 s, 62 s and 60 s for Greedy, Q-Learning, DRLS and DRLXR, respectively.

The following reasons explain the benefits of DRLXR in comparison with the alternative solutions. In DRLS, XR devices can offload computation to other peers via D2D communications, which leads to a higher total energy consumption. Q-Learning, on the other hand, does not specify an exploration mechanism but acts in a greedy manner and requires all actions to be tried infinitely often in all states; such a mechanism has lower accuracy when making offloading decisions. Unlike them, DRLXR employs the Actor-Critic method, which specifies a full exploration mechanism via the action probabilities of the Actor. In addition, DRLXR is trained on historical data, which leads to a higher accuracy of the offloading decisions.
Finally, the total completion time for different numbers of XR devices is shown in Fig. 15. The number of mobile devices at each MEC server is randomly generated following a uniform distribution, and the average total completion time is used as the performance indicator. The Greedy and Q-Learning methods' results are similar to those of the NO scheme for 50 XR devices, with a completion time of around 127 s. The DRLS completion time increases at a lower speed due to the possibility of data exchange with other D2D peers. However, due to their limited computation capabilities, the XR devices that receive the offloaded computation from their peers cannot process the large amounts of data (characteristic of XR applications) in a timely manner. On the contrary, XR devices in DRLXR make offloading decisions with higher accuracy than Q-Learning, all highly computation-intensive tasks are guaranteed to be processed at MEC servers, and the stringent latency requirements are met. As a consequence, the DRLXR completion time increases at a lower pace and outperforms its counterparts.

VI. CONCLUSION
This paper proposed and designed the Deep Reinforcement Learning-based Offloading scheme for XR devices (DRLXR) in the context of a MEC-enabled network environment. A hierarchical network architecture with three levels is considered. The task offloading problem at the XR device is formulated using DRL. Based on the data monitored at the XR devices, including radio signal quality, energy consumption and the status of the running application, the devices employ an Actor-Critic method for training and for making task offloading decisions. The proposed DRLXR scheme was evaluated in a simulation environment and compared against other offloading methods. The simulation results show how DRLXR outperforms the other solutions in terms of average energy consumption and total completion time.
Future work will focus on a joint solution that combines the proposed offloading scheme with resource management at the MEC server under heterogeneous QoS requirements.