Edge-Enabled Two-Stage Scheduling Based on Deep Reinforcement Learning for Internet of Everything

Nowadays, the concept of the Internet of Everything (IoE) has become a hotly discussed topic and plays an increasingly indispensable role in modern intelligent applications. These applications are known for their real-time requirements under limited network and computing resources, so transforming and computing a tremendous amount of raw data in a cloud center becomes a highly demanding task. The edge-cloud computing infrastructure allows a large amount of data to be processed on nearby edge nodes, so that only the extracted and encrypted key features are transmitted to the data center. This offers the potential to achieve end-edge-cloud-based big data intelligence for IoE in a typical two-stage data processing scheme while satisfying a data security constraint. In this study, a deep-reinforcement-learning-enhanced two-stage scheduling (DRL-TSS) model is proposed to address the NP-hard problem, in terms of operation complexity, in end-edge-cloud Internet of Things systems; the model allocates computing resources within an edge-enabled infrastructure to ensure that computing tasks are completed at minimum cost. A presorting scheme based on Johnson's rule is developed and applied to preprocess the two-stage tasks on multiple executors, and a DRL mechanism is developed to minimize the overall makespan based on a newly designed instant reward that takes into account the maximal utilization of each executor in edge-enabled two-stage scheduling. The performance of our method is evaluated against three existing scheduling techniques, and experimental results demonstrate that the proposed algorithm achieves better learning efficiency and scheduling performance, with a 1.1-approximation to the targeted optimum in IoE applications.


I. INTRODUCTION
THE ADVANCEMENT of communication technologies has led to the widespread adoption of edge computing and Internet of Things (IoT)-enabled devices. These interconnected devices are capable of real-time data collection, processing, and communication with an edge server, and they form the foundation of modern intelligent services [1], [2]. While these smart devices provide unprecedented benefits to our daily lives, the amount of collected data and the communication requirements are also growing dramatically. The growing quantity of devices, data, and security requirements all rely on stable and efficient computation and communication to ensure the timely transfer of collected data in a secure manner [3], [4].
The growing penetration of IoT technologies is reflected in several different sectors, such as consumer electronics, healthcare, and industrial automation [5], [6], forming the so-called Internet of Everything (IoE). Extending from ordinary IoT applications, applications relying on cloud-edge infrastructure are facing more challenges in terms of efficient resource allocation to ensure reliable and optimal task completion time across the entire distributed system [7]- [9]. It is of critical importance for modern industrial systems to support multiple heterogeneous applications (e.g., multiple production lines), and make efficient use of the available edge servers to complete all tasks in time. How to arrange these tasks with sequential operations on multiple executors distributed on edge servers becomes the key problem investigated in edge computing environments with smart IoT devices.
Different from other existing methods, we focus on a novel methodology that pursues optimal two-stage scheduling to ensure a more efficient use of the available computing and communication resources in a typical multiflowline production system implemented through the concept of IoE. The diversity of multiple production lines and their corresponding application requirements makes scheduling an NP-hard problem. In this article, a reinforcement-learning-based heuristic scheduling method, named deep-reinforcement-learning-enhanced two-stage scheduling (DRL-TSS), is proposed to support more efficient data-intensive task allocation and execution in an end-edge-cloud infrastructure. An integrated deep learning framework, which consists of two basic modules, a Johnson's-rule-based presorter and a DRL-based scheduler, is designed to cope with the makespan minimization problem with multiple executors. Two algorithms are then developed to realize Johnson's-rule-based task presorting and reinforcement-learning-based two-stage scheduling, respectively. The major contributions are summarized as follows.
1) A unique two-stage scheduling problem is addressed and modeled to support the operation complexity in the end-edge-cloud IoT system, aiming to improve the efficiency of computing resources shared by devices from multiple heterogeneous applications.
2) A Johnson's rule-based presorting scheme is designed and applied to preprocess the two-stage tasks on multiple executors, which can effectively avoid the worst situation in the targeted NP-hard scheduling problem and enhance the overall efficiency of the proposed reinforcement-learning-based scheduling.
3) A DRL mechanism is developed to minimize the overall makespan in edge-enabled two-stage scheduling, in which the action value function is improved based on a newly designed instant reward that considers the maximal utilization of each executor for data-intensive tasks in IoE applications.

The remainder of this article is organized as follows. Section II compares the state-of-the-art techniques in reinforcement learning, followed by the scheduling approaches targeting IoE applications. Section III addresses the application scenario and problem formulation. Section IV discusses the proposed reinforcement-learning-based scheduling method with detailed mechanisms. We introduce the experimental design and evaluate the performance of the proposed method against existing scheduling control approaches in Section V. Section VI concludes this study and gives promising perspectives regarding future research.

II. RELATED WORK
Several topics relating to this study, including reinforcement learning models in IoT systems and task scheduling algorithms for IoT applications, are studied and analyzed, respectively, in this section.

A. Reinforcement Learning in IoT Systems
Recently, the development of reinforcement learning has become one of the most attractive directions in machine learning and AI for modern large-scale and complex network applications, such as heterogeneous networks, the Internet of Vehicles (IoV), and IoT [10]-[12]. Jiang et al. [13] constructed a reinforcement-learning-based framework to optimize the number of served IoT devices for resource configuration in NarrowBand IoT networks. They designed an action aggregation method based on the deep Q-network to improve the convergence capability in multiparameter and multigroup scenarios using a multiagent learning strategy. Wang et al. [14] proposed a mobile-IoT-based multimodal reinforcement learning service framework, in which an action-aware transition tensor was utilized for heterogeneous data fusion, and a Markov decision model was applied to enhance the multimodal reinforcement learning process with the optimal tensor policy. Considering multihop ad hoc networks with IoT devices, Kwon et al. [15] built an autonomous network in which each IoT device was viewed as a decision-making agent based on the Markov decision process. They maximized the estimated cumulative future reward in a deep neural network to improve the learning process with minimal transmission power consumption. Nassar and Yilmaz [16] combined the Markov decision process with the reinforcement learning model to solve the adaptive resource allocation problem. They investigated and compared the performances of four reinforcement learning schemes in optimizing fine-grained decision-making policies for IoT applications in fog radio access networks. Camelo et al. [17] focused on parallel reinforcement learning and developed a partitioning algorithm to optimize communications for reinforcement-learning-based IoT applications in distributed environments.
They employed a local affinity policy to improve the reinforcement learning algorithm with a dynamic partitioning scheme in a heuristic co-allocation process. Xiong et al. [18] formulated the resource allocation problem as a Markov decision process in IoT edge computing systems. They employed DRL to improve the resource allocation policy, in which the Q-network was redesigned based on the multiple replay memories to improve the training process. Ivoghlian et al. [19] introduced a deep Q-network-based multiagent framework for automatic network management targeting typical LoRaWAN-based IoT networks.

B. Task Scheduling for IoT Applications
In recent years, with the prevalence of mobile and edge computing, scheduling algorithms have become an important technique for task management and resource allocation in IoT-assisted applications. Leithon et al. [20] designed a framework to optimize task scheduling within off-grid IoT nodes. They proposed an online scheduling strategy based on mixed linear programming with a sorting-based mechanism, which results in lower computational complexity. Lee and Lee [21] developed a hybrid algorithm to deal with centralized resource and task scheduling issues, which aimed to minimize the average on-grid energy consumption while satisfying the minimum average data rate requirement on each IoT device based on distributed task scheduling. Qi et al. [22] focused on IoV, one specific application of IoT in autonomous driving, and applied DRL to parallel multitask scheduling. They proposed a model-free scheduling method to improve the multitask learning problem by assigning parallel tasks with different computing resources. Other studies, such as [25]-[27], developed algorithms to tackle the energy efficiency problem for IoT applications. Specifically, He et al. [25] formulated the offload task scheduling problem as a constrained Markov decision process. They developed a deep Q-learning-enhanced algorithm to maximize the long-term average reward, in order to tackle the cost-constrained task scheduling problem. Shan et al. [26] presented a two-step scheduling method to reduce the energy consumption when offloading task data in transparent-computing-empowered IoT devices. Different from these existing approaches with two-stage scheduling and the Markov feature, our work focuses more on the concurrent computing makespan for IoT tasks distributed on multiple executors.

III. TWO-STAGE SCHEDULING IN END-EDGE-CLOUD IOT SYSTEMS
In this section, to explain the proposed DRL-TSS model, we first describe the application scenario of a typical end-edge-cloud IoT system and then introduce an overview of the reinforcement learning scheme incorporated with Johnson's rule.

A. Application Scenario
The explosive growth of IoT brings great challenges for many industrial IoT deployments, ranging from latency, network bandwidth, to reliability and security. Edge computing, which is a distributed information technology architecture, is playing a more and more important role in addressing those challenges. The client data is stored at and processed by devices at the edge of networks rather than the central cloud data center for lower latency and better responsiveness. The processed results are then transferred to the cloud center as needed.
In a typical industrial IoT system, such as a workplace safety surveillance system, edge computing can combine and analyze data from on-site cameras, employee safety devices, and various other end devices to help businesses oversee workplace conditions or ensure that employees follow established safety protocols. As illustrated in Fig. 1, the workplace safety surveillance system is expressed as a three-layer end-edge-cloud system. The end layer is composed of various sensors and end devices (e.g., cameras). The video streams captured by cameras are generated continuously and then transmitted to the edge layer through LAN/WAN. The edge layer is made up of several edge nodes, which are responsible for real-time data processing, data caching, filtering, basic analytics, and M2M communication. Under our scenario, each edge node is regarded as an AI box, which provides a two-stage operation for this system. The two-stage operation includes the stage 1 data processing, which splits the video stream into several segments and keeps only the key image frames containing the important information, and the stage 2 data transmission, which caches and transmits the processed data (the key image frames) to the next layer. The cloud layer is a cloud data center that is in charge of big data processing such as safety inspection or other high-level applications. The scheduling controller deployed in this layer is responsible for conducting the task arrangement across the entire distributed system.
Specifically, in the edge layer, each AI box is regarded as an executor and a video stream from a camera in the end layer is regarded as a task. Under the LAN/WAN network environment, tasks (video streams) are assigned to multiple executors (AI boxes) under certain rules and are processed in a two-stage operation described previously. As captured by different kinds of cameras, video streams vary in quality, size, and length and contain different volumes of key image frames that need to be transmitted. Therefore, the processing time and transmission time for each task also vary correspondingly. To finish all tasks with a minimum time, the scheduling controller in the cloud layer needs to formulate a schedule that instructs each individual AI box to retrieve and execute the required tasks and achieves an optimal overall task completion time.
Suppose there are n two-stage tasks executed on m executors in the scheduling problem. Each task that arrives at the scheduling controller is allocated to an executor by the controller. Tasks in this scenario have different durations but contain the same two-stage operations, i.e., data processing and data transmission, where the data transmission operation must be performed after the completion of the data processing operation. According to this working mechanism, a set of assumptions is given below for the scheduling problem: 1) for each task, the data processing operation must be completed before the transmission operation can start; 2) each executor can execute a processing operation and a transmission operation from different tasks simultaneously but can execute only one operation of a specific task at a time; 3) both operations of a task are executed and completed by the same executor; and 4) tasks cannot be preempted in this study.

B. Problem Formulation
In this article, we consider a two-stage scheduling problem of n tasks J = {J_1, J_2, . . . , J_n} scheduled to m executors E = {E_1, E_2, . . . , E_m}. Each task J_i consists of two operations, where O_i1 and O_i2 represent the processing and transmission operations, respectively, and d_i1 and d_i2 are the corresponding durations of the two operations.
The goal of the scheduling algorithm is to find a feasible policy to allocate all the tasks to time intervals on the executors that minimize the total completion time, or makespan, denoted as c max . According to [28], we consider the two-stage scheduling problem as a typical multiprocessor flow-shop scheduling problem with the goal of minimizing the makespan.
Given a task J_i, we define T_i1 and T_i2 as the beginning times of O_i1 and O_i2, and c_i1 and c_i2 as the corresponding completion times. Accordingly, the completion time constraints can be written as c_i1 = T_i1 + d_i1 and c_i2 = T_i2 + d_i2, with T_i2 ≥ c_i1. A feasible two-stage schedule is illustrated as an example in Fig. 2.
As illustrated in Fig. 2, the makespan c_max equals the completion time of the last executed task J_8, which is assigned to executor E_3.
In general, the makespan c_max is the maximum completion time over the m executors. Suppose task J_l is the last completed task; the makespan for the schedule can then be expressed as c_max = max_{1≤i≤n} c_i2 = c_l2. Therefore, the optimization goal is to minimize c*_max, subject to T_i2 ≥ c_i1, i = 1, 2, . . . , n, which means that the starting time of the second operation of each task must be later than the completion time of its first operation.
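To make the timing model concrete, the following sketch (illustrative only; the function names are ours, not from the article) computes the stage completion times and makespan for a task sequence on a single executor, using the two-machine flow-shop recurrence implied by the assumptions in Section III-A: stage-1 operations run back to back, and each transmission starts only once both its own processing has finished (T_i2 ≥ c_i1) and the previous transmission has ended.

```python
def stage_completion_times(durations):
    """Completion times (c_i1, c_i2) for a sequence of two-stage tasks
    on one executor. durations is a list of (d_i1, d_i2) pairs.
    Processing is serial; transmission of task i waits for its own
    c_i1 and for the previous task's transmission to finish."""
    c1 = c2 = 0.0
    out = []
    for d1, d2 in durations:
        c1 = c1 + d1              # processing runs sequentially
        c2 = max(c1, c2) + d2     # transmission waits for c_i1 and the channel
        out.append((c1, c2))
    return out

def makespan(durations):
    """c_max on one executor = completion time of the last transmission."""
    times = stage_completion_times(durations)
    return times[-1][1] if times else 0.0
```

For example, two tasks with durations (3, 2) and (1, 4) finish at time 9 on a single executor.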

IV. PROPOSED DRL-TSS SCHEDULING METHOD

A. DRL-TSS Framework
The key objective of a two-stage scheduling algorithm is to assign tasks to executors in an optimal sequence, so as to ensure the minimal completion time. The basic framework of the proposed DRL-TSS is shown in Fig. 3.
Specifically, the proposed DRL-TSS model is constructed to handle n tasks on m executors and includes two main modules: 1) a Johnson's rule-based presorter and 2) a DRL-based scheduler. Since the makespan minimization problem that schedules a set of two-stage tasks on multiple executors has been proved to be NP-hard, DRL is utilized as a heuristic approach to obtain an approximate solution for the investigated two-stage scheduling problem. However, to avoid the worst situation occurring in the scheduling process, as shown in Fig. 3, all the tasks are preliminarily sorted into a specific Johnson's order while waiting for further scheduling. A scheduling controller is then introduced to observe the system state and make scheduling decisions in the end-edge-cloud environment. Based on the newly designed DRL scheme, which considers the maximal utilization of each executor, the controller outputs the scheduled action from the probabilistic transition according to the received cumulative rewards.

Algorithm 1: Johnson's Rule-Based Presorting
1: Initialize two task groups G1 = ∅, G2 = ∅, and the output list J = ∅
2: for each task J_i do
3:     if d_i1 ≤ d_i2, add J_i to G1; otherwise, add J_i to G2
4: end for
5: Sort all tasks in G1 in ascending order of the duration d_i1
6: Sort all tasks in G2 in descending order of the duration d_i2
7: Merge the two task lists by appending G2 behind G1: J = G1 ∪ G2
8: return J

B. Johnson's Rule-Based Presorting
Johnson's rule has been proven to obtain optimal solutions for two-stage scheduling problems with a single executor [29]. Therefore, it is used for task presorting in our model. A task list ordered by Johnson's rule is called a Johnson's list, for which the following two theorems hold.
Theorem 1: Johnson's list is an optimal solution for a two-stage, single-executor scheduling problem.
Theorem 2: The subset of Johnson's list is also Johnson's list.
To improve the overall efficiency of the reinforcement-learning-based scheduling method, all the two-stage tasks are presorted into a Johnson's list using the following scheme. According to Theorem 2, the task list scheduled to each executor is a subset of the Johnson's list and is therefore also a Johnson's list. This presorting operation ensures that the parallel schedule on each individual executor is an optimal solution; the remaining issue is then how to schedule the presorted tasks to the multiple executors so as to achieve a global optimum.
As demonstrated in Algorithm 1, all the tasks are separated into two groups, G1 and G2, by comparing the two operations' execution times d_i1 and d_i2 for each task J_i. In particular, G1 contains the tasks that satisfy d_i1 ≤ d_i2, while G2 contains the tasks that satisfy d_i1 > d_i2. In addition, all the tasks in both G1 and G2 need to be sorted into Johnson's order before the two groups are finally merged.
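Algorithm 1 condenses to a few lines of code. This sketch (our own illustration, with tasks given as (d_i1, d_i2) pairs) reproduces the grouping and sorting rules just described:

```python
def johnson_presort(tasks):
    """Build a Johnson's list per Algorithm 1: G1 holds tasks with
    d_i1 <= d_i2, sorted ascending by d_i1; G2 holds tasks with
    d_i1 > d_i2, sorted descending by d_i2; the result is G1 + G2."""
    g1 = sorted((t for t in tasks if t[0] <= t[1]), key=lambda t: t[0])
    g2 = sorted((t for t in tasks if t[0] > t[1]), key=lambda t: t[1], reverse=True)
    return g1 + g2
```

Any sub-selection of the returned list preserves the relative Johnson order, which is what Theorem 2 exploits when the presorted tasks are split across multiple executors.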

C. Deep Reinforcement Learning for Edge-Enabled Scheduling
Generally, reinforcement learning is based on the Markov decision process, which is characterized by the fact that the state transition of the system depends only on the current state. Typically, Markov decision process-based reinforcement learning can be described using a four-tuple (S, A, P, R), where S is the state set at timestamp t, which describes the state space of the two-stage task scheduling, with s_t, s_{t+1} ∈ S; A indicates the action space, which includes all the possible assignments of any task J_i to any executor E_k, with a_t ∈ A; P(s_t, a_t, s_{t+1}) refers to the state transition probability when the system moves from state s_t to the next state s_{t+1} after executing action a_t; and R_t = R(s_t, a_t) is the corresponding reward obtained by executing action a_t at state s_t. Specifically, the Markov decision process of our scheduling scenario is formulated as follows.
State: The state s_t is a set of selected features of all the tasks and executors, which can be expressed as s_t = {D_n1(t), D_n2(t), C_n1(t), C_n2(t), U_m(t)}, where D_n1(t) = {d_i1} and D_n2(t) = {d_i2} denote the sets of durations (processing times) of the two stages of the tasks executed at t, respectively. C_n1(t) = {c_i1} and C_n2(t) = {c_i2} indicate the sets of completion times of the two stages of the tasks executed at t, respectively; the value of each element is initialized to 0 and is set to the corresponding completion time c_i1 or c_i2 once task J_i has finished. U_m(t) = {u_k(t)} describes the payload of all m executors, in which each u_k(t) indicates the utilization of executor E_k. In particular, the payload of each E_k at s_t can be defined as u_k(t) = c_k(t)/c_max(t), where c_k(t) is the total completion time of all tasks scheduled to executor E_k according to the Johnson's list. Note that u_k(t) is initialized to 0 at state s_0 and then ranges from 0 to 1, according to the ratio of the completion time of executor E_k to the makespan c_max(t) of the system at s_t.
Action: The action a t at t is designed to determine which executor is scheduled to process the next task according to the action value function, until all tasks are executed.
Reward: The reward R t stands for the instant reward at t when taking action a t at state s t . The goal of the DRL-based scheduler is to maximize the total rewards across all the states in S after t.
In practice, the total rewards are evaluated from the instant rewards of all future steps after t. The reward at each future step is discounted by a factor γ ∈ [0, 1], which can be interpreted as the probability of accumulating the reward score at every step and ensures the highest final return. Considering that the optimization objective in this study is to minimize the ultimate makespan c*_max and maximize the utilization of each executor, we first quantify the overall utilization of all the executors at t as U_t = (1/m) Σ_{k=1}^{m} u_k(t). (6) Note that all the tasks scheduled to each executor are presorted into a Johnson's list; according to Theorem 1, the two-stage tasks scheduled on each single executor can thus be considered an optimal solution. Therefore, the overall equilibrium of the whole payload becomes the essential issue in realizing the final optimization objective of our DRL-TSS model. Since U_t reflects the overall payload performance of the whole system based on (6), it can further be used as the instant reward at t.
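The payload and overall-utilization quantities can be sketched as follows. This is a minimal illustration, assuming U_t is the mean of the per-executor utilizations u_k(t); the function names are ours.

```python
def utilizations(completion_times):
    """u_k(t) = c_k(t) / c_max(t): each executor's completion time as a
    fraction of the current system makespan (all zeros before any task
    is scheduled)."""
    c_max = max(completion_times)
    if c_max == 0:
        return [0.0] * len(completion_times)
    return [c / c_max for c in completion_times]

def overall_utilization(completion_times):
    """Instant-reward quantity U_t, taken here as the mean payload
    across executors: a perfectly balanced schedule yields U_t = 1."""
    u = utilizations(completion_times)
    return sum(u) / len(u)
```

Because U_t peaks when all executors finish at the same time, rewarding it pushes the scheduler toward the balanced payloads that minimize the makespan.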
Accordingly, the value function v_π(s) at state s_t = s can be calculated as the expectation of the accumulated rewards after t, which can be described as v_π(s) = E(R_{t+1} + γR_{t+2} + γ²R_{t+3} + ··· | s_t = s).

Algorithm 2: Training Process of the DRL-TSS Model
1: Initialize the action value function Q and the target action value function Q̂ with weights θ̂ = θ by Eq. (9)
2: Initialize the learning step σ, greedy exploration probability ε, and discounting factor γ
3: Initialize the experience replay buffer D
4: for episode eps = 1 to MaxBatchSize do
5:     for t = 1 to n do
6:         With probability ε, select a random action a_t that assigns J_t to a random executor; otherwise, select a_t = argmax_a Q(s_t, a; θ)
7:         Obtain state s_{t+1} by executing action a_t, and calculate the reward r_t by Eq. (6)
8:         Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer D
9:         Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
10:        Calculate the target y_j = r_j + γ max_a Q̂(s_{j+1}, a; θ̂)
11:        Calculate the error e_j = (y_j − Q(s_j, a_j; θ))² and conduct a gradient descent step on e_j
12:        Reset Q̂ = Q every σ steps
13:    end for
14: end for
Finally, the action value function can be deduced as Q(s_t, a_t) = E(R_{t+1} + γ max_a Q(s_{t+1}, a) | s_t, a_t). The training process of the DRL-TSS model is shown in Algorithm 2, which is an approximation algorithm aiming to minimize the overall makespan for the edge-enabled two-stage scheduling.
This algorithm accepts a list of nonpreemptive two-stage tasks J, presorted in Johnson's order, as its input. In each training episode, the possible schedules, including the corresponding actions and state transitions, are generated based on ε-greedy exploration with ε < 10% (line 6). By executing the selected action a_t associated with the scheduled tasks in J at state s_t, the system reward r_t is calculated based on (6). To speed up the state transitions during the training process, the whole transition (s_t, a_t, r_t, s_{t+1}) is stored in the replay buffer D. We then sample a minibatch of transitions from D and train the action value function Q(s_j, a_j; θ) toward the target y_j through gradient descent (lines 9-12).
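The training loop can be illustrated with a deliberately simplified stand-in for Algorithm 2: a tabular Q-function replaces the neural network with its replay buffer and target network, the state is reduced to the index of the next task to place, and the instant reward is the overall utilization after each placement. All names and hyperparameter values here are our own choices for the sketch, not the article's settings.

```python
import random

def train_scheduler(tasks, m, episodes=200, eps=0.1, gamma=0.9, alpha=0.5):
    """Tabular sketch of DRL-TSS training: tasks (already in Johnson's
    order) are assigned one by one to one of m executors, and Q[t][k]
    scores placing task t on executor k. Returns the Q-table and the
    best makespan observed across episodes."""
    Q = [[0.0] * m for _ in range(len(tasks))]
    best = None
    for _ in range(episodes):
        c1 = [0.0] * m                          # per-executor processing completion
        c2 = [0.0] * m                          # per-executor transmission completion
        for t, (d1, d2) in enumerate(tasks):
            if random.random() < eps:           # epsilon-greedy exploration
                a = random.randrange(m)
            else:
                a = max(range(m), key=lambda k: Q[t][k])
            c1[a] += d1                         # two-machine flow-shop update
            c2[a] = max(c1[a], c2[a]) + d2
            cmax = max(c2)
            r = sum(c / cmax for c in c2) / m   # reward = overall utilization
            nxt = max(Q[t + 1]) if t + 1 < len(tasks) else 0.0
            Q[t][a] += alpha * (r + gamma * nxt - Q[t][a])
        if best is None or max(c2) < best:
            best = max(c2)
    return Q, best
```

On four identical (2, 2) tasks with two executors, a balanced split gives a makespan of 6 while the worst assignment yields 10; the learned schedule always falls in that range and typically settles on the balanced one.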

V. EXPERIMENT AND ANALYSIS
In this section, simulation-based experiments are designed and conducted to demonstrate the performance of our proposed DRL-TSS in an end-edge-cloud IoT system, compared with three baseline methods.

A. Experiment Design
To simulate the scheduling of arbitrary two-stage tasks executed on multiple executors in an end-edge-cloud IoT system, a total of 1000 tasks is considered with randomly assigned execution and transmission times. To reflect actual task scheduling in a typical IoT environment, the tasks are composed of trivial tasks (processing time: 10-100 μs), medium tasks (processing time: 100-1000 μs), and heavy tasks (processing time: 1000-10000 μs) in proportions of 30%, 45%, and 25%, respectively, according to real end-edge-cloud IoT scenarios. The number of executors varies from 2 to 64. The makespan of all the tasks, as observed by the controller, is the target metric in the experiment. To mimic a real-world situation, a random initial workload varying from 10 to 100 μs is assigned to each executor.
In addition, the DRL-TSS is compared with three two-stage scheduling algorithms, which are described as follows.
1) Default First-In-First-Out Scheduling (FIFO): FIFO executes tasks on multiple executors in the same order as the tasks arrive.
2) Preprocessing List Scheduling (PLS) [30]: PLS is a greedy algorithm that executes an ordered list of tasks by assigning them priorities in a preprocessing procedure.

3) Johnson's Rule-Based Genetic Algorithm (JRGA) [31]: JRGA executes tasks on multiple executors by incorporating Johnson's rule in the decoding process of a GA to optimize the makespan for each executor. Specifically, in this experiment, the settings for JRGA are configured as follows: the combination of the TPX crossover operator and the OM mutation operator, a population size of 20, a crossover probability of 60%, a mutation probability of 15%, and a maximum of 100 generations. The efficiency of all four methods is evaluated by the approximation ratio ρ = T_c / T_opt, (10) where T_c stands for the actual makespan spent executing all the tasks by an algorithm, and T_opt denotes the theoretically optimal makespan, which serves as the performance standard to evaluate the proposed DRL-TSS and the other methods. Note that T_opt cannot be calculated exactly, since the problem investigated in this study is NP-hard; therefore, a bound on the optimum is applied in (10) instead. In the following experiments, the approximation ratio ρ indicates how close the makespan obtained by each algorithm is to the optimal bound, and a smaller ρ is preferred.
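Since T_opt is replaced by a bound, the ratio in (10) can be computed as in the sketch below. The particular bound used here (the larger of the longest single task and the stage-1 work averaged over the m executors) is a common choice we assume for illustration, not necessarily the exact bound used in the article.

```python
def lower_bound_makespan(tasks, m):
    """A simple lower bound on T_opt: no schedule can finish before the
    longest single task (d_i1 + d_i2) ends, nor before the total
    stage-1 work spread evenly over the m executors completes."""
    longest = max(d1 + d2 for d1, d2 in tasks)
    avg_stage1 = sum(d1 for d1, _ in tasks) / m
    return max(longest, avg_stage1)

def approximation_ratio(t_c, tasks, m):
    """rho = T_c / T_opt, with the lower bound standing in for T_opt;
    the reported ratio is therefore an upper estimate of the true one."""
    return t_c / lower_bound_makespan(tasks, m)
```

Because the bound never exceeds the true optimum, a measured ratio of 1.1 guarantees the schedule is within 10% of T_opt as well.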

B. Evaluation on Learning Performance
The learning process of DRL-TSS is investigated by observing the reward gained along the iterations. We empirically set the discount factor of the DRL to 0.9 and the learning rate to 0.5. Then, 1000 tasks and 500 tasks were assigned to a typical setting of ten slave executors, respectively, to investigate how the rewards were gained in different task scheduling processes. The learning curves in these two situations are illustrated in Fig. 4.
As shown in Fig. 4, the reward curves of both 1000 tasks and 500 tasks increase rapidly and then stabilize after a few iterations (100 iterations for 1000 tasks and 20 iterations for 500 tasks) during the learning process. This convergence indicates that the proposed DRL-based scheduling method can quickly obtain a near-optimal action strategy, which can be applied to resource-constrained environments in end-edge-cloud IoT applications. In addition, it can be observed that the reward gained in the 1000-task scenario (around 28) is greater than that in the 500-task scenario (around 13). This is likely because more tasks facilitate the learning process in such a situation.

C. Evaluation on Task Scheduling Efficiency
To demonstrate the efficiency of the proposed DRL-TSS algorithm, two evaluation scenarios are designed to investigate the performances of all four scheduling algorithms via the approximation ratio ρ.
First, we demonstrate how the approximation ratio ρ changes when different numbers of tasks are scheduled on varying numbers of executors, i.e., how each algorithm's approximation ratio varies with the growth of the number of tasks. The evaluation is performed by scheduling 100-1000 tasks on 2, 4, 8, 16, 32, and 64 executors, respectively. The evaluation results of DRL-TSS and the three baseline methods are shown in Fig. 5.
As shown in Fig. 5(a)-(f), the proposed DRL-TSS is compared with PLS, FIFO, and JRGA by executing 100-1000 tasks on 2-64 executors. It can be seen that all the algorithms obtain a relatively low approximation ratio with a larger number of tasks and a higher approximation ratio in the cases with more executors. This is mainly because a more balanced scheduling situation occurs when executing more tasks on a fixed number of executors. It can be clearly observed that the proposed DRL-TSS outperforms all the baseline methods by achieving the lowest approximation ratio in all the tests, reaching a 1.1-approximation ratio when the number of tasks is 1000.
Second, we investigate how the approximation ratio ρ changes when an explicit number of tasks is assigned to different executors in the end-edge-cloud IoT environment. Different from the evaluation result shown in Fig. 5, Fig. 6 presents the influence of the varying number of executors on scheduling specific number of tasks.
As shown in Fig. 6, given a specific number of tasks, the rising approximation ratio curves demonstrate that all the algorithms perform worse as the number of executors increases. This is because the increasing number of executors brings imbalance and randomness to the scheduling process. In addition, all algorithms tend to remain relatively stable with increasing executors except JRGA, which indicates that the GA-based heuristic is affected by its complex crossover and mutation operations. In particular, the performance of the proposed DRL-TSS remains stable when the number of executed tasks increases from 500 to 1000. Moreover, our algorithm achieves the lowest approximation ratio among all the methods, approaching a 1.0-approximation ratio in the tests with 500 and 1000 tasks.

VI. CONCLUSION
The growing quantity of IoT devices demands a more efficient usage of shared computing and communication resources in an end-edge-cloud environment. This is particularly important for IoE applications, which are usually time critical and very costly to expand their computation and communication facilities. This article introduced a DRL-TSS method to find more efficient schedules that allow optimal use of the available resources, especially in a distributed IoE system that manages heterogeneous big data with multiple executors.
Following a deep learning framework for two-stage scheduling in end-edge-cloud IoT systems, two algorithms, namely, Johnson's rule-based task presorting scheme in multiple executors, and an improved DRL scheme that considers the maximal utilization of each executor in a newly designed instant reward, were developed to pursue the minimal overall makespan in dealing with the NP-hard scheduling problem. The proposed algorithm was evaluated in a simulated IoE setting compared with three existing scheduling approaches, in terms of learning efficiency and scheduling performance. The experimental results demonstrated that our proposed DRL-TSS algorithm could achieve a 1.1-approximation ratio to the optimal bound makespan when handling intensive tasks in end-edge-cloud-enabled IoE applications.
In future studies, we will investigate a more efficient deep learning scheme for scheduling optimization in IoE environments. More evaluations in more complex situations will be conducted to improve and examine the scheduling algorithm for better efficiency.

Jianhua Ma (Senior Member, IEEE) received the Ph.D. degree in information engineering from Xidian University, Xi'an, China, in 1990.
He is a Professor with the Department of Digital Media, Faculty of Computer and Information Sciences, Hosei University, Tokyo, Japan. He is one of pioneers in research on Hyper World and Cyber World (CW) since 1996. He first proposed Ubiquitous Intelligence toward Smart World, which he envisioned in 2004, and was featured in the European ID People Magazine in 2005. He has conducted several unique CW-related projects, including the Cyber Individual (Cyber-I), which was featured by and highlighted on the front page of IEEE Computing Now in 2011. He has published more than 300 papers, coauthored five books, and edited over 30 journal special issues. He has founded three IEEE Congresses on "Smart World," "Cybermatics," and "Cyber Science and Technology," respectively, as well as IEEE Conferences on Ubiquitous Intelligence and Computing, Pervasive Intelligence and Computing, Advanced and Trusted Computing, Dependable, Autonomic and Secure Computing, Cyber Physical and Social Computing, Internet of Things, and Internet of People. His research interests include multimedia, networking, pervasive computing, social computing, wearable technology, IoT, smart things, and cyber intelligence.
Dr. Ma is the Chair of the IEEE SMC Technical Committee on Cybermatics, the Founding Chair of the IEEE CIS Technical Committee on Smart World, and in the advisory board of the IEEE CS Technical Committee on Scalable Computing. He is a member of ACM.
Qun Jin (Senior Member, IEEE) received the Ph.D. degree in electrical engineering and computer science from Nihon University, Tokyo, Japan, in 1992.
He is a Professor with the Networked Information Systems Laboratory, Department of Human Informatics and Cognitive Sciences, Faculty of Human Sciences, Waseda University, Tokorozawa, Japan. He has been extensively engaged in research works in the fields of computer science, information systems, and social and human informatics. He seeks to exploit the rich interdependence between theory and practice in his work with interdisciplinary and integrated approaches. His recent research interests cover human-centric ubiquitous computing, behavior and cognitive informatics, big data, data quality assurance and sustainable use, personal analytics and individual modeling, intelligence computing, blockchain, cyber security, cyber-enabled applications in healthcare, and computing for well-being.
Dr. Jin is a Senior Member of the Association for Computing Machinery and Information Processing Society of Japan.