Hierarchical Reinforcement Learning for Real-Time Scheduling of Agile Satellites

Satellite resources are extremely scarce relative to observation demands, which makes Earth observation satellite (EOS) scheduling a problem of significant practical importance. The problem is NP-hard, so an optimal solution is difficult to obtain, and real-time scheduling makes it even more challenging. Although fruitful results have been achieved in EOS scheduling, notable limitations remain: in particular, response speed and stability are limited when scheduling urgent tasks that arrive stochastically. To overcome this obstacle, this paper proposes a reinforcement learning algorithm capable of responding quickly to urgent task scheduling. To improve scheduling stability and reduce computational complexity, the algorithm is built on a two-layer hierarchical architecture. In each layer, we adopt an online learning paradigm to explore a scheduling strategy during the learning phase. Under the algorithm, the satellite takes a scheduling action according to a certain strategy whenever an urgent task arrives randomly, and the environment feeds back the reward of the action taken. After repeated feedback, the satellite learns to select the action that yields the greatest benefit. In practical space applications, the satellite can employ the learned strategy to perform low-orbit satellite selection and observation time window (OTW) assignment for urgent tasks in stochastic scenarios, realizing an immediate schedule while maximizing scheduling stability. Finally, a numerical experiment has been performed. The simulation results demonstrate that, compared with the heuristic m-WSITF algorithm, the proposed algorithm possesses significant advantages in effectiveness and efficiency, especially in response speed and stability.


I. INTRODUCTION
In recent years, Earth observation satellites (EOSs) have played a crucial role in many domains such as marine development, disaster reduction, environmental protection, and land resource detection. Compared with the explosive growth of applications, satellites remain scarce. Task scheduling of EOSs has therefore become an active research topic for making the most of precious satellite resources, and an increasing number of countries attach importance to it.
From traditional EOSs to agile EOSs, many pioneers have contributed to the satellite scheduling problem. Wolfe et al. studied the EOS scheduling problem with a simplified model and solved it with four different algorithms [9]. The agile satellite scheduling problem has been formulated as a constraint optimization model, a new mixed-integer programming model, and an adaptive large neighborhood search framework [10]-[12].
To reduce idle time in the schedule, Peng et al. constructed a fast and efficient iterated local search algorithm that exploits the minimum transition time between two adjacent tasks [17]. Considering time-dependent constraints, Chu et al. developed an efficient branch-and-bound algorithm [18]. To improve search efficiency, He et al. proposed a self-adaptive large neighborhood search algorithm [19]. Considering that image quality is determined by the attitude angle of the satellite, Peng et al. presented an exact algorithm for single-orbit scheduling of an AEOS [20]. Several researchers have studied heuristic algorithms. To maximize total profit while ensuring fairness of resource sharing, Panwadee et al. developed an indicator-based multi-objective metaheuristic algorithm that minimizes the maximum profit difference between users [13]. Xie et al. proposed a heuristic algorithm built on a temporal conflict network that characterizes the conflicts among visible time windows [14]. Xu et al. constructed a constructive algorithm that adopts two priority-based indicators measuring the benefits and costs of different decisions [15]. To solve the master problem of maximizing the total observation profit under cloud coverage uncertainty, Han et al. developed a heuristic algorithm that includes a fast insertion strategy based on an improved simulated annealing algorithm [16]. Chen et al. proposed an efficient local search heuristic based on a fast evaluation mechanism to solve this NP-hard problem at realistic large sizes [29]. Wu et al. proposed a heuristic and exact algorithm based on a divided framework consisting of two iterative phases: task allocation among orbits and concrete task scheduling on a single orbit [30]. To solve the multiple-observation AEOS scheduling problem, Wang et al. designed an improved feedback structured heuristic by defining node and target importance factors in a multiple-observation complex-network model [31].
All the studies above focus on offline scheduling of normal tasks whose demand information is obtained in advance. Meanwhile, since the computing resources on board a satellite are limited, it is difficult to realize online scheduling with traditional algorithms, so offline scheduling, with its delayed responses and heavy computation, is no longer suitable for urgent tasks. The related literature mainly concerns scheduling methods and strategies. Several studies generated the plan entirely on the ground and then uploaded it to the satellite system [21]-[23]. In other work, the observation plan is generated online: the satellites reach a consistent scheduling strategy through local scheduling and repeated communication [24]. To realize seamless integration of task overlap, merging, and insertion, a new fault-tolerant satellite scheduling algorithm, FTSS, was introduced [26]. To minimize imaging time, Niu Xiaonan investigated a compact task merging strategy that merges existing tasks and urgent tasks into a composite task [27]. To solve urgent-task-oriented scheduling problems, Wang et al. developed two heuristic algorithms [32]. There are also several related recent technologies: Tajiki et al. proposed a multi-layer structure for a software-defined network scheduling problem with failure recovery and failure prevention, and constructed a low-computation heuristic algorithm suitable for large-scale networks [35]. Several scholars have approached the problem from a rescheduling perspective [25], [28], [33].
Although fruitful results have been achieved in EOS scheduling, most of the existing real-time scheduling literature focuses on non-agile satellites; studies on agile satellites are few, and response speed and stability remain limited.
In this paper, to improve the response speed for urgent tasks, we construct an autonomous-learning real-time scheduling (ARS) algorithm under a two-layer hierarchical architecture for multi-agile EOS scheduling, as shown in Figure 1. First, to improve scheduling stability and reduce computational complexity, we establish the hierarchical scheduling architecture. Second, a new scheduling model based on the Markov decision process (MDP) is presented for each hierarchical layer. Finally, we design an adaptive searching strategy and combine it with the well-known Q-learning reinforcement learning algorithm to solve the model optimally.
The remainder of this paper is organized as follows. Section II describes the problem. Section III constructs the hierarchical scheduling model. Section IV outlines the general procedure of the scheduling algorithm and analyzes its computational complexity. Section V presents the experimental results. Finally, the paper concludes in Section VI.

II. PROBLEM STATEMENT
With the development of agile satellites, satellite observation plays an important role in many domains. To reduce the losses of life and property caused by disasters on the Earth's surface, this paper focuses on scheduling multiple AEOSs when urgent observation tasks arrive stochastically. In stochastic environments, the well-known Q-learning algorithm can learn to perform optimal actions in controlled Markovian domains. The challenge of multi-agile-satellite real-time scheduling in this paper is how to apply Q-learning to the problem feasibly.
Offline scheduling on the ground for normal tasks and online scheduling on board for urgent tasks each have their advantages, and Chien pointed out that autonomous satellites must balance long-term and short-term aims [34]. In this paper, the low-orbit satellites observe normal tasks until an urgent task arrives. Specifically, we assume that the observation plan for normal tasks is uploaded to the satellites periodically by the ground station and that urgent tasks arrive randomly. To achieve better scheduling efficiency, we construct the scheduling model with two hierarchical layers. In the top layer, when an urgent task arrives, the high-orbit satellite, acting as the agent, chooses one of the low-orbit satellites to schedule it. The foundation layer then makes a specific response to the urgent task, with the low-orbit satellites as agents. To assign an OTW for the urgent task, a low-orbit satellite adopts a removal strategy that deletes several normal tasks. To minimize the loss of normal tasks, an urgent task is scheduled only after the current normal task has been completely observed. Since the strong maneuverability of an AEOS prolongs the visible time window (VTW), there are more choices for inserting urgent tasks. For convenience, we summarize the notation in Table 1. Moreover, to satisfy the real-time requirement and keep the scheduling stable, we stipulate that the urgent task is inserted after task m_c or m_a, or else refused; tasks m_a and m_b are thus the two candidate tasks to be replaced, as shown on the right of Figure 1.
Real AEOS scheduling is complex, involving orbital operations and the download of observation images, which makes the problem hard to solve. To simplify it, we make the following basic assumptions:
1) The normal tasks, with their observation priorities, OTWs, and attitude angles, together with the initial energy of each satellite, are uploaded to the satellite at the beginning of scheduling;
2) The download of observation images is not considered in the scheduling process, since we assume that the ground data transmission stations are sufficient for these requirements;
3) A normal task that is removed will be rescheduled in a following satellite orbit.
In the scheduling process, several necessary constraints on satellites and tasks cannot be ignored:
1) Observation opportunities: the urgent task is assigned to one low-orbit satellite and observed at most once, i.e., $\sum_{i=1}^{K} x_i \le 1$, where $K$ is the number of low-orbit satellites and the indicator $x_i = 1$ if the urgent task is assigned to satellite $i$ and $x_i = 0$ otherwise.
2) Attitude angle: the attitude angle required to observe the urgent task must lie within the maneuvering range of the chosen low-orbit satellite, i.e., $\theta_{\min} \le \theta_u \le \theta_{\max}$.
3) VTW: the AEOS has extensive choice for the OTW because of its strong attitude maneuverability. We define $d_u$ as the required observation duration of the urgent task, and $t_s$ and $t_e$ as the beginning and ending times of the urgent task's imaging; the assigned window must satisfy $t_e - t_s \ge d_u$ and lie inside the VTW. A minimal feasibility check over these constraints is sketched below.
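For illustration, the following minimal Python check sketches how these constraints could be tested when an urgent task is tentatively assigned to a satellite. The record layout and field names are our illustrative assumptions, not the paper's notation; the at-most-once constraint is enforced separately by the top layer, which selects exactly one satellite.

```python
def feasible(satellite, urgent):
    """Check the attitude-angle and VTW constraints for an urgent task.

    satellite: dict with 'theta_min'/'theta_max' (maneuvering range) and
               'vtw' as a (start, end) pair for the visible time window.
    urgent:    dict with 'theta' (required attitude angle), 'd_u'
               (required duration), and proposed imaging times 't_s', 't_e'.
    """
    in_range = satellite["theta_min"] <= urgent["theta"] <= satellite["theta_max"]
    long_enough = urgent["t_e"] - urgent["t_s"] >= urgent["d_u"]
    vtw_start, vtw_end = satellite["vtw"]
    inside_vtw = vtw_start <= urgent["t_s"] and urgent["t_e"] <= vtw_end
    return in_range and long_enough and inside_vtw
```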
Based on the above assumptions and constraints, the objective of this paper is to enhance the real-time responsiveness of the satellite to urgent tasks, represented as follows:
1) To decrease the loss of normal tasks, we minimize the number and total priority of the removed normal tasks;
2) To conserve satellite energy, the attitude transfer between the urgent task and the first removed normal task should be minimized;
3) To satisfy the real-time requirement, the response time t for the urgent task should be minimized.

III. HIERARCHICAL SCHEDULING MODEL
In this section, we formulate the task scheduling problem as an MDP model, beginning with general MDP notation followed by definitions of its elements. Classical Q-learning is modeled on a discrete-state Markov decision process. In a multi-agent system, we represent the k-th agent $U_k$ as a tuple $(S, A, P^{a}_{ss'}, R^{a}_{ss'})$, which includes a finite set of states $S$, a finite set of actions $A$, the state transition probability function $P^{a}_{ss'}$ giving the probability of moving from state $s \in S$ to the next state $s' \in S$ after performing action $a \in A$, and the reward function $R^{a}_{ss'}$ giving the reward received after performing action $a \in A$. The learning process of the satellite schedule, based on the Q-learning algorithm, is as follows.
• The autonomous learning satellite would make a state judgment of the current environment when the urgent task arrives.
• We design a searching policy combined with the Q matrix.
• The satellite selects an action a∈A at state s∈S according to the exploration strategy.
• After executing the selected action, the satellite receives a reward $R^{a}_{ss'}$, which is used to update the q value according to (5):
$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[ R^{a}_{ss'} + \gamma \max_{a'} Q(s',a') - Q(s,a) \big] \quad (5)$$
In (5), the learning rate α and the discount factor γ weight the future rewards; both parameters are analyzed at different value levels in the experiment section. As the agent explores the action space, it acquires more experience and the action selection converges gradually. The searching process then converges to an optimal decision strategy after finitely many learning steps, and the optimal state-action pair is determined by the final Q matrix. The quality of a state-action pair is measured by Q(s, a), and a minimal code sketch of this update follows. The model construction of the learning process is depicted in Figure 2.
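As a concrete illustration of the update in (5), the following minimal Python sketch performs one tabular Q-learning step. The state and action encodings and the example reward are placeholders, not the paper's implementation.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step as in (5).

    Q       : 2-D array, Q[state, action]
    s, a    : current state and the action taken
    r       : reward fed back by the environment
    s_next  : resulting state
    """
    td_target = r + gamma * np.max(Q[s_next])   # best estimated future value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q toward the target
    return Q

# Example: 6 foundation-layer states, 3 actions (after m_c, after m_a, refuse)
Q = np.zeros((6, 3))
Q = q_update(Q, s=1, a=0, r=0.7, s_next=4)
```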
At the moment an urgent task arrives, the scheduling environment is usually very complex. To apply the MDP to the real-time task scheduling model, we first design the main elements, namely the state space, the action space, and the reward function, in each scheduling layer according to its scheduling characteristics. The detailed process is presented below, and the variables involved in the model construction are summarized in Table 4.

A. FOUNDATION-LAYER
The learning process for scheduling urgent tasks in the foundation layer is shown on the right of Figure 2. First, to differentiate the state space in detail, we present the related factors in combination with the scheduling objectives.
• Attitude angle: The satellite makes a different attitude-angle transfer between the normal task and the urgent task depending on which OTW it chooses for the urgent task. The energy consumption caused by an angle transfer $\Delta\theta$ is $\pi_R \cdot |\Delta\theta|$, where $\pi_R$ is the energy consumption per unit angle shift.
• Observation duration: The satellite prefers as the first replaced task one whose duration exceeds that of the urgent task, ideally by only a small margin, which improves the robustness of the offline observation plan. Otherwise, at least two normal tasks have to be removed.
• Task priority: Replacing normal tasks of different priorities has different effects on the reward.
Based on the above discussion, we extract attitude-related, observation-duration-related, and task-priority-related variables for the division of states, e.g., $x_4 = t_b / t_u$ (10). To save the satellites' online memory, we note that when selecting the first replaced normal task, the priority and the attitude angle have the same influence trend, so we merge variables $x_1$ and $x_2$ into a single variable $x_6$. We now present the items of the model in the foundation layer.

1) STATE SPACE
We measure the observation-duration difference between tasks m_a and m_b with variables x_3, x_4, and x_5, and the attitude-angle and priority difference between the candidate replaced tasks with variable x_6. As shown in Table 2, we construct a state space of 6 states by measuring the influence trends of these variables on the selection process, so every state can be determined from Table 2. For instance, when an urgent task arrives with x_6 = 0.8, x_3 = 1.1, x_4 = 1.9, and x_5 = 1.7, Table 2 places the system in state 2 at that moment.
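For illustration, the following sketch shows how such a table lookup could be coded. The predicates are hypothetical stand-ins for the thresholds in Table 2, which is not reproduced here; only the lookup pattern is the point.

```python
def foundation_state(x3, x4, x5, x6):
    """Look up the foundation-layer state from the feature variables.

    The predicates below are hypothetical stand-ins for Table 2;
    x5 would also enter the real table.
    """
    covers = x3 >= 1.0        # candidate task long enough for the urgent task
    preferable = x6 < 1.0     # first candidate cheaper in priority/attitude
    wide_margin = x4 >= 1.5   # hypothetical duration-margin test
    if preferable and covers:
        return 2              # e.g. x6=0.8, x3=1.1, x4=1.9, x5=1.7 -> state 2
    if preferable:
        return 1
    return 4 if wide_margin else 3
    # states 5 and 6 would cover the remaining rows of Table 2
```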

2) ACTION SPACE
Whenever an urgent task arrives, the satellite considers observing it after m_c or after m_a. Since the satellite has autonomous decision-making capability, it may also refuse to observe the urgent task, but this incurs a penalty in the reward function, so as the autonomous learning progresses, the occurrence of this action decreases. These three actions constitute the action space.

3) REWARD
According to the first two objectives stated at the end of Section II, we calculate the reward function in two parts. The first part considers variable x_6, from which we define a new function y_1, where the subscript c denotes the parameters of the normal task removed first for the urgent task and, conversely, the subscript o denotes the parameters of the candidate task not selected as the first replacement. The reward of the first part is y_1.
In the second part, the reward relates to variables x_3, x_4, and x_5; the subscripts c and o in their definitions carry the same meaning as in formula (13).

The reward function of the second part is computed from these variables, and the overall reward combines the two parts, as sketched below.
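The exact reward formulas are given in the paper's equations (13)-(15), which are not reproduced here; the following sketch only illustrates the structure described above, namely a priority/attitude term built from x_6, a duration term built from x_3, x_4, x_5, and a fixed penalty for refusing the task. The functional forms and weights are assumptions.

```python
REFUSE_PENALTY = -1.0   # hypothetical penalty for refusing the urgent task

def foundation_reward(action, x3, x4, x5, x6, w1=0.5, w2=0.5):
    """Two-part reward sketch for the foundation layer.

    y1 rewards choosing the replaced task that is cheaper in priority
    and attitude transfer (x6 < 1); y2 rewards a small duration margin
    (x3, x4, x5 close to 1).  w1, w2 are assumed weights.
    """
    if action == "refuse":
        return REFUSE_PENALTY
    y1 = 1.0 / x6                                  # smaller x6 -> larger reward
    y2 = 1.0 / (abs(x3 - 1) + abs(x4 - 1) + abs(x5 - 1) + 1e-6)
    return w1 * y1 + w2 * y2
```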

B. TOP-LAYER
Suppose that our scheduling architecture consists of one high-orbit satellite and multiple low-orbit satellites. The learning process for scheduling urgent tasks in the top layer, which assigns the urgent task to one of the low-orbit satellites, is shown on the left of Figure 2. Following the scheduling process in the foundation layer, the characteristics of tasks m_c, m_a, and m_b on each low-orbit satellite matter to the satellite selection, so we summarize the related factors of top-layer scheduling.
• Energy: The high-orbit satellite preferentially allocates the urgent task to the low-orbit satellite with more remaining energy.
• Remaining time t_c: The high-orbit satellite preferentially assigns urgent tasks to the low-orbit satellite with the minimum remaining observation time of task m_c.
• Priority: We denote the average priority of tasks m_a and m_b on each low-orbit satellite by (p_a + p_b)/2. The high-orbit satellite tends to assign an urgent task to the low-orbit satellite with the maximum average priority.
Combining the above factors, we extract three related variables for the division of states: E, t_c, and (p_a + p_b)/2. To save satellite memory, and considering that the remaining time t_c and the average priority (p_a + p_b)/2 have the same influence trend on the satellite selection, we merge t_c and (p_a + p_b)/2 into a single variable F.

1) STATE SPACE
A simple way to determine the states is to consider the ordering of variables E and F across all low-orbit satellites; accordingly, we set up 36 different states, as shown in Table 3. The current state can then be read directly from Table 3 (in the example given there, the system is in state 23 at the current time). A sketch of this encoding follows.
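As an illustration of this ordering-based encoding: assuming three low-orbit satellites (the fourth satellite of the constellation being the high-orbit agent), the rank orderings of E and of F give 3! × 3! = 36 combinations, which matches the 36 states. The numbering convention below is our assumption; the paper's Table 3 fixes the actual one.

```python
from itertools import permutations

# With three low-orbit satellites, the rank orderings of E and of F
# each give 3! = 6 permutations, hence 6 x 6 = 36 top-layer states.
PERMS = list(permutations(range(3)))

def top_state(E, F):
    """E, F: length-3 lists, one value per low-orbit satellite.

    The state is the pair (ordering of E, ordering of F), encoded
    as an index in 0..35 under our assumed numbering convention.
    """
    rank_E = tuple(sorted(range(3), key=lambda i: -E[i]))  # descending energy
    rank_F = tuple(sorted(range(3), key=lambda i: F[i]))   # ascending F
    return PERMS.index(rank_E) * 6 + PERMS.index(rank_F)

# Example: satellite 1 has the most energy and the smallest F
state = top_state(E=[0.4, 0.9, 0.6], F=[0.8, 0.3, 0.5])
```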

2) ACTION SPACE
As depicted in the left half of Figure 1, when the urgent task arrives, the high-orbit satellite selects one of the low-orbit satellites to observe it. The selection among all the low-orbit satellites therefore constitutes the action space of the top layer.

Algorithm 1 Task-Assigning
Input: the set of normal tasks, the state space and action space defined above, prob = 1, the task size N
Q = 0 // a zero matrix set used to contain the q values
A = 0 // a zero matrix set used to count the learning steps of each state
Output: the new Q matrix
for i = 1 : N
    begin online scheduling
    judge which state of the state space the present state i belongs to

3) REWARD FUNCTION
According to the three objectives stated at the end of Section II and the discussion above, the reward function decreases with the parameter F of the selected low-orbit satellite and increases with its remaining energy E, with weighting coefficients a and b between 0 and 1; a sketch follows.
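Since the exact formula is not reproduced here, the following minimal sketch assumes the simple additive form a·E + b/F, which realizes the stated proportionalities; the form itself is our assumption.

```python
def top_reward(E_sel, F_sel, a=0.6, b=0.4):
    """Top-layer reward sketch: increasing in the selected satellite's
    remaining energy E and decreasing in its merged indicator F.

    The additive form a*E + b/F is an assumption; the paper only states
    the proportionalities and that the coefficients a, b lie in (0, 1).
    """
    return a * E_sel + b / F_sel
```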

IV. ONLINE SCHEDULING ALGORITHM
Note that the proposed model has an MDP structure. Hence, to solve the MDP model efficiently, we propose a Q-learning algorithm with an adaptive action selection strategy under the hierarchical architecture, which facilitates scheduling. The general framework consists of two parts, autonomous learning and authority scheduling, which are described in this section.

A. AUTONOMOUS LEARNING
The specific steps of the learning process are shown in Algorithm 1. In this algorithm, we set k as the maximum number of learning times and record the current learning count m(i) for state i. In the adaptive action selection strategy, we construct a probability heuristic indicator to guide exploration, prob(i) = 1 − m(i)/k. If prob(i) > ε, we choose the least-selected action; if prob(i) < ε, we choose the action with the largest q value. The value of prob(i) decreases as state i is visited more often, so the action with the largest q value is chosen increasingly frequently. The learning process for state i continues until prob(i) drops to 0, i.e., until the state has been visited k times.
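A minimal sketch of this adaptive selection strategy follows; the threshold ε and the tie-breaking rule are our assumptions.

```python
import numpy as np

def select_action(Q, counts, state, k, eps=0.3):
    """Adaptive action selection with prob(i) = 1 - m(i)/k.

    counts[state, action] records how often each pair has been tried;
    m(i) = counts[state].sum() is the learning count of the state.
    While prob > eps the least-selected action is explored; afterwards
    the greedy action is taken.
    """
    prob = 1.0 - counts[state].sum() / k
    if prob > eps:
        action = int(np.argmin(counts[state]))  # least-selected action
    else:
        action = int(np.argmax(Q[state]))       # greedy on learned q values
    counts[state, action] += 1
    return action
```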
The flowchart of the learning algorithm in the hierarchical architecture is shown in Figure 3. In the flowchart, co(i) and ct(j) denote the current learning iteration counts of state i in the top layer and state j in the foundation layer, respectively.
• Computational complexity: The efficiency of our algorithm can be judged from its computations as follows. First, we initialize Q for every state s ∈ S and action a ∈ A, where m and n are the numbers of states and actions, so the complexity of this step is O(mn). The complexity of selecting an action is O(n log₂ n), and the complexity of updating Q is O(n log₂ n). In summary, the complexity of our algorithm for urgent task response scheduling is O(mn) + k·O(n log₂ n), which is close to O(n³).
This learning process is pivotal for real-time scheduling, especially the updating of the q values, which guides the formulation of the scheduling strategy.

B. AUTHORITY SCHEDULING
Authority scheduling means applying the optimal strategy learned above to realistic scheduling. The satellite runs a greedy policy, taking the action with the maximum q value for urgent tasks in stochastic scenarios; in exchange for the exploration steps above, we obtain an instant response strategy, as sketched below.
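A minimal sketch of this greedy lookup across the two layers, under the state encodings sketched earlier; the matrix names and the two-step lookup are our rendering of the two-layer scheme.

```python
import numpy as np

def authority_schedule(Q_top, Q_found, top_state_id, found_state_id):
    """Greedy execution after learning: look up the learned Q matrices.

    Returns (selected low-orbit satellite, OTW action) by taking the
    maximum-q action in each layer, with no further exploration.
    """
    satellite = int(np.argmax(Q_top[top_state_id]))        # top layer: pick a satellite
    otw_action = int(np.argmax(Q_found[found_state_id]))   # foundation: pick an OTW
    return satellite, otw_action
```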

V. EXPERIMENTS
A. EXPERIMENT SETUP
To evaluate the performance of the proposed hierarchical ARS algorithm, we conducted extensive computational experiments, running a series of episodes on a centralized-distributed constellation of four satellites that learns how to schedule urgent tasks. Each episode begins with the agent selecting actions randomly, which helps the agents find the best action; exploration is achieved by taking actions based on the probability heuristic indicator. The corresponding rewards are then calculated and the Q matrix is updated, and this process repeats until the scheduling strategy converges.
To verify the efficiency of the proposed algorithm, we performed comparison experiments against the m-WSITF heuristic algorithm [33]. The parameters of the satellites, normal tasks, urgent tasks, and the experimental process are listed in Table 5.

B. EXPERIMENTAL RESULTS
To demonstrate that the ARS algorithm can learn to schedule stochastically arriving tasks, its feasibility should be tested first. Under the six combinations of parameters α and γ provided in Table 5, the reward averaged over all states during the learning process of the ARS algorithm is shown in Figure 4: the higher γ is, the better the profit, and a lower α yields a better learning effect. Consequently, to obtain a better learning effect, we set α = 0.1 and γ = 0.9 for all states in the following experiments.
The detailed learning process for the first nine states in the top layer is shown in Figure 5. The q values fluctuate heavily in the initial learning steps, which constitute the necessary exploration process; as learning progresses, the q values gradually stabilize. We thus obtain relatively convergent q values, which constitute the Q matrix of our scheduling mechanism. We also recorded the action selection process of the ARS algorithm for stochastically arriving urgent tasks: the convergence of the action selection for the first nine states in the top layer is shown in Figure 6. Action selection begins with random choices and gradually converges to the optimal action strategy as experience accumulates through the rewards fed back by the learning environment, and the converged actions are evenly distributed over the action space.
To verify the efficiency of the proposed algorithm, we compared the ARS algorithm against existing work, specifically the m-WSITF heuristic algorithm [33], using the same parameters for normal and urgent tasks as for ARS. The comparison proceeds as follows. According to the C*-driven and T-driven rescheduling principles of the m-WSITF heuristic, rescheduling does not start until the urgent tasks have accumulated to 8 or the scheduling duration has accumulated to 200 seconds. All urgent and normal tasks are then sorted by the heuristic rules: the heuristic indicator of a normal task is the ratio of its priority to its observation duration, while that of an urgent task is the ratio of the maximum priority to its observation duration. Tasks are then executed in chronological order. If an OTW conflict exists between adjacent tasks, the task with the lower heuristic indicator value is moved back by one OTW, and this moving process repeats until the task runs out of its visible observation window or the conflict disappears. Since each move increases the energy consumption and scheduling time caused by the conflict, the reward shrinks accordingly.
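The following schematic sketch renders this conflict-shift rule as we read it from [33]; the task record layout and the window bookkeeping are our assumptions, and a full implementation would re-sort the tasks after each shift.

```python
def resolve_conflicts(tasks):
    """Shift the lower-indicator task back one OTW until conflicts clear.

    Each task is a dict with 'indicator', 'otw' (index of its current
    observation window), and 'windows' (its list of candidate OTWs,
    each a (start, end) pair).  This layout is assumed for illustration.
    """
    tasks.sort(key=lambda t: t["windows"][t["otw"]][0])   # chronological order
    for cur, nxt in zip(tasks, tasks[1:]):
        while overlaps(cur, nxt):
            loser = cur if cur["indicator"] < nxt["indicator"] else nxt
            if loser["otw"] + 1 >= len(loser["windows"]):
                loser["rejected"] = True                  # out of visible window
                break
            loser["otw"] += 1                             # move back one OTW

def overlaps(a, b):
    s1, e1 = a["windows"][a["otw"]]
    s2, e2 = b["windows"][b["otw"]]
    return s1 < e2 and s2 < e1
```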
The scheduling time under urgent-task scales of 50, 100, 150, and 200 is shown on the left of Figure 7. Under the m-WSITF heuristic, the task movements and re-sequencing triggered by conflicts, driven by the heuristic indicator, increase both scheduling time and energy consumption. Under the proposed ARS algorithm, once the learning process is completed we obtain an optimal scheduling strategy for all states: after the learning iterations, the current state is determined from Tables 2 and 3, and the specific observation time for the urgent task is obtained by searching the Q matrix, without multiple moves and rearrangements. Accordingly, the time consumed by the proposed ARS algorithm is slightly shorter than that of the heuristic m-WSITF algorithm at the same task scale.
Besides, the different scheduling strategies lead to a large difference in response time from receiving the urgent task to the start of its schedule. Since the C*-driven and T-driven schedules are adopted in the m-WSITF algorithm, its scheduling of urgent tasks does not start until the urgent tasks have accumulated to 8 or the scheduling duration has accumulated to 200 seconds in our experiment, so the execution of urgent tasks is inevitably delayed. Under the proposed ARS algorithm, the urgent task is observed before the second normal task following the current one. The difference in response delay between the m-WSITF and ARS algorithms under the 50, 100, 150, and 200 urgent-task scales is shown on the right of Figure 7; the response time of m-WSITF is more than twice that of ARS.
Under the scheduling strategy of the m-WSITF algorithm, when rescheduling starts, all normal and urgent tasks are sequenced according to the heuristic rules, and when a conflict exists between adjacent tasks, the task with the higher heuristic indicator is observed first; hence not all urgent tasks can be observed. As shown in Table 6, under the four task scales, the ratio of urgent tasks observed by the proposed ARS algorithm reaches 100%, at the sacrifice of the replaced normal tasks. Conversely, at the sacrifice of uncompleted urgent tasks, the number of normal tasks replaced (NNTR) in the m-WSITF algorithm is smaller than in the ARS algorithm. In total, however, the number of urgent and normal tasks refused by the m-WSITF algorithm exceeds that of the ARS algorithm. Meanwhile, compared with the m-WSITF algorithm, the ARS algorithm is triggered more often for urgent tasks. Note that the reciprocal of the reward in m-WSITF is always higher than in ARS, because the C*-driven and T-driven strategy causes extensive OTW shifting for normal tasks, increasing both the computation and the energy consumption of the satellite.
The results above show that the agents can learn strategies for scheduling urgent tasks through Q-learning, and the experiments show that the ARS algorithm finds the best scheduling strategy in finitely many iterations.

VI. CONCLUSION
We have studied the multi-agile-satellite real-time scheduling problem for urgent tasks based on the proposed hierarchical architecture. To solve the urgent task assignment problem, we developed an adaptive action selection strategy that guides the algorithm toward the optimal solution efficiently. In our experiments, we analyzed the computational complexity and showed the convergence of the proposed ARS algorithm over the action space. By checking the Q matrix, the optimal action is selected, achieving instant scheduling. Compared with the heuristic m-WSITF algorithm, the ARS scheduling algorithm improves both the response time for urgent tasks and the average computation time of the scheduling process. When completing urgent tasks of the same scale, the m-WSITF heuristic causes more fluctuation in the normal tasks, whereas under the proposed algorithm all urgent tasks are executed and the loss of normal tasks affected by urgent tasks is reduced.
This paper provides a useful method for real-time task scheduling, but improvements to the search strategy are still possible. Future work should target more complex scheduling problems, such as stereo observation across different tracks, under this scheduling architecture; meanwhile, the scheduling mechanism will be further extended.