Quantum algorithms applied to satellite mission planning for Earth observation

Earth imaging satellites are a crucial part of our everyday lives that enable global tracking of industrial activities. Use cases span many applications, from weather forecasting to digital maps, carbon footprint tracking, and vegetation monitoring. However, there are limitations; satellites are difficult to manufacture, expensive to maintain, and tricky to launch into orbit. Therefore, satellites must be employed efficiently. This poses a challenge known as the satellite mission planning problem, which could be computationally prohibitive to solve on large scales. However, close-to-optimal algorithms, such as greedy reinforcement learning and optimization algorithms, can often provide satisfactory resolutions. This paper introduces a set of quantum algorithms to solve the mission planning problem and demonstrate an advantage over the classical algorithms implemented thus far. The problem is formulated as maximizing the number of high-priority tasks completed on real datasets containing thousands of tasks and multiple satellites. This work demonstrates that through solution-chaining and clustering, optimization and machine learning algorithms offer the greatest potential for optimal solutions. This paper notably illustrates that a hybridized quantum-enhanced reinforcement learning agent can achieve a completion percentage of 98.5% over high-priority tasks, significantly improving over the baseline greedy methods with a completion rate of 75.8%. The results presented in this work pave the way to quantum-enabled solutions in the space industry and, more generally, future mission planning problems across industries.


I. INTRODUCTION
The reliable functioning of Earth-orbiting satellites crucially affects our everyday services such as connectivity [1], navigation [2], and media [3]. Most satellites receive dynamic instructions on executing their mission in orbit, and planning the exact sequence of tasks is critical to the efficiency and sustainability of the project [4]. Algorithmic optimization solutions have been suggested [5]-[7] as a remedy to the planning problem. With the rise of quantum technologies, there is a need to explore how quantum computing can improve the time complexity or the quality of these solutions. This work focuses on planning the mission of Earth-orbiting imaging satellites using quantum machine learning [8]-[10] and optimization. Specifically, it explores using near-term quantum technologies to improve the solution's effectiveness today. Among similar approaches, Ref. [11] suggested an algorithm that tackles the scheduling task using quantum annealers. Ref. [12] reviewed the literature and found that although innovative developments existed, none revealed any practical advantage over classical approaches. The current work explores the interplay between classical and gate-based quantum computing algorithms. The practical advantage of employing this hybrid approach was shown in our earlier contributions [13]-[15].
In general, space mission planning can be a computationally hard problem [4]; in large-scale missions, the size of the problem requires a prohibitive amount of computational resources. Finding an efficient plan requires optimizing the satellite's movement and re-ordering the tasks to maximize the total number of completed tasks. In this work, each task is a request made to the satellite to capture an image of the surface of the Earth, and the aim is to maximize the number of images taken, given a list of all requests and a total available time. The imaging satellites considered here orbit the Earth exactly on the Earth's terminator, which is the line separating Earth's sunlit areas from the dark ones. In 24 hours, a satellite orbits the Earth approximately 15 times, which leads to the 15 orbits shown schematically in Fig. 3(a). To capture an image, the satellite must continuously aim the camera at the target area for a time known as the acquisition slot. Each requested image has an allocated data-take opportunity (DTO) window. To accomplish a request, the satellite must point in the appropriate direction within this window for the entire acquisition period. To aim at an area on Earth, the satellite needs to rotate its camera to point in the desired direction. This movement introduces a time delay that, when added to the acquisition time, could limit the overall agility of the satellite in covering multiple areas. An efficient satellite mission plan chooses the order of the acquisition requests to maximize the number of completed requests.
This paper discusses the benefits of the optimization and reinforcement learning algorithms over a greedy baseline algorithm, the unique advantage offered by the potential of quantum computing, and a demonstration of how quantum methods can be applied to improve machine learning and optimization models. Specifically, the novel quantum reinforcement learning approach shown in Sec. III-D4 offers a completion rate of 98.5% on highest-priority requests in a multi-satellite system. The model hybridizes the AlphaZero approach in Ref. [16] and is trained on the QMware quantum cloud [17]. Sec. II introduces the practical problem setup in more detail, including the data formatting in Sec. II-A, the re-ordering of requests in Sec. II-B, the relaying algorithm in Sec. II-C, and solution chaining in Sec. II-D. Sec. III offers three algorithms and their respective results: Sec. III-B establishes a baseline greedy algorithm, and Sec. III-C and III-D discuss the optimization and reinforcement learning algorithms. Finally, Sec. IV summarises the results.
II. SATELLITE MISSION PLANNING
Each satellite's orbit is constrained to the Earth's terminator. Each satellite can rotate at a maximum of one degree per second to orient itself appropriately to capture images within the acquisition window. Furthermore, the DTO duration for each area to be captured is defined by the arc ranging within 45° (referred to as the depointing angle) from the apex point, which is the point directly above the center of the request on Earth. As a result of the width of these arcs, multiple requests can have substantial overlap in their acquisition windows. While coordinates for image requests are provided in latitude and longitude, the satellite coordinates are given in an Earth-centered inertial (ECI) format, which describes the satellite position in three-dimensional Cartesian space. Finally, two datasets are tested in this work: a single-satellite set containing 462 requests and a two-satellite set with 2000 requests. The main results of this work focus on the performance of various algorithms on the latter dataset and only for the requests with the highest priority.

A. Data Formatting
Two information sets were used in the preparation of this paper:
• information about the satellite motion, including the orbit number, time stamp, and satellite position and velocity in Cartesian ECI coordinates; and
• information about the acquisition requests, including the request ID, the request priority ranging from 1 to 4 (where 4 denotes the lowest priority), the start and end times of the DTO window of the request, the coordinates of the start and end of the median line, the satellite ID, and Boolean values indicating the progress of the acquisition.
Fig. 1 demonstrates the satellite movements during acquisition and the data that must be tracked during the capture, such as the acquisition angles, the points at which the DTO begins and ends, and the median coordinates.

B. Request Priority Ordering
Considering the priority associated with each request, the objective can be expressed as maximizing the number of completed requests in order of priority: the highest-priority requests are considered first, and lower-priority requests are considered only afterwards. Two solutions are compared by the number of highest-priority requests accomplished, moving on to the next priority level only in case of equality. This work, however, judges the algorithms purely by their completion rate on the highest-priority ($\pi_1$) requests.
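The priority-ordered comparison described above behaves like a lexicographic comparison of completion counts. The following is a minimal Python sketch of that idea; the names `completion_key` and `better_plan` and the example counts are illustrative assumptions, not part of the original implementation.

```python
def completion_key(completed_by_priority):
    """Return completion counts ordered from priority 1 (highest) to 4 (lowest)."""
    return tuple(completed_by_priority.get(p, 0) for p in (1, 2, 3, 4))


def better_plan(plan_a, plan_b):
    """Pick the plan that completes more high-priority requests,
    falling back to the next priority level only on a tie."""
    # Python compares tuples lexicographically, which matches the rule above.
    if completion_key(plan_a) >= completion_key(plan_b):
        return plan_a
    return plan_b


# Hypothetical completion counts per priority level for three plans.
plan_a = {1: 40, 2: 10, 3: 5, 4: 0}
plan_b = {1: 40, 2: 12, 3: 0, 4: 0}
plan_c = {1: 39, 2: 50, 3: 50, 4: 50}
```

Here `plan_b` beats `plan_a` on the second priority level after a tie on the first, and it beats `plan_c` on the first level regardless of how many lower-priority requests `plan_c` completes.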

C. Relaying Algorithm
To move from one request to another, the satellite must rotate, and the relaying algorithm computes the time this rotation takes. The two points of interest are the final median point of the first request and the initial median point of the second request. These points correspond, respectively, to the time at which the first request is completed and the time at which the next request begins. For a given transition, the relaying algorithm operates in two steps: 1) it computes the position of each median point relative to the satellite, and 2) it calculates the angle, in degrees, between these two vectors. Assuming that the satellite rotates at a constant speed of one degree per second and that the Earth is a perfect sphere, the resultant angle approximates the total relaying time in seconds.
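The two steps of the relaying algorithm can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the Earth radius value, the helper names `ground_point` and `relaying_time_s`, and the use of a single satellite position for both median points (ignoring satellite motion and Earth rotation during the maneuver) are assumptions made here, not details from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a perfectly spherical Earth


def ground_point(lat_deg, lon_deg):
    """Convert latitude/longitude to Cartesian coordinates on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (EARTH_RADIUS_KM * math.cos(lat) * math.cos(lon),
            EARTH_RADIUS_KM * math.cos(lat) * math.sin(lon),
            EARTH_RADIUS_KM * math.sin(lat))


def relaying_time_s(sat_pos, end_of_first, start_of_second, rate_deg_per_s=1.0):
    """Approximate relaying time as the angle between the two pointing vectors.

    sat_pos is a Cartesian position in km; the two requests are given as
    (lat, lon) pairs of their final and initial median points.
    """
    # Step 1: positions of the median points relative to the satellite.
    v1 = [g - s for g, s in zip(ground_point(*end_of_first), sat_pos)]
    v2 = [g - s for g, s in zip(ground_point(*start_of_second), sat_pos)]
    # Step 2: angle between the two vectors, in degrees.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    angle_deg = math.degrees(math.acos(cos_angle))
    # At one degree per second the angle equals the time in seconds.
    return angle_deg / rate_deg_per_s
```

With the default rate of one degree per second, the returned value is numerically equal to the angle, which mirrors the approximation stated in the text.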

D. Solution Chaining
Solution chaining considers the DTO windows, their overlaps, and the length of each window compared to an acquisition time estimate. DTO windows and possible solutions within those windows are then visualized over a time axis (see Fig. 2b). The most efficient preliminary method for this setting is to sort the requests by the start of their DTO windows and map the acquisitions.
Next, the individual requests are chained: the time between consecutive captures is calculated using the relaying algorithm, and the algorithm ensures that there is sufficient time for the relaying movement between each pair of consecutive requests, so that the sequence of requests can be executed in the provided time. Chaining enables the suggested solutions within the DTO windows of each request, which are otherwise independently scattered, to be connected into one comprehensive solution for satellite movement, as shown in Fig. 2.
One way to deal with increasing problem complexity is to reduce the size of the dataset and handle fewer parameters or data points at once. Clustering is therefore useful to stratify the data by similarity and shorten the calculations necessary for the overall program. The simplest method to cluster the dataset is sorting the samples based on a single feature of the data, a method known as bunching. The bunching algorithms are explained in App. A, but this section focuses on unsupervised-learning clustering methods such as the K-means algorithm [18], [19].
K-means is a well-known clustering algorithm, and its use of physical distances in creating clusters makes it a natural fit for the space mission planning problem. The algorithm is iterative: it repeats the same steps until it converges to a solution. It starts by initializing k random points to represent the k cluster centroids. In the assignment step, each point in the dataset is assigned to the closest centroid. In the update step, the values of all points assigned to each centroid $k_n$ are averaged, and $k_n$ is updated to the newly obtained mean. These two steps are repeated until the centroids no longer shift between two iterations (generally up to an error threshold), at which point the final clustering of the dataset is attained. In this work, the K-means algorithm clustered the dataset based on all the features of the requests: the DTO start and end times and the coordinates of the start and end of the median line. This ensured that the clusters group the requests by both their DTO windows and their geographical locations. The latter is crucial for using the relaying algorithm efficiently, as geographically close requests are simpler to navigate between.
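The assignment and update steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function name `kmeans`, the choice to initialize centroids by sampling existing data points, and the toy two-dimensional data are assumptions made for illustration.

```python
import random


def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups with Lloyd's iterations."""
    rng = random.Random(seed)
    # Assumption: centroids are initialized from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        labels = []
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            j = dists.index(min(dists))
            labels.append(j)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        shift = max(sum((a - b) ** 2 for a, b in zip(c, n)) ** 0.5
                    for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # centroids stopped moving: converged
            break
    return labels, centroids


# Toy data: two well-separated groups standing in for request features.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

In practice the request features (DTO times and median-line coordinates) have different units, so some rescaling before clustering would likely be needed; the text does not specify how this was handled.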

B. Greedy algorithm
The classical greedy algorithm is used to provide a reference solution. First, the orbital and request data are separated by satellite, and then the requests are clustered. In each cluster, the algorithm begins at the satellite's given start time and increments the time by one second until it enters the DTO window of the first request. Once the timestamp enters a DTO window, the algorithm checks whether there is enough time to complete the request. If so, it completes the request, increments the time by the sum of the relaying time and the acquisition time, adds the request ID to a list of completed IDs, and moves on to the next request. If multiple requests are available at a timestamp, the algorithm chooses the one with the highest priority. If there is insufficient time to complete a request, it simply moves on to the next request and repeats the process. Once the final request is completed, or the timestamp exceeds the final DTO window, the algorithm outputs the percentage of completed requests of each priority.
An important limitation of the greedy algorithm is its limited capacity for anticipation, but it provides a baseline for comparison. Its simplicity allows it to run faster than other models, making it a good benchmark for data preprocessing. The results in Fig. 3(b) and Tab. I show that the greedy algorithm struggles as the complexity of the input data increases: it solves the 1sat/462req dataset well but struggles on the 2sat/2000req dataset. Therefore, algorithms that can handle high levels of complexity without compromising accuracy are required. The greedy results serve as the baseline in this work, and the 2sat/2000req dataset was chosen as the comparison dataset, as it leaves more room for improvement. Furthermore, only the top-priority $\pi_1$ requests were taken as the baseline, so the models' performance is compared with the greedy $\pi_1$ completion rate of 75.8% on the 2sat/2000req dataset.
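The greedy procedure for a single cluster can be sketched as follows. This is a simplified illustration rather than the authors' code: the dictionary field names, the `relay_time` callback, and the constant relaying time in the example are assumptions.

```python
def greedy_schedule(requests, start_time, relay_time):
    """Greedy baseline for one satellite and one cluster.

    requests: list of dicts with keys id, priority (1 = highest),
              dto_start, dto_end, acq_time (all times in seconds).
    relay_time(prev, nxt): relaying time in seconds between two requests.
    """
    pending = sorted(requests, key=lambda r: r["dto_start"])
    t, prev, completed = start_time, None, []
    horizon = max((r["dto_end"] for r in pending), default=start_time)
    while pending and t <= horizon:
        # Requests whose DTO window is open at the current timestamp.
        open_now = [r for r in pending if r["dto_start"] <= t <= r["dto_end"]]
        if not open_now:
            t += 1  # advance one second until a window opens
            continue
        req = min(open_now, key=lambda r: r["priority"])  # highest priority first
        pending.remove(req)
        duration = req["acq_time"]
        if prev is not None:
            duration += relay_time(prev, req)
        if t + duration <= req["dto_end"]:
            t += duration
            completed.append(req["id"])
            prev = req
        # Otherwise the request is skipped and the next one is considered.
    return completed


example_requests = [
    {"id": "A", "priority": 2, "dto_start": 0, "dto_end": 10, "acq_time": 4},
    {"id": "B", "priority": 1, "dto_start": 0, "dto_end": 6, "acq_time": 5},
    {"id": "C", "priority": 1, "dto_start": 8, "dto_end": 20, "acq_time": 3},
]
# Assumed constant relaying time of 2 s between any two requests.
example_plan = greedy_schedule(example_requests, 0, lambda prev, nxt: 2)
```

In this example the algorithm takes request B first because of its priority, then finds that A no longer fits in its window and skips it, which illustrates the lack of anticipation discussed above.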

C. Optimization Methods
As discussed earlier, one potential way to improve the runtime of an algorithm was to break the data into smaller clusters to reduce the necessary processing power. However, attempting to build a fundamentally more powerful algorithm, such as an optimization model, could be more fruitful. Optimization problems involve information and formulations, including graphs, movement patterns and permutations, and multiple viable solutions, out of which one ideal solution must be determined by the metrics and constraints of the problem. As such, optimization methods are well suited for mission planning problems akin to the one at hand, which aims to map out the best possible course of action for a system of satellites seeking to maximize the number of completed requests. Optimization, however, is expensive in terms of time and computation; consequently, the potential of harnessing the quantum advantage for optimization problems could be especially valuable in discovering and delivering a large speedup for mission planning problems.
Some research for optimization for space mission planning currently exists. For instance, in Ref. [20], the problem of Earth observation from a satellite (EOS) is investigated, focusing on obtaining images of certain areas of the Earth's surface related to customer requests. An optimization approach for EOS's daily photo selection (DPSP) is proposed. DPSP is related to operational management and planning processes, where each photo the client orders brings profit. Still, not all requests can be satisfied due to physical and technological limitations. The objective is, therefore, to select a subset of queries for which the profit is maximized, and the proposed algorithm is based on the metaheuristic ant colony optimization algorithm (ACO). Examples based on real data are used as reference problems. The calculations show that the proposed algorithm can generate competitive and promising solutions.
The most well-known quantum-friendly optimization method is the quadratic unconstrained binary optimization (QUBO) model, which encapsulates the set of all optimization problems with the following attributes:
• all variables are treated as binary objects;
• constraints, while not necessarily absolute, are enforced through the reward function.
This work explores a QUBO-adaptable formulation known as the integer optimization model for the space mission planning problem. This formulation has benefits and drawbacks but can serve as a viable solution to the satellite optimization problem.
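To make the two QUBO attributes concrete, the following toy Python sketch minimizes an energy of the standard form $x^T Q x$ over binary vectors by brute force. The three-request example, its weights, and the penalty value are hypothetical and chosen only for illustration; brute force is viable only for a handful of variables.

```python
from itertools import product


def qubo_energy(Q, x):
    """Energy x^T Q x of a binary vector x for a square matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def solve_qubo_brute_force(Q):
    """Enumerate all binary vectors and return the one with the lowest energy."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))


# Hypothetical example: x_i = 1 means request i is scheduled.
# Diagonal terms reward completed requests (negative energy);
# the off-diagonal penalty discourages scheduling requests 0 and 1 together.
weights = [3, 2, 1]
penalty = 10
Q = [[-weights[0], penalty, 0],
     [0, -weights[1], 0],
     [0, 0, -weights[2]]]
best_x = solve_qubo_brute_force(Q)
```

The conflict between requests 0 and 1 is not a hard constraint here: it is enforced only because the penalty outweighs any reward gained by violating it, which is the sense in which constraints enter through the reward function.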
The integer optimization model is conceptualized by encoding the desirable outcome of the solution in the cost function, which should be maximized or minimized over the lattice of integer points within a feasible subset of the multidimensional space. This feasible subset is defined by the constraints of the satellite mission planning problem. One of the most powerful approaches to such problems is the branch-and-cut algorithm [21], which is used in solvers. In addition, the following functions are incorporated to streamline the calculations and keep the algorithm within the constraints. These functions, built on the Orekit space flight dynamics library [22], are key to accurate space mission planning and therefore crucial to developing viable solutions.
• Attitude Pointing: given the satellite, timestamp, and a pair of longitude and latitude coordinates, this function returns the attitude (roll, pitch, and yaw) angles of the satellite when it is pointed towards the provided coordinates.
• Acquisition Duration: given the starting and ending median coordinates of any request, this function returns the amount of time, in milliseconds, that the satellite would take to acquire the image.
• Maneuver Duration: given the beginning and ending attitude angles (roll, pitch, and yaw), this function returns the amount of time necessary for the satellite to complete the maneuver from one attitude to the other.
• Read Ephemeris: given a JSON file and a satellite ID, this function returns the satellite orbital information (directional speed, directional position, timestamps, etc.) in the form of an Orekit Ephemeris object, which is then used to perform calculations for other functions using the Orekit library.
Upon incorporating the above functions, revising the constraints of this integer-based model, and modifying the clustering algorithm to generate a greater number of smaller clusters, the algorithm's runtime was maintained despite the significant increase in complexity. Given a set of requests, the variables central to this model include the priority of each request, the acquisition duration of each request, the start and the end of the DTO windows, and the location of the median points of the requests. From this, the algorithm calculates, for each reasonable pair of requests inside a small cluster, all possible start times for the first request that allow the relaying maneuver to the second request. This is achieved through simple iteration using the Orekit functions described above: an initial value for the relaying maneuver duration is set to one second. Then, the maneuver start and end angles are calculated with the attitude pointing function.
Afterwards, the maneuver duration function gives the rotation time between the obtained angles. If this rotation time does not exceed one second and the DTO limits are not violated, the relaying time is taken to be one second. Otherwise, the duration is increased by one second, and the procedure is repeated until either the duration is sufficient or the DTO is violated. The result of these computations, $t_{\min}$, is treated as the minimum relaying maneuver time.
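The iteration for $t_{\min}$ can be sketched as a short loop. The sketch below is a hedged illustration: `attitude_second` and `maneuver_duration` are hypothetical stand-ins for the Orekit-based attitude pointing and maneuver duration functions, attitudes are reduced to a single angle, and the toy geometry in the example is invented.

```python
def min_relaying_time(att_end_first, attitude_second, maneuver_duration,
                      t_end_first, acq_time_second, dto_end_second, step=1.0):
    """Smallest candidate duration that is long enough for the maneuver.

    att_end_first: attitude at the end of the first acquisition.
    attitude_second(t): attitude needed to start the second request at time t.
    maneuver_duration(a, b): rotation time in seconds between two attitudes.
    Returns None if the DTO of the second request is violated first.
    """
    d = step  # initial guess: one second
    # Keep trying while the second acquisition would still end inside its DTO.
    while t_end_first + d + acq_time_second <= dto_end_second:
        att_start = attitude_second(t_end_first + d)
        if maneuver_duration(att_end_first, att_start) <= d:
            return d  # candidate duration suffices
        d += step  # otherwise allow one more second and retry
    return None


# Toy one-dimensional geometry (assumption): the required pointing angle of the
# second request drifts as the satellite moves, and the satellite rotates at
# one degree per second.
t_min = min_relaying_time(
    att_end_first=0.0,
    attitude_second=lambda t: 10.0 - 0.5 * t,
    maneuver_duration=lambda a, b: abs(a - b) / 1.0,
    t_end_first=0.0,
    acq_time_second=3.0,
    dto_end_second=20.0,
)
```

Because the target attitude depends on when the maneuver ends, the duration cannot be computed in one shot, which is why the text resorts to iteration.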
Given a set of requests $f \in F$, the priority of each request is denoted as $\pi_f$, and the acquisition duration as $\tau_f$. The start of the DTO window (release time) for request $f$ is represented as $r_f$, and the deadline of the DTO window as $d_f$. In addition, $Q \subset \mathbb{N}$ indicates the order of accepted requests, so that the index $q \in Q = \{0, \ldots, |Q| - 1\}$ denotes the position in the queue at which request $f$ is accepted. The discretized rotation of the satellite at the start of the acquisition of request $f$ is indicated by $\alpha^{\mathrm{start}}_f \in A^{\mathrm{start}}_f$. Similarly, $\alpha^{\mathrm{end}}_f \in A^{\mathrm{end}}_f$ is the discretized rotation of the satellite after the acquisition of the request. All possible start times $b_{f\alpha}$ for a request can be arranged in order of increasing time. Note that $b_{f\alpha}$ ranges over the points in time, later than $r_f$, at which it is possible to complete request $f_1$ and then move to request $f_2$; each possible angle corresponds to one of these moments. The set of pairs $(f_1, f_2)$ for which the maneuver $f_1 \to f_2$ is possible is denoted as $L$. The binary variable $x_{fq}$ is introduced such that
$$x_{fq} = \begin{cases} 1, & \text{if request } f \text{ is accepted at position } q \text{ of the queue}, \\ 0, & \text{otherwise.} \end{cases}$$
In addition, the variable $y_{f\alpha}$ is introduced, where
$$y_{f\alpha} = \begin{cases} 1, & \text{if angle } \alpha \text{ is the start angle for request } f, \\ 0, & \text{otherwise.} \end{cases}$$
Finally, the indicator variable $\kappa_{f_1 f_2}$ shows whether two requests were completed successively, one directly after the other:
$$\kappa_{f_1 f_2} = \begin{cases} 1, & \text{if request } f_2 \text{ is completed directly after request } f_1, \\ 0, & \text{otherwise.} \end{cases}$$
The cost function combines the weights $J_f$ of the completed requests with coefficients $\gamma_{f\alpha}$ that penalize time-consuming solutions. $J_f = 1$ was selected for the lowest-priority requests in each cluster. For each higher priority level, the weight exceeds the sum of the weights of all lower-priority requests by one, since each higher-priority request is valued more than any number of lower-priority requests. The coefficients $\gamma_{f\alpha}$ range from 0 to $1/|Q|$, which guarantees that $\sum_{f \in F} \sum_{\alpha \in A^{\mathrm{start}}_f} \gamma_{f\alpha} y_{f\alpha} < 1$. Consequently, completing even the lowest-priority request is more important than the particular order of requests, while the earliest possible completion of each request is still preferred.
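The weighting rule for $J_f$ can be checked with a small Python sketch. The function name `priority_weights` and the example request counts are illustrative assumptions; only the rule itself (lowest priority gets weight 1, each higher level exceeds the total weight of everything below it by one) comes from the text.

```python
def priority_weights(counts):
    """Compute the weight of one request at each priority level.

    counts: {priority: number of requests in the cluster}, where a larger
            priority number means a lower priority (4 is the lowest).
    """
    weights, lower_sum = {}, 0
    for p in sorted(counts, reverse=True):  # start from the lowest priority
        # One more than the total weight of all lower-priority requests.
        weights[p] = lower_sum + 1
        lower_sum += weights[p] * counts[p]
    return weights


# Hypothetical cluster: three priority-4, two priority-3, one priority-1 request.
example_weights = priority_weights({4: 3, 3: 2, 1: 1})
```

For this cluster the weights are 1, 4, and 12: a single priority-3 request outweighs all three priority-4 requests (4 > 3), and the priority-1 request outweighs everything else combined (12 > 3 + 8).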
As was the case with the other optimization model, no two requests can occupy the same position in the queue, and each request is completed at most once. Any completed request starts with exactly one of its possible angles. All requests are completed one after another, without empty slots in the queue, until the satellite stops. To evaluate the variable $\kappa_{f_1 f_2}$ appropriately, a system of linear constraints is introduced, each of which excludes impossible values of $\kappa_{f_1 f_2}$. Firstly, if requests $f_1$ and $f_2$ are completed directly one after another, then $\kappa_{f_1 f_2} = 1$. Each request is followed by at most one request and follows at most one request. If one request follows another, then both requests must have been completed at some position in the queue. Finally, with the use of a predefined mapping $M_{f_1 f_2}$, if request $f_1$ starts with angle $\alpha_1$ and request $f_2$ is the next one, then the next acquisition start angle is fixed to $\alpha_2 = M_{f_1 f_2}(\alpha_1)$.
Additionally, the maneuver $f_1 \to f_2$ is possible only with particular initial angles for $f_1$; otherwise, the satellite will not have enough time to finish the acquisition of request $f_1$ and move to $f_2$. On the other hand, if $f_1 \to f_2$ is an impossible transition, then $\kappa_{f_1 f_2}$ is forced to zero. The final constraint fixes the acquisition start angle for the first request in the queue as the earliest possible angle, $\alpha_0$, which corresponds to the beginning of the DTO for request $f$.
In contrast to the greedy algorithm, which considered the requests in the order of open DTO windows, this model is more intelligent in finding the best path to fit the maximal number of requests. In the end, the $\pi_1$ requests reached 98.1% completion using the Gurobi solver [23]. The solution obtained through optimization is partly depicted in Fig. 4. Furthermore, the clusters are connected by calculating the minimum relaying maneuver time from the last request of the previous cluster to the requests of the next. This is illustrated in Fig. 4 as cutting the beginning of a DTO if it proves impossible to rotate towards the request in this period. After this procedure, the next cluster can be treated as independent.
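For orientation, one possible linear formulation consistent with the verbal description of the objective and constraints is sketched below. It is a reconstruction from the prose under stated assumptions, not the authors' equations: the exact form of each inequality, the sign convention of the objective, and the index ranges are assumptions.

```latex
% Sketch only: reconstructed from the prose, not taken from the paper.
\begin{align}
  % Objective: reward completed requests, penalize late start angles.
  \max \quad & \sum_{f \in F} \Big( J_f \sum_{q \in Q} x_{fq}
      - \sum_{\alpha \in A^{\mathrm{start}}_f} \gamma_{f\alpha}\, y_{f\alpha} \Big) \\
  % At most one request per queue position; each request at most once.
  \text{s.t.} \quad & \sum_{f \in F} x_{fq} \le 1 \quad \forall q \in Q,
      \qquad \sum_{q \in Q} x_{fq} \le 1 \quad \forall f \in F, \\
  % A completed request starts with exactly one angle.
  & \sum_{\alpha \in A^{\mathrm{start}}_f} y_{f\alpha} = \sum_{q \in Q} x_{fq}
      \quad \forall f \in F, \\
  % No empty slots in the queue.
  & \sum_{f \in F} x_{f,q+1} \le \sum_{f \in F} x_{fq}
      \quad \forall q < |Q| - 1, \\
  % Consecutive positions force kappa to one.
  & \kappa_{f_1 f_2} \ge x_{f_1 q} + x_{f_2, q+1} - 1
      \quad \forall f_1 \ne f_2,\ \forall q < |Q| - 1, \\
  % At most one successor and one predecessor.
  & \sum_{f_2} \kappa_{f_1 f_2} \le 1 \quad \forall f_1,
      \qquad \sum_{f_1} \kappa_{f_1 f_2} \le 1 \quad \forall f_2, \\
  % A transition implies both requests were completed.
  & \kappa_{f_1 f_2} \le \sum_{q \in Q} x_{f_1 q},
      \qquad \kappa_{f_1 f_2} \le \sum_{q \in Q} x_{f_2 q}, \\
  % The start angle of the next request is fixed by the mapping M.
  & y_{f_2 \alpha_2} \ge y_{f_1 \alpha_1} + \kappa_{f_1 f_2} - 1,
      \quad \alpha_2 = M_{f_1 f_2}(\alpha_1), \\
  % Impossible transitions are excluded.
  & \kappa_{f_1 f_2} = 0 \quad \forall (f_1, f_2) \notin L, \\
  % The first request in the queue starts at the earliest angle.
  & y_{f \alpha_0} \ge x_{f0} \quad \forall f \in F.
\end{align}
```

All variables are binary, so a formulation of this kind can be passed to an integer programming solver, or converted to QUBO form by moving the constraints into penalty terms.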
One of the greatest benefits of this model is its compatibility with both near-term and long-term quantum technology. As a linear optimization model that uses a grid of binary parameters, it can be transformed into a QUBO, as shown in App. B, and mapped rather directly to the quantum Ising model, a model of qubits arranged on a grid whose spin states depend on their neighbors. Furthermore, as this model functions akin to a minimization problem, it is well suited to a quantum annealing machine [24], which solves optimization problems by gradually changing the Hamiltonian from an initial one with a known minimum to a final one whose minimum represents the optimal solution. However, solving the satellite mission planning problem in QUBO form is difficult on currently available classical or quantum devices. For example, both D-Wave's Leap Hybrid solver [25] and Gurobi have difficulties solving even a small cluster with 4 requests within a time limit of 1 minute. Still, linear programming can obtain the solution for the 2000 requests and 2 satellites dataset in less than 3 minutes. This runtime was achieved by decomposing the clusters so that their final size was small enough to achieve approximately the same time as the greedy algorithm. Note that if the same number of requests were considered simultaneously, the runtime would be noticeably longer.

D. Reinforcement Learning
Reinforcement learning (RL) is a machine learning paradigm in which an agent interacts with some environment and trains through informed trial and error. The RL training algorithm uses reward functions to assign value to the agent's actions in any state of the environment. Generally, a state can have constraints and features. The RL agent can use a policy model to decide on an action given a state, which subsequently affects the environment and transforms the state. If the resultant state of the environment is engineered to carry a positive reward, the policy model is trained to take the actions that maximize the probability of achieving that reward. The environment is described by a triplet $(S, A, P)$, where $S$ is the state space, $A$ is the action space, and $P$ is the transition function. When the reward function $r : S \times A \to \mathbb{R}$ is factored in, a Markov Decision Process (MDP) is obtained, defined by the tuple $(S, A, P, r)$. The agent starts from state $s_0$ and takes action $a_0$, for which it obtains the reward $r_0$; repeating this at each step produces trajectories $T := (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$, on which the agent is trained.
1) RL Environments: Two environments are developed for this AEOS task: a satellite-centred and a request-centred environment. For each environment, requests are sorted by their DTO open time.
The satellite-centred environment: the agent views the problem from the satellite's perspective and considers the 100 closest (by DTO open time) data points for each satellite, each utilizing 10 data parameters. Moreover, three additional features were also made available to the agent: the current time for the satellite and the latitude and longitude of the starting point of the last completed request, creating a total observation space of size 1003. The number of nearest requests is a variable that depends on task size. A Boolean flag is one of the features kept for each request, marking each data point as either complete or incomplete to avoid redundancy, and the observation space is made to only consider requests that are marked as incomplete to expedite the computations required by the agent, which is built as a neural network.
Once the request is selected, the agent is rewarded with 1 if that request is completed and 0 otherwise. As this process recurs, the agent attempts to complete as many requests as possible until the satellite's time is later than the DTO windows of all remaining requests, meaning they can no longer be completed. The agent works with each satellite and predicts which request must be done next. The data used in this environment was artificially generated.
The request-centred environment: at each step, the agent views the problem from the perspective of a request, deciding which satellite is best suited to complete it. The 5 nearest request options are determined by DTO open time for each satellite, and the request to execute is chosen by the minimum request execution time. This minimum time is calculated as the sum of the satellite timestamp, the maneuver duration, and the acquisition time, following solution chaining. It must be less than the request's DTO end time for the request to be completed. This procedure is then iterated with the next batch of 5 nearest satellites.
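The selection rule of the request-centred environment can be sketched as a small search over satellite and request pairs. This is an illustrative reading of the description above, not the environment's actual code: the field names, the `maneuver_s` callback, and the example numbers are assumptions, and waiting for a DTO window to open is ignored for brevity.

```python
def pick_assignment(satellites, pending, maneuver_s, n_nearest=5):
    """Choose the (satellite, request) pair with the smallest execution time.

    satellites: list of dicts with keys id and time (current timestamp, s).
    pending: list of incomplete requests with keys id, dto_start, dto_end, acq_time.
    maneuver_s(sat, req): maneuver duration in seconds (hypothetical callback).
    Returns (satellite id, request id, finish time) or None if nothing fits.
    """
    best = None
    by_open_time = sorted(pending, key=lambda r: r["dto_start"])
    for sat in satellites:
        # Only the nearest requests by DTO open time are considered.
        for req in by_open_time[:n_nearest]:
            # Execution time: timestamp + maneuver + acquisition.
            finish = sat["time"] + maneuver_s(sat, req) + req["acq_time"]
            feasible = finish < req["dto_end"]
            if feasible and (best is None or finish < best[2]):
                best = (sat["id"], req["id"], finish)
    return best


example_sats = [{"id": "S1", "time": 0}, {"id": "S2", "time": 10}]
example_pending = [
    {"id": "R1", "dto_start": 0, "dto_end": 30, "acq_time": 5},
    {"id": "R2", "dto_start": 0, "dto_end": 12, "acq_time": 4},
]
# Assumed maneuver durations in seconds for each pair.
example_maneuvers = {("S1", "R1"): 8, ("S1", "R2"): 9,
                     ("S2", "R1"): 1, ("S2", "R2"): 1}
example_choice = pick_assignment(
    example_sats, example_pending,
    lambda sat, req: example_maneuvers[(sat["id"], req["id"])],
)
```

In the example, request R2 cannot be finished before its DTO closes by either satellite, so the rule assigns R1 to S1, which finishes it earliest.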
2) Proximal Policy Optimization: The Proximal Policy Optimization (PPO) algorithm, shown in Fig. 5 and first introduced in Ref. [26], was implemented to provide a mission planning policy. In RL, a policy is a mapping from the state space to the action space. The agent learns the best action for each situation by calculating policy gradients. In other words, the agent uses gradient-based optimization to estimate the expected value of each action in a given state and determine which action is likely to yield the highest reward. The PPO objective is as follows:
$$L^{\mathrm{CLIP}}(\theta) = \bar{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\bar{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\bar{A}_t\right)\right], \tag{20}$$
where $\bar{\mathbb{E}}_t$ is the current expected value of the policy, $\theta$ is the policy parameter, $\epsilon$ is the clipping hyperparameter, $\bar{A}_t$ is the estimated advantage at time $t$, and $r_t(\theta)$ is the importance sampling ratio. This ratio, derived from Monte Carlo sampling methods [26], is the ratio of the probabilities of the chosen action under the new and the old policies. By keeping the value of $\epsilon$ small, the model ensures that the update of the policy at each increment is not too large; as a consequence, the learning done by the model stays relevant.
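The effect of the clipping hyperparameter can be seen in a minimal Python sketch of the clipped surrogate objective from Ref. [26]. The function name and the sample probabilities are illustrative assumptions; a real implementation would operate on batches of log-probabilities inside an automatic differentiation framework.

```python
def ppo_clip_objective(new_probs, old_probs, advantages, eps=0.2):
    """Average clipped surrogate objective over a batch of time steps.

    new_probs / old_probs: probability of the taken action under the new and
    old policies; advantages: estimated advantage of each action.
    """
    terms = []
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old  # importance sampling ratio r_t(theta)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # The minimum makes the objective a pessimistic bound, so large policy
        # changes are not rewarded beyond the clipping range.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

For a positive advantage, raising the action probability by more than a factor of $1 + \epsilon$ yields no further gain; for a negative advantage, the unclipped term is kept whenever it is worse, so harmful changes are still penalized in full.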
3) Hybrid-quantum PPO: Inspired by the quantum advantage shown in Refs. [13]-[15], [27]-[30], this section investigates the utility of a hybrid-quantum neural network as a policy model for RL. Fig. 5(b) illustrates the specific policy network within a hybrid MLP model. In the quantum-classical network, the output of the classical neural network is used as input to a parametrized quantum circuit (PQC). PQCs are quantum circuits that use parametrized quantum gates [31], [32], such as Pauli rotations, to encode data, $x$, and trainable parameters, $\theta$. The four-qubit PQC used in this work consists of three main components: variational, data encoding, and measurement. The variational layer comprises a layer of Pauli-X rotations encoding four trainable parameters and ring-shaped CNOTs for entanglement. The data encoding layer embeds four features in parallel using Pauli-Z rotations, and the measurement layer comprises two Z-basis measurements on the first two qubits. In sequence, four qubits were initialized in the ground state, and then a variational layer with randomly initialized parameters was appended, followed [...]. Finally, the measurement layer was included to produce two classical real-valued outputs.
Fig. 8(a) shows the improvement in practice: the hybrid quantum neural network reaches, in only 8k steps, a reward that remains inaccessible to a classical network of the same complexity even after 110k environment steps. Notably, these two networks were trained five times with varying initialization points, and the plot in Fig. 8(a) only shows the best runs of the hybrid and the classical models. The reader might notice that the hybrid model starts at a higher mean reward than the classical one, a behaviour observed elsewhere in the literature [13]. Additionally, the quick ascent aligns with the findings of Ref. [33], suggesting that quantum models can generalize from a few data points.
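The building blocks of such a PQC can be simulated classically for four qubits with a few lines of pure Python. The sketch below is an illustration under stated assumptions, not the circuit used in this work: the layer order (variational, encoding, variational, measurement) is an assumption, and the second variational layer is added here because Pauli-Z rotations placed directly before a Z-basis measurement would not change the measured expectation values.

```python
import cmath
import math

N_QUBITS = 4


def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q (bit q of the basis-state index)."""
    (g00, g01), (g10, g11) = gate
    out = list(state)
    for i in range(len(state)):
        if not (i >> q) & 1:
            j = i | (1 << q)
            out[i] = g00 * state[i] + g01 * state[j]
            out[j] = g10 * state[i] + g11 * state[j]
    return out


def rx(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return ((c, -1j * s), (-1j * s, c))


def rz(phi):
    return ((cmath.exp(-1j * phi / 2), 0), (0, cmath.exp(1j * phi / 2)))


def cnot(state, control, target):
    """Flip the target bit of every basis state whose control bit is 1."""
    out = list(state)
    for i in range(len(state)):
        if (i >> control) & 1:
            out[i ^ (1 << target)] = state[i]
    return out


def variational_layer(state, thetas):
    for q, theta in enumerate(thetas):
        state = apply_1q(state, rx(theta), q)  # trainable Pauli-X rotations
    for q in range(N_QUBITS):
        state = cnot(state, q, (q + 1) % N_QUBITS)  # ring of CNOTs
    return state


def expect_z(state, q):
    return sum(abs(a) ** 2 * (1 - 2 * ((i >> q) & 1)) for i, a in enumerate(state))


def run_pqc(theta_a, x, theta_b):
    """Assumed order: variational, data encoding, variational, measurement."""
    state = [0j] * (2 ** N_QUBITS)
    state[0] = 1 + 0j  # ground state |0000>
    state = variational_layer(state, theta_a)
    for q, feature in enumerate(x):
        state = apply_1q(state, rz(feature), q)  # parallel Pauli-Z encoding
    state = variational_layer(state, theta_b)
    return [expect_z(state, 0), expect_z(state, 1)]  # two real-valued outputs
```

Calling `run_pqc` with all angles set to zero leaves the register in the ground state, so both outputs equal 1; in a hybrid network the two outputs would be passed on to the classical part of the policy model.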
4) Hybrid AlphaZero: The best results are expected from picking the best candidate for each part of the solution pipeline. Specifically, it was shown earlier in Sec. III-D that reinforcement learning is a powerful approach whose training and solution quality can be boosted through hybridization with quantum models. This section presents a hybridized AlphaZero [16].
The classical AlphaZero is a reinforcement learning (RL) algorithm developed by DeepMind that showed promise in solving difficult RL problems, such as chess, shogi, and Go [16]. AlphaZero uses a computational tree of environment states whose values and probabilities are determined from a neural network. The foundation of this model is akin to a Monte Carlo Tree Search (MCTS) model, which is used primarily for path prediction problems and strategy board games [34], [35]. The four major components of MCTS are selection, expansion, simulation, and backpropagation. The model begins at the tree's root and selects optimal child nodes until it reaches a leaf. Once it reaches the leaf, it expands the tree, creating a new child node. Then, the model simulates the remainder of the path-finding process from the newly created node and finally backpropagates the newfound information to update the statistics stored in the tree. The hybrid AlphaZero model uses MCTS with a parametrized quantum circuit as its policy network. The MLP agent of the model first recommends an action for the current state. Then the action is applied to the environment, generating a reward and an updated state. However, in this case, the environment is the MCTS model; the recommended action is simulated in the model. Then, the backpropagation step of the MCTS is used to update the weights within the tree itself to refine its recommendations.
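The four MCTS steps named above (selection, expansion, simulation, backpropagation) can be illustrated with a generic, purely classical sketch. It is not the hybrid AlphaZero of this work: the toy environment (a fixed-length sequence of binary actions, rewarded by the fraction of ones), the UCB1 selection rule, the random rollouts, and all names are assumptions chosen for illustration. In AlphaZero-style methods, the random rollout and the uniform exploration term would be replaced by the value and policy outputs of the network.

```python
import math
import random

ACTIONS = (0, 1)  # hypothetical toy action set
HORIZON = 4       # hypothetical episode length


class Node:
    def __init__(self, state, parent=None):
        self.state = state    # tuple of actions taken so far
        self.parent = parent
        self.children = {}    # action -> Node
        self.visits = 0
        self.value_sum = 0.0


def is_terminal(state):
    return len(state) == HORIZON


def reward(state):
    # Toy reward: fraction of ones in the finished action sequence.
    return sum(state) / HORIZON


def untried(node):
    return [a for a in ACTIONS if a not in node.children]


def ucb(child, c):
    # UCB1 score: mean value plus an exploration bonus.
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(child.parent.visits) / child.visits)


def select(node, c):
    # Selection: descend through fully expanded nodes by best UCB score.
    while not is_terminal(node.state) and not untried(node):
        node = max(node.children.values(), key=lambda ch: ucb(ch, c))
    return node


def expand(node, rng):
    # Expansion: add one new child for an untried action.
    if is_terminal(node.state):
        return node
    action = rng.choice(untried(node))
    child = Node(node.state + (action,), parent=node)
    node.children[action] = child
    return child


def simulate(state, rng):
    # Simulation: random rollout to the end of the episode.
    while not is_terminal(state):
        state = state + (rng.choice(ACTIONS),)
    return reward(state)


def backpropagate(node, value):
    # Backpropagation: update visit counts and value sums up to the root.
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent


def mcts(n_iter, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(n_iter):
        leaf = select(root, c)
        child = expand(leaf, rng)
        value = simulate(child.state, rng)
        backpropagate(child, value)
    return root
```

After calling root = mcts(200), a common choice is to recommend the root action whose child has the most visits, for example max(root.children, key=lambda a: root.children[a].visits).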
Alongside this inner loop, the algorithm runs an outer loop that is primarily concerned with optimizing the loss values. Once the agent recommends an action, the outer loop takes training examples of a certain size and simulates the results over these batches to calculate the policy loss, value loss, and total loss. Once these values are calculated, the neural network weights are adjusted and fed back into the original loop, where the cycle repeats until the loss values are optimized. Fig. 7 illustrates the inner workings of the hybrid AlphaZero model employed in this paper. The state variables are passed into an initial encoding network with 2 hidden layers of 256 and 32 neurons. The information is then passed in parallel [36] to 1) a single neuron using a fully-connected layer and 2) a policy network implemented as a PQC. The PQC resembles the one explained and implemented in the PPO model in Sec. III-D3. Fig. 8(b) shows the training performance of this model (best out of five tries), which achieves a completion rate of π 1 = 98.5%, the highest π 1 completion rate on the dataset with 2000 requests and 2 satellites among all models explored in this paper. The completion rates of the highest-performing algorithms from each section on this dataset are displayed in Fig. III.
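The policy, value, and total losses mentioned above are not written out in this section. As a hedged illustration, the sketch below uses the form commonly associated with AlphaZero: a cross-entropy between the search-derived target probabilities and the predicted policy, plus a mean squared error between the predicted value and the observed return. The equal weighting of the two terms, the absence of a regularization term, and the function name alphazero_losses are assumptions.

```python
import numpy as np


def alphazero_losses(policy_probs, target_probs, values, target_values, eps=1e-12):
    """Return (policy_loss, value_loss, total_loss) for one batch.

    policy_probs, target_probs: arrays of shape (batch, n_actions)
    values, target_values:      arrays of shape (batch,)
    Assumed form: cross-entropy plus mean squared error, equally weighted.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    values = np.asarray(values, dtype=float)
    target_values = np.asarray(target_values, dtype=float)

    # Cross-entropy between search targets and predicted action probabilities.
    policy_loss = -np.mean(np.sum(target_probs * np.log(policy_probs + eps), axis=1))
    # Mean squared error between predicted values and observed returns.
    value_loss = np.mean((values - target_values) ** 2)
    return policy_loss, value_loss, policy_loss + value_loss
```

In the architecture described above, policy_probs would come from the PQC policy head and values from the single-neuron value head; gradients of the total loss would then update both the classical and the quantum parameters.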

IV. DISCUSSION
This work provided two classes of solutions to the scheduling problem of satellite mission planning: optimization and hybridized reinforcement learning. From each class, the best-performing candidates were the integer optimization model and the hybrid AlphaZero, respectively. The dataset with 2 satellites and 2000 requests was used as a performance benchmark for the algorithms presented in this work. The optimization and hybrid AlphaZero algorithms achieved π 1 completion rates of 98.1% and 98.5%, respectively, while the greedy algorithm only exhibited a 78.5% (63.6% with k-means clustering) completion rate. In the single satellite model, the optimization algorithm reached 100% completion on π 1 , π 2 , and π 3 requests and 96.2% completion on π 4 requests in 6 minutes. This work showed that by using reinforcement learning and optimization models, it is possible to improve the results of mission planning that are otherwise obtained through simple greedy models. This work presents a step towards creating quantum-enhanced solutions in the space industry.

Fig. 9. The flowchart used to implement the greedy algorithm in Sec. III-B. The time-incrementing block references the "open DTO windows" query. Without accessible DTOs within a cluster, satellites remain idle under the greedy algorithm, with the simulation advancing until availability arises. Upon addressing all requests, satellites transition to the subsequent cluster and replicate the procedure. Note that the connection between one-second time increments and new clusters is implicit rather than explicit.