DeepQPrep: Neural Network Augmented Search for Quantum State Preparation

There is increasing interest in quantum computing, but developing quantum algorithms remains difficult. Neural Network augmented search algorithms have proven quite successful for general search problems (such as program generation), but current approaches to quantum program generation make very restricted use of them. In this paper we present DeepQPrep, a Neural Network based approach to generating quantum circuits for state preparation: a common yet expensive task needed in many applications of quantum computing. We illustrate that Neural Network augmented search algorithms hold significant promise for automated quantum program generation; in a supervised machine learning context, DeepQPrep-generated programs solved 99% and 76.9% of 20000 previously unseen state preparation tasks within two different application scenarios. The circuits produced by DeepQPrep are also shallower (on average) than their ground truth counterparts. We also compare DeepQPrep to IBM Qiskit's approach to state preparation and illustrate that, even when constrained, DeepQPrep generates significantly shallower circuits, although Qiskit solves more of the state preparation tasks. Based on our results, we argue that neural network augmented search approaches exhibit significant promise for generalised quantum program induction, warranting further study in more complex scenarios.


I. INTRODUCTION
As improvements in classical silicon based computing reach their limits, it is necessary to look into alternative approaches. One promising area is Quantum Computing, which makes use of quantum effects, like entanglement and superposition, to solve some problems faster than their classical counterparts. However, applying Quantum Computing involves overcoming various challenges: the design of quantum algorithms requires practitioners with experience in a wide range of fields, such as basic quantum physics as well as programming and algorithm design. At the same time, just bringing a high-level description of a quantum algorithm onto a real device has a multitude of challenges associated with it. We need to create a quantum program that represents the desired algorithm for the task at hand. This program needs to be tailored to the physical system, which involves a number of additional considerations. First, applying any quantum program to classical data requires the computation of a quantum state representation of this data and preparing the Quantum Computer (QC) with this state. Second, qubit restrictions (e.g. the number of qubits available and the connectivity pattern) need to be considered, as well as the degree of noise in each operation. Third, there may be specific restrictions or constraints on the design space of the quantum program (e.g. the available operations) which need to be considered. Designing these programs and algorithms by hand is a very challenging task. While current approaches can partially automate this process, they are slow and in most cases computationally infeasible (especially as the number of qubits increases). In contrast, recent results in machine learning augmented search techniques applied to complex search problems have proven very successful, yielding superhuman performance on very challenging search problems such as the game of Go [1].
To address these challenges, we propose DeepQPrep, a neural network augmented search algorithm for the automatic creation of quantum programs. To illustrate the viability of neural network-based search we explore the problem of automatic quantum state preparation; in this regard, DeepQPrep is applied as a means of quantum program synthesis. There has been some work in the area of quantum program synthesis, yet current approaches do not scale well as the number of qubits increases [2], can take a long time to compute when the state needs to be approximated with a high degree of accuracy [3], and are often not able to generalise to high(er) cardinality problem sets. State preparation is therefore a suitable initial task for machine learning based quantum program generation as it is quite challenging, has high practical relevance, and there are still contributions to be made in its automation.
In order to apply machine learning based methods to quantum program synthesis, there is a need for benchmark training and evaluation datasets. We are not aware of any (large-scale) dataset specifically tailored to experimenting with quantum state preparation that lends itself well to a machine learning pipeline. Therefore, to facilitate this work, and to act as a future reference benchmark [4] for quantum state preparation tasks, we generated two datasets of 5.7 million and 7.2 million quantum target states aligned to two different similarity metrics, and we provide corresponding (OpenQASM [5]) representations. The first dataset uses fidelity, a very common metric in Quantum Computing, as the measure of similarity between quantum states. Fidelity disregards the phase components of the states, which in many cases can be safely ignored. For applications where the phase is necessary, the second dataset uses a minimal similarity metric that considers all parts of the state (including the phase) and remains stable when the number of qubits changes.
Leveraging these datasets, we train DeepQPrep to generate quantum programs that produce a (previously unseen) provided target quantum state. Our results illustrate that generalisable techniques for quantum program synthesis have significant potential. DeepQPrep was able to generate up to 99% of the unseen target states under the fidelity similarity and 76.9% under the more complex minimal similarity metric. It was also able to produce programs that are ''simpler'' than the initial program used to generate the target state. This illustrates that generalisable machine learning augmented approaches to quantum program synthesis can improve upon their initial training data, and that this data does not need to be completely optimised for the learning task. Conversely, existing approaches tend to approximate the target states step by step, thereby generating very deep circuits; in contrast, DeepQPrep creates a fairly shallow circuit without the need for iterative improvements. The source code of the approach and its evaluation can be found in the accompanying GitHub repository. In proposing DeepQPrep, we aim to make the following contributions to the quantum program synthesis literature:
• evidence that quantum program synthesis can be applied to solve a wide range of quantum state preparation problems;
• a reference dataset for benchmarking research into machine learning augmented quantum state preparation approaches;
• a demonstration that machine learning augmented approaches for quantum state preparation can also be applied to improve the structure of quantum programs used for state preparation.
This article is structured as follows: section II provides a general overview of key topics in the article, i.e., quantum programs, state preparation, and program synthesis. In section III, we discuss key related work to situate DeepQPrep. In section IV we describe the architecture of DeepQPrep and discuss key design choices. We evaluate DeepQPrep in section V and compare its results to Qiskit [6] (as a gold standard in the area); we also compare against Qiskit when specific instruction set restrictions are in place, using the Solovay-Kitaev algorithm [3]. Finally, in section VI we conclude the paper and discuss future research directions.

II. BACKGROUND
As this article spans several domains, in this section we provide a brief overview of the key areas of the literature and introduce terminology used throughout the article. Namely, we introduce quantum circuits (as quantum programs) in subsection II-A, the general problem of quantum state preparation in subsection II-B, and program synthesis in subsection II-C, where we also provide more formal details of the representation of quantum circuits from the perspective of program induction.

A. QUANTUM CIRCUITS
A common model of Quantum Computation is the quantum circuit. Quantum circuits are a formal description of the quantum operations that are to be applied to a fixed number of qubits. They are inspired by, and defined similarly to, Boolean circuits. A quantum circuit is specified as a directed acyclic graph (DAG), where a sequence of operations on a qubit is represented as nodes on a path [7]. A DAG is a graph with directed edges and without any directed cycles. The depth of a quantum circuit is defined by the number of gates on the longest path through the DAG of the circuit. Qubits are the quantum extension of the concept of a bit; however, unlike a classical bit, a qubit may be in a complex state that may be modified during computation and is only resolved to bits on measurement. The measurement is a probabilistic process that depends on the complex state.
One way of representing the state of a quantum computer is the statevector, a deterministic, complex-valued column vector |ψ⟩ of dimension 2^{n_q} (where n_q is the number of qubits). A statevector's 2^{n_q} elements correspond to the 2^{n_q} possible bitstrings bin(i) of length n_q, where i is an integer with 0 ≤ i < 2^{n_q}, that can be obtained as a result of measuring the quantum system. The probability of the system being in state bin(i) on measurement is ⟨ψ|M_i|ψ⟩, where M_i is a 2^{n_q} × 2^{n_q} matrix with all elements 0 except M_i(i, i) = 1. Here we use the common bra-ket notation, where the bra ⟨ψ| is a row vector and the ket |ψ⟩ is a column vector. Example gates are the X gate, which performs an operation analogous to a NOT operation in the computational basis (z-basis) used in this work, and more complex operations like the CNOT gate, which is applied to two qubits to potentially entangle them by applying an operation analogous to an XOR. A typical form of storing and describing quantum circuits is OpenQASM, a programming language designed for describing quantum circuits. When creating a quantum circuit representing a specific quantum algorithm, we need to take into account that often only a specific (sub)set of gates is available or supported. These gates might also be error prone, as noisy intermediate-scale quantum (NISQ) era [8] QCs tend to introduce errors when applying gates to the qubits. Lastly, the connectivity, and therefore reliable entanglement between qubits, tends to be highly restricted on real devices. This means that quantum circuits must be tailored to the constraints of the quantum computer to yield the best results. Thus, tools for automatically optimizing circuits are extremely important, especially to minimize the depth and therefore the accumulation of errors.
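As a concrete illustration of these definitions, the following minimal numpy sketch (our own, not taken from any DeepQPrep code) computes the measurement probabilities ⟨ψ|M_i|ψ⟩ = |ψ_i|² for a small example state:

```python
# Minimal numpy sketch: measurement probabilities of a statevector.
# For |psi> of dimension 2^n_q, P(bin(i)) = <psi| M_i |psi> = |psi_i|^2.
import numpy as np

n_q = 2
psi = np.zeros(2**n_q, dtype=complex)
psi[0] = psi[3] = 1 / np.sqrt(2)    # Bell state (|00> + |11>) / sqrt(2)

probs = np.abs(psi) ** 2            # probability of each bitstring
for i, p in enumerate(probs):
    print(f"P({i:0{n_q}b}) = {p:.3f}")   # P(00) = 0.500, P(11) = 0.500
```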

B. QUANTUM STATE PREPARATION
To apply a specific quantum algorithm to an input, the quantum computer must be brought into a specific quantum state that represents the input. This problem is known as quantum state preparation, and an efficient solution is necessary when one wishes to apply quantum algorithms to classical or quantum inputs. More formally, the goal of quantum state preparation is to find a quantum circuit f_prep (or a control sequence for the physical system implementing the QC) so that, when applied to the ground state |g⟩ of the QC, it recreates the target quantum state |t⟩ as accurately as possible, as defined by a similarity metric sim. This leads to the following state preparation problem: find f_prep* = argmax_{f_prep} sim(f_prep |g⟩, |t⟩).
The choice of the similarity function for comparing quantum states depends strongly on the application. A common choice is the fidelity F(|ψ⟩, |φ⟩) for two pure quantum states |ψ⟩ and |φ⟩, defined as F(|ψ⟩, |φ⟩) = |⟨ψ|φ⟩|². The fidelity only takes differences between the amplitudes of the quantum states into account, but not the phase. To also take the phase difference between the two quantum states into account, another similarity metric is necessary. One such metric, which we call the minimal similarity metric, first normalizes the values in the statevectors and then computes a difference between them. The normalization is done by first concatenating the real Re(·) and imaginary Im(·) components of a statevector |ψ⟩ into a real vector s = [Re(|ψ⟩); Im(|ψ⟩)] and then dividing by the absolute maximum value, max_j |s(j)|.
Two statevectors with normalized values s_1 and s_2 are then compared using the maximum of the absolute difference: Sim(s_1, s_2) = 1 − max_j |s_1(j) − s_2(j)|. Sim(s_1, s_2) will be called the minimal similarity henceforth. By normalizing the values and taking the maximum instead of the mean, the difference is kept stable independent of the number of qubits the system has.
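To make the two metrics concrete, the following short numpy sketch implements them as reconstructed above; the exact formulas are our reading of the text (the extraction lost the original equations), not code from the paper:

```python
# Sketch of the two similarity metrics, under our reconstruction:
# fidelity F = |<psi|phi>|^2 for pure states, and the minimal similarity
# Sim = 1 - max_j |s1(j) - s2(j)| over max-normalized [Re; Im] vectors.
import numpy as np

def fidelity(psi, phi):
    return np.abs(np.vdot(psi, phi)) ** 2    # vdot conjugates psi: <psi|phi>

def normalize(psi):
    s = np.concatenate([psi.real, psi.imag])
    return s / np.max(np.abs(s))             # divide by absolute maximum

def minimal_similarity(psi, phi):
    s1, s2 = normalize(psi), normalize(phi)
    return 1.0 - np.max(np.abs(s1 - s2))
```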

C. PROGRAM SYNTHESIS
Program synthesis concerns itself with the creation of a program that is able to reproduce a set of expected outputs given their inputs. The overall goal is to find a program that is able to generalize to new examples. It is quite similar to machine learning, and could be considered a subset of it. The main distinction is the goal of creating a computer program consisting of instructions from a library of instructions, instead of more mathematical models like artificial neural networks, as is typical in machine learning. Common approaches to program synthesis are various search techniques like Monte Carlo search [9], evolutionary algorithms [10] or constraint satisfaction solvers [11]. An overview of approaches to program synthesis can be found in [12]. DeepQPrep stores quantum circuits in a data structure that closely resembles the standard quantum circuit definition, which is a DAG. Basis gate types (X, CNOT, etc.) are assigned to nodes. Incoming (and outgoing) edges are assigned to nodes as an n-tuple with a particular order, as required by multi-qubit gates such as CNOT. In our circuits the source nodes (no incoming edges) are the initialization gates, and the sink nodes (no outgoing edges) are the output or measurement gates.
During circuit construction, our implementation temporarily adds a special type of edge to the circuit which does not connect to a vertex in the graph; we refer to these as leaf edges.
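The following illustrative Python sketch (ours; all names are hypothetical) mirrors this data structure: nodes carry gate types with ordered edge tuples, and leaf edges are open edges with no source vertex yet:

```python
# Illustrative data structure mirroring the circuit DAG described above.
from dataclasses import dataclass

@dataclass
class Node:
    gate: str                       # e.g. "X", "CNOT", "init", "output"
    in_edges: tuple = ()            # ordered: order matters for multi-qubit gates
    out_edges: tuple = ()

@dataclass
class Edge:
    source: "Node | None" = None    # None while the edge is still a leaf edge
    target: "Node | None" = None

    @property
    def is_leaf(self) -> bool:
        # Circuits are built back-to-front, so the open end is the source side.
        return self.source is None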

III. RELATED WORK
In order to perform state preparation, a quantum program, often represented as a circuit, is used. Thus, the task to arbitrarily perform state preparation becomes a search for
''appropriate'' or ''good'' quantum programs that yield a specific output, or quantum state, when executed on a QC. A common approach to state preparation is the use of various decompositions, like the cosine-sine decomposition in [13]. Reference [14] proposes an efficient circuit design scheme that allows the use of universal decompositions, and [2] makes use of a divide and conquer approach to state preparation. Reference [15] focuses on the preparation of sparse target states. In contrast, DeepQPrep focuses on using machine learning for quantum state preparation as an initial step towards the more general problem of quantum program synthesis. To this end, it uses a restricted gate set without any parameterized gates: parameterized gates would require additional optimization steps, making it harder to attribute the results to DeepQPrep itself. This work therefore focuses on the overall selection of gates and not on their parameterization.
Search techniques for finding programs given specific input-output pairs have been explored in the area of program synthesis, resulting in different research directions and approaches that could be leveraged for state preparation. One approach is to define parametric programs using an expressive language, compile them into constraints, and solve these using a SAT solver [16], as is done in Sketch [11], [17] and [18]. An alternative is top-down program synthesis, first introduced in [19] and [20]. Here, a program is synthesized by starting at an output and subsequently filling required inputs using deterministic and probabilistic search techniques.
Probabilistic search techniques can be extended with deep neural networks that predict the probability of an operation appearing in the final program based on an encoding of the input-output examples, as was done in [21]. This can be improved further by not predicting the probabilities of each operation globally for the whole program, but by viewing the synthesis process as a sequence-to-sequence generation process using attention over the input-output examples [22], or by directly encoding the tree structure of the program using recurrent neural networks [23]. By introducing a wake-sleep cycle containing the abstraction of sub-programs into new functional parts, as was done in DreamCoder [24], training and search can be improved further by reducing the depth complexity of the search space. While previous work has used the optimization of variational surrogates or supervised learning, other work has used Reinforcement Learning algorithms (e.g. [25]) for program synthesis. This can be beneficial as it does not require ground truth programs to be available. Reinforcement learning can be augmented with a memory containing the best solutions already found for the examples [26] to prevent catastrophic forgetting. Approaches used for neural architecture search [27] can be applied to program synthesis as well, by formulating it as a differentiable process [28].
While the previous approaches were applied to various domains (and are discussed here to give a brief overview of the program synthesis literature), there are approaches directly aimed at quantum program/circuit synthesis. A natural representation for quantum circuits is not a tree, as is the case for classical programs, but a Directed Acyclic Graph (DAG) [29]. The input and output pairs for a quantum program synthesis algorithm can be unitary matrices or high-level language descriptions. Building a quantum circuit out of elementary gates to implement a target unitary matrix can be done using a combination of A* search [30] and numerical optimization [31]. As A* requires a good heuristic, an alternative is the use of Householder transformations [32] to decompose the target unitary into sub-matrices, which are then represented by elementary gates. There is evidence that first decomposing the target unitary into smaller unitaries, by creating fixed-position and variable-position unitaries with numerical optimization, improves the synthesis further, as was done in QFAST [33]. After a sufficient number of decompositions, various synthesis algorithms can be used to solve the resulting local synthesis problems. While A* search and decomposition-based techniques are quite orthogonal to each other, alternative search approaches can be taken as well. The quantum computing framework Qiskit [6] performs quantum state preparation using cosine-sine decompositions based on [13]. While the cosine-sine decomposition might not be the best approach, it is widely used through Qiskit, making it a good baseline for comparison.
Reinforcement Learning algorithms like Advantage Actor Critic (A2C) [34] and Proximal Policy Optimization (PPO) [35] can be used for state preparation [36] by generating a circuit that brings the QC into the target state while minimizing a specified cost function, such as one based on fidelity. Besides A* search and Reinforcement Learning, Bayesian optimization has merit as well, as was shown in [37]. Here, quantum circuits were generated based on high-level language descriptions of the target, which are translated to input-output pairs. A combination of top-down and bottom-up search can also be employed to get the benefits of both [38].
It is important to stress that our intention in this article is to evidence the potential of machine learning augmented search for quantum program synthesis. Therefore, the approach to state preparation and the gate set we use are selected with machine learning in mind. Other approaches available for state preparation use a larger gate set, but require an additional optimisation routine to parameterise or otherwise optimise parts of the circuit. This would not make sense at this stage of our work, because it would be too difficult to differentiate between a good optimisation process and a good ''base'' circuit that is then optimised.

IV. QUANTUM CIRCUIT SEARCH HEURISTIC
The goal of DeepQPrep is to generate a circuit given the target quantum state it should create when applied to the ground state. DeepQPrep is composed of three different parts, as shown in Figure 1.

FIGURE 1. High-level overview of the approach illustrating the three main stages: 1) generating a (benchmark) dataset; 2) training the model components end-to-end; and 3) evaluating the approach with an unseen test set.

The first part establishes a dataset by pairing circuits, sampled uniformly at random, with the statevector each circuit induces on an initial state (the ground state). The dataset contains the shortest circuit sampled for each unique statevector. Such a dataset can be used for researching and benchmarking new machine learning based approaches to circuit-based quantum state preparation. Two different datasets were created, aimed at different similarity metrics between quantum statevectors: one uses the fidelity, and the second uses a metric based on the minimal similarity, thereby incorporating the phase and not just the amplitude as is the case for the fidelity.
The second part uses this dataset of target statevectors and the circuits that created them to train a NN heuristic that predicts the next operation(s) to be added to the circuit. It does this by maximizing the probability of selecting the next gate from the target circuit. During training, the target state is encoded using the State Encoder and the edges making up the DAG of the target circuit are encoded using the Edge Encoder. The predictions of the next operation to connect to the leaf edges are then made using the respective edge encodings and the encodings of the state. During training the whole graph is available directly, whereas during search the edge encoding for each leaf edge of a new operation is computed when the operation is added to the graph, and then contributes to predicting the next operation. Finally, in the third step, DeepQPrep is applied to new, unseen statevectors by repeatedly adding gates to a partial circuit, starting with a circuit containing only the output operation, and returning the m most probable complete circuits. Each of these m circuits is executed to find which one best matches the target state; this circuit is returned as the final answer.

A. DATA CREATION AND PREPARATION
We generate and make available two datasets to facilitate reproducibility and to compare DeepQPrep with other circuit-based quantum state preparation methods. We hope that these datasets will be used by other researchers to train or benchmark their state preparation methods. The datasets are lists of state-circuit pairs, where each circuit produces the target state when applied to the ground state. The difference between the two datasets is the similarity metric used to decide whether two circuits generate the same state. The first dataset is for tasks focused on finding circuits maximizing the fidelity between the target statevector and the statevector of the found circuits (i.e. state phase is ignored). The second dataset is for tasks where it is important that the found circuits create the target state faithfully, including the phase.
The datasets were created by uniform random sampling of circuits and computing the state each generates when applied to an initial state. The uniform random sampling was done by performing a randomized search on an initially empty circuit. At each step, a gate or the initialize operation is sampled uniformly from the set of possible operations and added to the circuit, with the target qubits also selected uniformly from those not yet finished. This is done from the back of the circuit to the beginning, similar to the search in section IV-C. Besides quantum gates, an initialize operation is used to flag a qubit as finished. When all qubits are finished, or the search reaches a target depth, the circuit is returned and evaluated (see the sketch below). As multiple circuits can create the same state, only the circuit with the lowest depth is included in the dataset. Two statevectors |ψ⟩ and |φ⟩ are considered to be the same if Sim(|ψ⟩, |φ⟩) > 0.999 or F(|ψ⟩, |φ⟩) > 0.999, to account for numerical precision. To determine whether a new circuit provides a lower depth for a state, each newly generated quantum state must be compared to all quantum states already in the dataset, by computing the respective similarity metric for both datasets and finding the maximum. Thus, as the number of sampled circuits increases, the probability of having a low-depth circuit for a given target state increases. Circuits for states that are only compared by fidelity tend to be shallower, i.e. have lower depth: since the fidelity depends only on the state amplitudes and not the phase, each fidelity-equivalent target corresponds to many distinct states under the minimal similarity metric, and there is therefore a higher chance of finding a shallower circuit for it.
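A simplified sketch of this sampling loop, using Qiskit, might look as follows; it is our own reconstruction (building the circuit front-to-back for brevity rather than back-to-front) and not the authors' implementation:

```python
# Hedged sketch of the dataset sampling loop. An "init" draw simply marks a
# qubit as finished; the circuit is returned once all qubits are finished or
# the target depth is reached.
import random
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

ONE_QUBIT_GATES = ["x", "y", "z", "h", "t", "tdg", "s", "sdg"]

def sample_circuit(n_qubits: int, max_depth: int = 20) -> QuantumCircuit:
    qc = QuantumCircuit(n_qubits)
    open_qubits = list(range(n_qubits))          # qubits not yet "finished"
    while open_qubits and qc.depth() < max_depth:
        op = random.choice(ONE_QUBIT_GATES + ["cx", "init"])
        if op == "init":                         # flag a qubit as finished
            open_qubits.remove(random.choice(open_qubits))
        elif op == "cx":
            if len(open_qubits) >= 2:
                ctrl, tgt = random.sample(open_qubits, 2)
                qc.cx(ctrl, tgt)
        else:
            getattr(qc, op)(random.choice(open_qubits))
    return qc

qc = sample_circuit(3)
state = Statevector.from_instruction(qc)         # state induced on |0...0>
```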
Our datasets use the basis gate set X, Y, Z, H, T, T†, S, S†, and CX. The chosen set of gates is universal [3], and by including T† and S† some states can be created using shallower circuits. While our gate set is universal, it requires deep circuits to represent arbitrary rotations. In practice it would be preferable to add parameterized gates to the basis gate set; however, they require another layer of optimization. In this work we investigate the effectiveness of DeepQPrep, which does not support parameterized gates, and so we exclude them from our gate set.
Overall, 5.7 and 7.2 million examples were generated for the fidelity and minimal similarity datasets respectively. The mean depths are 14.1 and 18.3, and the mean total numbers of gates are 28.5 and 34.7. The number of qubits for the target states ranges from 2 to 5, which makes it likely that the datasets contain examples of fairly high complexity, while at the same time enabling the sampling of a large set of possible target states and corresponding circuits. During sampling of the circuits, the number of qubits was distributed evenly. Since it is more likely to find distinct states for more qubits, given the search setting, there is an imbalance in the dataset: it contains more high-qubit circuits overall. The datasets contain the ground state with the identity circuit, but also states computed by components of Grover's algorithm [39]. These datasets (the first reference datasets for state preparation that the authors are aware of) are provided to train state preparation models for a wide range of state preparation tasks and needs, as well as to compare and contrast new and existing approaches [4]. The statistics of the training dataset can be found in Table 1.

B. NEURAL NETWORK HEURISTIC TRAINING
The Neural Network Heuristic component of DeepQPrep (see Figure 1, middle, and Figure 3) learns how to compose a quantum circuit gate by gate, evolving the system from the base state to a given target state, as can be seen in Figure 2. The datasets from subsection IV-A act as training data. To facilitate learning, the neural network needs to be able to do three things. First, it needs a way to make use of the target statevector; this is achieved with a State Encoder neural network NN_state. Second, it needs a way to use the DAG of the partial circuit; this is achieved with an Edge Encoder neural network NN_edge. Third, it needs to be able to efficiently combine the results of the previous two steps; this is achieved with a Predictor neural network NN_pred that combines the results of both previous parts to predict the probability of each gate for each qubit. While all three parts are trained end to end, a distinction is made between them as they have quite different architectures and different patterns of application during search (i.e. the third step in Figure 1). All three components, the edge encoder, state encoder and predictor, are necessary to achieve the best results and are visually summarised in Figure 3.
The State Encoder (see Figure 3) takes as input a quantum statevector of size 2^n and transforms the complex-valued statevector into a real-valued tensor of shape n × N_features. This compressed state encoding is a more suitable representation for a neural network, as neural networks commonly act on real-valued tensors of a specific size, or a sequence thereof.
The encoder concatenates the real and imaginary parts of the statevector in an additional dimension, resulting in a real-valued tensor of shape 2^n × 2; for example, the statevector (0.1+0.9i, 0.4−0.3i)ᵀ would become ((0.1, 0.9), (0.4, −0.3)). Given a statevector with r elements, we create a tensor P of size r where P_i = bin(i), 0 ≤ i < r. The quantum state values and the bitstrings P are passed to the state qubit encoder; the intention is to give the neural network a link between statevector values and qubit position. It creates a quantum state bitstring encoding for each qubit and each position in the statevector, an n × 2^n tensor. While a positional encoding as in [40] could have been used as well, we found in practice that the proposed approach works better as the number of qubits increases. As an example, for a 3-qubit state the encoding features are reordered so that the respective qubit (e.g. qubit 2 or 3) becomes the first qubit in the bitstring.

FIGURE 3. Neural Network Heuristic Architecture, comprised of the State Encoder (captures the information concerning the target state), the Edge Encoder (captures information from the current partial circuit) and the Predictor (the supervised learner that predicts the next operation and the qubit(s) it is applied to).
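The input preparation described above might be sketched as follows (our own illustration; the shapes follow the text):

```python
# Sketch of the state encoder's input preparation: the complex statevector
# becomes a real tensor of shape (2^n, 2), paired with the bitstring bin(i)
# of every statevector position.
import numpy as np

def encoder_inputs(psi):
    n = int(np.log2(len(psi)))
    values = np.stack([psi.real, psi.imag], axis=-1)           # (2^n, 2)
    bits = np.array([[(i >> (n - 1 - b)) & 1 for b in range(n)]
                     for i in range(len(psi))])                # (2^n, n)
    return values, bits

psi = np.array([0.1 + 0.9j, 0.4 - 0.3j])   # the example from the text (n = 1)
values, bits = encoder_inputs(psi)
# values == [[0.1, 0.9], [0.4, -0.3]], bits == [[0], [1]]
```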
The first feed-forward layer of the state encoder creates an initial encoding of the quantum state values. The bit embedding creates an embedding vector for each bit in the bitstring (using something akin to word2vec [41]), in order to create a parameterised feed-forward layer without a bias tensor. This is followed by a bidirectional Gated Recurrent Unit (GRU) that computes an encoding over all bit embeddings. By using a bidirectional GRU, the encoding for each bit incorporates data from the whole bitstring, which ensures that the individual bit encodings are sufficiently different. While Transformer layers could have been used to combine the data, they essentially have a bias to attend to parts of the input, whereas for the bitstring encoding all parts are of equal importance; this can be thought of as a character-based word embedding, which is often done using bidirectional RNNs [44]. The result is combined with the quantum state values by applying a linear layer with a tanh activation and without any bias, thereby creating a fixed-scale result. This is followed by another feed-forward layer combining the encoding of the quantum state values and the previous result, giving the network another chance to incorporate data from the statevector elements.
The output of the quantum state qubit encoder is passed to the permute operation, which changes the order of the quantum statevector encoding for each qubit so that the statevector values are ordered as if the specific qubit were the first qubit in the bitstring (as in the example above). The purpose of this permutation is to allow different results for different qubits, as the other parts of the model only act on two adjacent elements of the statevector; by reordering the statevector data for each qubit, the neural network creates multiple different encodings. The permuted encoding runs through a pass-through layer (akin to a skip connection, as discussed in [45]) that is used to improve the gradient flow. A mean operation is then applied to the permutation to reduce the size of the resulting encoding to be consistent with the number of qubits.

The produced encoding g_enc,0 captures information concerning the combined state, i.e. a tensor containing information from the whole statevector for each qubit; as it is a function of the whole statevector, it is called the Combined State Encoding. Its purposes are twofold. First, after creation it is only updated by addition, g_enc,i+1 = g_enc,i + f_upd(g_enc,i, s_enc,i), with s_enc,i being the encoding of the state at step i with dimension n × 2^{n−i−1} × l_emb, where l_emb is the size of the embedding space; this improves the gradient signal even when the number of iterations in the following merger layer is high. Second, it allows the iterations in the merger layer to access information from all parts of the statevector, s_enc,i+1 = f_merg(g_enc,i, s_enc,i). It is passed in conjunction with the permuted quantum statevector to generate the final output in the merger layer. The merger layer iteratively takes the combined state and the output of the permutation, reduces the size of the state dimension by two in each iteration, and updates the combined state additively. After n iterations it returns the combined state and a vector whose state dimension has been reduced to one. The combined state update is computed using a tanh-attention mechanism. Both results are combined in the last feed-forward layer, creating the final encoding.
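The following PyTorch sketch (ours) illustrates the merger-layer recursion; the concrete linear layers are stand-ins for the paper's tanh-attention mechanism, whose exact form is not reproduced here:

```python
# Hedged sketch of the merger-layer recursion: each iteration halves the
# statevector dimension of s and updates the combined state g additively,
# g_{i+1} = g_i + f_upd(g_i, s_i). f_merg/f_upd are illustrative stand-ins.
import torch
import torch.nn as nn

class MergerLayer(nn.Module):
    def __init__(self, l_emb: int):
        super().__init__()
        self.f_merg = nn.Linear(3 * l_emb, l_emb)   # merges adjacent pairs + g
        self.f_upd = nn.Linear(2 * l_emb, l_emb)

    def forward(self, g, s):
        # g: (n, l_emb); s: (n, d, l_emb) with d a power of two
        while s.shape[1] > 1:
            pairs = s.reshape(s.shape[0], s.shape[1] // 2, -1)  # adjacent pairs
            g_b = g.unsqueeze(1).expand(-1, pairs.shape[1], -1)
            s = torch.tanh(self.f_merg(torch.cat([pairs, g_b], dim=-1)))
            g = g + self.f_upd(torch.cat([g, s.mean(dim=1)], dim=-1))  # additive
        return g, s.squeeze(1)      # combined state, state dim reduced to one
```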
The Edge Encoder Neural Network computes encodings for edges and nodes (input, output and operations). A circuit is represented as a DAG and each edge is transformed into a tensor. The purpose of the edge encoder is to transform (i.e. encode) a quantum circuit into a representation that is suitable for a Neural Network to process.
The edge encoder processes the edges of the circuit from the end (i.e. the node of the output operation) until the initialise operation is reached, thus encoding the DAG structure. Each in-going and out-going edge of each operation has a specific ID assigned to it, which is used to retrieve an embedding vector from a set of vectors containing encodings for all in- and out-going edges of all operations. The embedding vectors are similar to the ones used in [41] for word embeddings. This enables the edge encoder to differentiate between the different inputs and outputs of each operation. The encodings of the edges going into a node (operation) are concatenated with the respective operation input embeddings and passed to a GRU cell. The GRU cell keeps the scale of the data fixed independent of the number of applications of the edge encoder; this is important because the number of applications of the same layer depends on the depth of the circuit. The outputs of the GRU cell for all input edges are combined by calculating a mean, yielding the node (operation) encoding. By using a mean instead of applying the cell recurrently, the computation is more efficient as the edges do not need to be processed in sequence. The node encoding is then concatenated with the embedding of the output operation ID(s) and passed to a GRU cell to compute the output edge encoding(s). The embeddings of the input and output IDs are generated separately using distinct embedding matrices. These encodings can then be used to predict the probability of each operation potentially prepended to them. After a new operation is added to the graph, the edge encoder is applied to compute the encodings of its output edges.
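A minimal PyTorch sketch of one edge-encoder step might look as follows; it is our own reading of the description (e.g. feeding the ID embedding as the GRU input and the edge encoding as its hidden state), not the authors' code:

```python
# Hedged sketch of one edge-encoder step: embeddings for the operation's
# input/output edge IDs, a GRU cell per input edge, a mean over input-edge
# outputs to form the node encoding, and a second GRU cell producing each
# outgoing edge encoding.
import torch
import torch.nn as nn

class EdgeEncoderStep(nn.Module):
    def __init__(self, n_edge_ids: int, dim: int):
        super().__init__()
        self.in_emb = nn.Embedding(n_edge_ids, dim)    # distinct matrices for
        self.out_emb = nn.Embedding(n_edge_ids, dim)   # input vs output IDs
        self.in_cell = nn.GRUCell(dim, dim)
        self.out_cell = nn.GRUCell(dim, dim)

    def forward(self, in_edge_encs, in_ids, out_ids):
        # in_edge_encs: (k_in, dim) encodings of edges entering the node
        h = self.in_cell(self.in_emb(in_ids), in_edge_encs)
        node_enc = h.mean(dim=0)                       # node (operation) encoding
        node_b = node_enc.unsqueeze(0).repeat(len(out_ids), 1)
        return self.out_cell(self.out_emb(out_ids), node_b)  # one per out edge
```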
The Predictor neural network processes leaf edges in order to prepend one operation onto them. It does this by applying a transformer to the set of all leaf edges and their corresponding target state encodings. Its purpose is to predict operations inductively and in an iterative manner, such that all leaf edges eventually have an initialise operation prepended to them.
For the leaf edges, the quantum state encoding of the respective qubits is concatenated with their edge encodings. This is passed to a feed-forward layer followed by a transformer layer (transformer layers, proposed in [40], are applied to a sequence of inputs to generate a new sequence of outputs of the same length using an attention mechanism). The transformer allows the Predictor to focus on (attend to) information from important parts of the input. The result is passed into two different feed-forward layers. One is used to predict the probability of each edge having a specific operation added to it. The second feed-forward path performs a sum over the edge dimension to reduce the n per-edge tensors to one tensor for all edges. The final feed-forward layer predicts a probability table over operations, capturing the likelihood that each operation could be prepended to the current edge, i.e. a conditional probability distribution of operations for the current edge. This enables the prediction of the probability that specific operations could be prepended to the overall graph.
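A hedged PyTorch sketch of the Predictor as described (ours; the transformer hyperparameters follow the settings reported in section V):

```python
# Hedged sketch of the Predictor: leaf-edge encodings concatenated with the
# matching qubit state encodings, a transformer over the set of leaf edges,
# and two heads producing per-edge and global operation probabilities.
import torch
import torch.nn as nn

class Predictor(nn.Module):
    def __init__(self, dim: int, n_ops: int, n_heads: int = 8):
        super().__init__()
        self.ff_in = nn.Linear(2 * dim, dim)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=256, dropout=0.1)
        self.edge_head = nn.Linear(dim, n_ops)    # per-leaf-edge operation logits
        self.global_head = nn.Linear(dim, n_ops)  # logits after summing over edges

    def forward(self, edge_encs, state_encs):
        # edge_encs, state_encs: (n_leaf_edges, dim) each
        x = self.ff_in(torch.cat([edge_encs, state_encs], dim=-1))
        x = self.transformer(x.unsqueeze(1)).squeeze(1)   # edges as a sequence
        edge_probs = self.edge_head(x).softmax(dim=-1)    # which op per edge
        global_probs = self.global_head(x.sum(dim=0)).softmax(dim=-1)
        return edge_probs, global_probs
```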

C. EVALUATION AND EXPLOITATION
The final stage of DeepQPrep is the exploitation of the trained neural network heuristic, i.e. supplying it with a target statevector such that it (ideally) emits a circuit that can accurately reproduce that statevector. We note that there are multiple circuits that can reproduce the same statevector, and that some can be evaluated as ''better'' than others depending on the design objectives. Thus, we can not only evaluate how well a circuit reproduces the statevector, but also consider other aspects of its composition; for example, its depth or whether specific operations (gates) are used. To generate quantum circuits, DeepQPrep takes a number of target quantum statevectors. For each statevector, the state encoder encodes it, and new circuits are generated that contain only the output node and a set of leaf edges, one for each qubit. To instrument the search process, i.e. to generate circuits, the edge encoder and the predictor are invoked in an iterative manner to produce a set of candidate circuits to be evaluated.
The search is only performed during evaluation, as the training of the neural network heuristic makes use of the ground truth data, i.e. it is an instance of supervised learning operating on a circuit known to produce a specific statevector. The search algorithm uses the neural network heuristic to predict the probability of each next operation (gate) that can be added to a partial circuit. Besides gate operations, the initialize operation is used to signal that no more gates will be added to a qubit. This also illustrates that the circuit generation process, like the training process, prepends operations from the end (an output operation) to the start of the circuit (an initialize operation).
The search commences by filling a priority queue with N partial circuits containing only the output nodes. Priority is defined by the probability emitted by the predictor for the last operation that was prepended (the output node has a probability of 1, which avoids a cold start problem). The search algorithm selects the partial circuit with the highest probability, then samples one operation according to the predicted operation probability distribution and prepends it to the current partial circuit. This new partial circuit is added back into the priority queue; in the next iteration the most probable partial circuit is selected again, and the process repeats. When a circuit is complete, i.e. has no leaf edges, it is added to the result set, and when this result set reaches a pre-specified size (potentially smaller than N) the algorithm terminates (see the sketch below). This is therefore an instance of stochastic search with a bias towards more likely circuits. The bias towards more likely circuits is important to give the heuristic the most influence on the results, thereby also potentially minimizing the number of circuits that need to be evaluated. We also explored a greedy strategy that always prepends the most likely operation (like beam search), yet this was found to perform worse: that is, it was less likely to produce circuits that accurately reproduce the target statevector.
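The search loop can be sketched as follows; `make_output_only_circuit`, `predict_ops`, `is_complete` and `prepend` are hypothetical helpers standing in for the components described above:

```python
# Simplified sketch of the stochastic best-first search. heapq is a min-heap,
# so probabilities are stored negated; a counter breaks ties in the heap.
import heapq
import itertools
import random

def search(target_state, predict_ops, n_qubits, n_results=500, queue_size=1000):
    tie = itertools.count()
    # Start with N partial circuits holding only the output node (p = 1,
    # avoiding the cold start problem).
    queue = [(-1.0, next(tie), make_output_only_circuit(n_qubits))
             for _ in range(queue_size)]
    heapq.heapify(queue)
    results = []
    while queue and len(results) < n_results:
        neg_p, _, circuit = heapq.heappop(queue)
        if circuit.is_complete():                  # no leaf edges remain
            results.append(circuit)
            continue
        ops, probs = predict_ops(target_state, circuit)       # NN heuristic
        i = random.choices(range(len(ops)), weights=probs)[0] # sample, not argmax
        heapq.heappush(queue, (-probs[i], next(tie), circuit.prepend(ops[i])))
    return results   # execute each result; keep the best match to target_state
```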

V. EVALUATION
We show that Neural Network augmented search techniques are a viable approach for generating quantum circuits. We are more concerned in this article with showing that the model is able to produce circuits than with out-performing current state-of-the-art approaches for state preparation: our goal is to illustrate that quantum circuit synthesis via machine learning can be generalised to arbitrary circuit production given any target statevector. As such, we structure the evaluation of DeepQPrep as follows. First, in section V-A we show DeepQPrep's ability to generate circuits that accurately reproduce the target statevector (termed solved problems) according to different metrics of similarity, i.e. fidelity (DeepQPrep solves 99%) and minimal similarity (DeepQPrep solves 76.9%), as discussed in subsection IV-A. This corresponds to a fairly standard machine learning evaluation approach. Second, we compare the composition of the generated circuits to the composition of the circuits that originally produced the target statevectors in subsection V-B (DeepQPrep generates shallower circuits than the ground truth circuits). Finally, we compare DeepQPrep to Qiskit's own state-of-the-art approach for state preparation in subsection V-C. This has several goals: 1) to show that the test and training sets DeepQPrep is being applied to are reasonable; 2) to show how the performance compares to Qiskit in terms of solving the state preparation problem and circuit composition; and 3) to show how DeepQPrep has increased flexibility over Qiskit in terms of the circuit design constraints that we have imposed. Qiskit, and its state preparation algorithm, is widely used in the field of quantum computing, and so is a good comparison target representing an approach commonly used in practice.
We use a simple holdout sampling strategy common in supervised machine learning. We isolate a portion of the dataset generated in section IV-A, creating a held-out set of test examples. The examples are ground truth state preparation circuits: we know their input (the ground state), their output (2^n complex numbers, where n is the number of qubits), the operations (gates) used, and the depth as a quantitative measure of circuit complexity (shallower circuits have lower complexity, deeper circuits higher). These circuits were chosen uniformly from the set of all possible state-circuit pairs. We derive the performance of DeepQPrep by characterising how many target states are exactly reproduced and the depth of the producing circuits (lower is better). An exact match is important, as a circuit that ''almost'' reproduces the state can be as unhelpful as one that completely misrepresents the target state. Performance therefore corresponds to the percentage of solved test examples, but we also report the number of different circuits that need to be evaluated and how the depth and number of qubits influence the result. Our evaluation is performed in simulation using IBM's Qiskit toolkit [6], and training is performed on a high performance compute cluster.
The Neural Networks are trained end to end using the negative log likelihood between the target operation and the probability returned by the neural network for it; the same loss is applied for the leaf edges, using the ground truth edges for that operation and their predicted probabilities. We used the Adam optimizer with a learning rate between 0.0001 and 0.0005, as it has proven successful in a wide range of applications. The parameters were found using randomized grid search. The reported results were achieved with α = 0.0005 and β_1 = 0.5. The depth of the sampled circuits can exceed the expected optimal depth [46], which is 6.4 for 5 qubits; this allows the dataset to contain examples that are not optimal in depth. 55% of the examples have a depth smaller than 6, 45% have a depth greater than or equal to 6, and 27% have a depth higher than 15. This shows that the example set contains a large range of examples of varying difficulty, as the number of decisions that need to be made is quite high. The search maintains a priority queue of N = 1000 potential circuits sorted by probability. Each batch of size 64 contains only examples with the same number of qubits, in order to make its computation efficient; while this might influence the performance of the stochastic gradient, in practice the effect was found to be negligible. The neural networks used an embedding size of 64; the decoder used a single transformer layer with 8 attention heads, dropout of 0.1 and a feed-forward size of 256. These parameters were found to work best by performing a randomized search over the set of possible parameters. The same hyperparameter settings were used for training on both the fidelity and the minimal similarity based dataset, resulting in the two models DeepQPrep_Fidelity and DeepQPrep_Min_Sim respectively. All results are reported with respect to these two trained models.
During evaluation, the most probable 500 circuits, as predicted by the heuristic neural networks, were taken and applied to the QC initial state. Of these 500, the circuit generating the statevector with the highest similarity to the target statevector is chosen as the output; in the case of a tie, the shallower circuit is chosen. For the final evaluation, 20000 target states were evaluated in total. Section V-A depicts the training behaviour of DeepQPrep when evaluated on the validation set during training, with an analysis covering how many circuits need to be evaluated to solve a specific percentage of the test dataset. Subsection V-B presents an analysis of the properties of the created circuits using the test dataset. Finally, subsection V-C compares the results generated by DeepQPrep with results created by Qiskit and the Solovay-Kitaev algorithm, using the validation and test datasets.

FIGURE 4. Solved examples on the validation dataset (y-axis) against the number of training epochs (x-axis); note that the two lines are evaluated on two different validation datasets. Each line indicates a single representative instance of our NN trained on the fidelity dataset (blue) and the minimal similarity dataset (orange) respectively; the horizontal lines represent the results of the untrained models, which are very close to uniform.

Figure 4 shows the results for both datasets. The neural networks were trained for 45 epochs on the fidelity dataset and 120 epochs on the minimal similarity dataset. Training on the fidelity dataset was stopped after 45 epochs, as the NN had converged and no further changes occurred. The figure shows that the number of solved circuits in the validation dataset for the fidelity metric starts at roughly 39% before training and reaches 99% of all circuits after a few epochs. The results for the minimal similarity dataset start at 0% solved examples and reach 60% solved circuits after 5 epochs; from there they slowly rise further, up to 78.6% of examples solved after roughly 60 epochs. This clearly shows that learning took place in both cases, and that more circuits are solved as training progresses. Overall, learning to predict circuits for the fidelity dataset is faster, as it converges after a few epochs, whereas the models trained on the minimal similarity dataset take roughly 60 epochs to converge; this makes sense as the metric is stricter. Figure 5 shows the number of different circuits evaluated for each target state of the test set (x-axis) necessary to solve a given percentage of them (y-axis) after training. For the fidelity dataset, roughly 78.5% of target state examples were solved taking only 10 sampled circuits into account. Taking the 50 most probable circuits into account, more than 90% of the circuits were solved; taking the best 100 results into account improved the number of solved circuits by an additional 9%, to 99%. The minimal similarity dataset, on the other hand, requires 60 different circuits to be evaluated to solve 50% of the circuits. After 100 evaluations, roughly 58% of circuits were solved.
This rises further until 76.9% of circuits are solved after 500 evaluations (not shown in Figure 5). The main difference between the fidelity and minimal similarity datasets is that the circuits for the minimal similarity dataset contain more operations, making the problem a lot harder.

B. OBSERVATIONS ON CIRCUIT COMPOSITION
The violin plot in Figure 6 depicts the depth differences between the best solutions for the solved examples, the corresponding ground truth circuits, and the circuits found by Qiskit (discussed in subsection V-C), for the minimal similarity dataset.

FIGURE 7. A histogram of the relative number of times a gate was used in each circuit, for the ground truth circuits of the test set (blue) and the found result circuits (orange). The found results are produced by the previously discussed NN after training for 120 epochs on the minimal similarity dataset.

From the distribution of depths of the ground truth circuits, it becomes clear that there are quite a few deep circuits in the test set, as the mean is 18.6 with a standard deviation of 7.1. The reason is that the dataset was generated by randomly sampling circuits, and there are states that can only be reached by applying many T and H gates. Additionally, the violin plot in Figure 6 shows that more circuits have a lower depth in the found solution set compared to the ground truth set: the mean is 10.5 and the standard deviation 6.5. The majority of circuits found by the NN augmented search have a depth of 10 or below, and the number of circuits with higher depth decreases drastically. A similar, though less pronounced, result also occurs for the fidelity dataset (mean solved = 0.99, mean depth = 4.4, standard deviation = 3.5). This shows that the depth of the solutions found by DeepQPrep clearly outperforms the depth of the ground truth circuits, even though this was not incentivised during training, and that the training data is clearly not optimal with regard to depth. The training was performed to maximize the probability of the ground truth circuits using a cross entropy loss. This is significant: state preparation is a humanly difficult task, but ''optimal'' examples are not needed to train the model.
A look at Figure 7 shows that the relative number of gates used is quite similar between the ground truth circuits and the found circuits. The main difference is that the found solutions contain relatively more H gates and slightly fewer of all other gates, except the Tdg gate, which was used more often. The found solutions contain 19.3 operations on average, compared to 32.1 operations in the ground truth dataset. This means that the NN augmented search is able to find solutions with fewer gates overall, discarding gates that are not needed for the specific problem at hand.
Figure 8 depicts a heatmap of the minimal similarity as a function of the depth of the ground truth circuits (x-axis) and the number of qubits (y-axis). The white rectangles represent spaces without any examples, which is mainly the case for examples with a low number of qubits and low depth. This occurs because the depth required to bring a QC into a specific state is lower for smaller numbers of qubits, and therefore it is more likely that the data generation process finds a shallower solution. The heatmap shows that the similarity drops as the number of qubits and the depth increase. This is probably due to the fact that these results require more decisions, as a higher depth and a higher number of qubits increase the overall number of operations necessary to reach the state. Figure 8 also shows a lower similarity for examples with only 2 qubits. The main reason is probably that the training dataset, like the test dataset, contains fewer examples for 2 qubits: there is a higher chance of finding circuits generating the same state for two qubits than for more qubits when sampling at random. Table 2 captures the distribution of circuits against qubits in the evaluation test set, illustrating the differences due to the higher chance of finding distinct states for more qubits given a maximal circuit depth and no parameterized gates.

C. COMPARISON WITH QISKIT
We compare the circuits generated by DeepQPrep against those generated by the state preparation functionality of Qiskit. The Qiskit library implements many circuit generating algorithms. We compare against the default circuit transpiler, which allows parameterized rotation gates that are not available to DeepQPrep. We also compare against the Qiskit implementation of the Solovay-Kitaev algorithm [3], a standard technique for approximating arbitrary quantum states using a finite set of fixed quantum gates. For this baseline, Qiskit first generates a circuit with parameterized gates, then applies the Solovay-Kitaev transform using the same set of gates as DeepQPrep (X, Y, Z, H, T, T†, S, S†, CX). All states generated by these circuits are evaluated with respect to the target quantum state using the minimal similarity metric, since all the methods attempt to match the target state phase. Figure 9 shows the distribution of the minimal similarity for the different approaches on the validation dataset. The results are only reported for the validation dataset, as the Solovay-Kitaev algorithm takes too long to run on the 20000 examples of the test dataset; this is reasonable as the overall results for Qiskit and DeepQPrep look the same on the test dataset (not shown here). The figure shows that Qiskit is able to solve all of the provided statevectors, achieving a mean minimal similarity of 1.0. DeepQPrep, in contrast, achieves a mean similarity of roughly 0.92, which is still higher than the best result achieved with the Solovay-Kitaev algorithm, which reaches only a similarity of 0.8 with a recursion depth of 7. The results of the Solovay-Kitaev algorithm only get worse as the recursion depth is decreased, as expected: at recursion depths of 5 and 6 it achieves a mean similarity of 0.58 and 0.76 respectively. This shows that the target states are difficult to create when only a non-parameterised gate set is available, but become easier when parameterized gates are available, as is the case for Qiskit.
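For reference, the two Qiskit baselines can be reproduced along the following lines; this is our own sketch assuming current Qiskit APIs (`prepare_state`, `transpile`, and the `SolovayKitaev` transpiler pass), not the exact evaluation script used in the paper:

```python
# Hedged sketch of the Qiskit baselines: default state preparation with
# parameterized rotations, optionally rewritten onto the fixed gate set via
# the Solovay-Kitaev transpiler pass.
from qiskit import QuantumCircuit, transpile
from qiskit.transpiler.passes import SolovayKitaev

def qiskit_prepare(target_state, n_qubits, use_sk=False, recursion=5):
    qc = QuantumCircuit(n_qubits)
    qc.prepare_state(target_state, range(n_qubits))  # default state preparation
    qc = transpile(qc, basis_gates=["u", "cx"])      # parameterized rotations
    if use_sk:
        # Approximate the 1-qubit rotations with a fixed, discrete gate set.
        qc = SolovayKitaev(recursion_degree=recursion)(qc)
    return qc
```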
An important aspect to note is that the execution time for one target state is much higher for the Qiskit implementation of the Solovay-Kitaev algorithm than for DeepQPrep: roughly 20 seconds on average for a recursion depth of 5, and 1640 seconds for a recursion depth of 7. In contrast, DeepQPrep takes only 2 seconds to generate one solution on the same hardware, and Qiskit takes less than a second for most examples. Training on 100000 examples for 120 epochs takes roughly 2 days without the use of GPUs on the same hardware. In principle, the computation time allocated to DeepQPrep could be increased to raise the probability of finding a solution. The solutions found by the Solovay-Kitaev algorithm are often only approximations with a very high depth, while DeepQPrep is able to find circuits creating the target state with a low depth. The circuits found with the Solovay-Kitaev algorithm have a mean depth of 32000 for a recursion depth of 5, growing further as the recursion depth is increased. In contrast, the mean depth of circuits found by DeepQPrep is 8.8, which is significantly lower.
The circuits produced by the Qiskit default transpiler are significantly deeper (mean depth 77.8) than the circuits that initially created the target states (mean depth 17.0; see Figure 6). While DeepQPrep produced circuits with a mean minimal similarity of 0.92, their mean depth was 10.5.

D. SUMMARY
We have shown that Neural Network augmented search can become quite proficient at generating quantum circuits for state preparation. DeepQPrep solved 99% of all unseen examples from the fidelity dataset and 76.9% from the minimal similarity dataset. At the same time, it produced circuits that are much shallower than the ground truth ones, and than other available options in the literature (Qiskit and the Solovay-Kitaev algorithm). It also generates circuits faster than the Solovay-Kitaev algorithm. This clearly marks the approach as a viable alternative to other approaches currently available in the literature, with the added benefit that it can cater for a wide range of applications and does not need to address individual state preparation problems specifically. For example, the state encoder neural network could be applied to multiple states and the results combined, thereby allowing the approach to generate circuits to be used as functions. This is especially poignant considering that the training set contained sub-optimal circuits.

VI. CONCLUSION
Our results (section V) show that the NN search heuristic has learned to create circuits that prepare a specific quantum state, and that the programs (circuits) produced are shallower on average than both the ground truth circuits used to evaluate the approach and the circuits generated by Qiskit. As the evaluation example circuits and training example circuits are generated by the exact same distribution, it is safe to assume that the training data is not optimal with regard to circuit depth, implying that the approach is robust to non-optimal training examples. Overall, the final mean fidelity is 0.99 and 99% of all evaluation examples were solved after training, meaning that a large portion of the examples was solved. For the minimal similarity based dataset, 76.9% of examples were solved; this may be due to the deeper circuits required to represent the full statevector, i.e. including phase. At the same time, it was shown that performance degrades as the number of qubits increases, and thus additional investigation is needed to explore scenarios with larger numbers of qubits. However, our results suggest that there is merit in a more structured investigation of how to leverage machine learning for generic quantum state preparation, and we provide a reference benchmark dataset to enable future endeavours in this respect. There is a big difference in how the approaches try to solve a given problem. Approaches like the cosine-sine decomposition on which Qiskit is built, or the Solovay-Kitaev algorithm, approximate the target in an iterative manner by evaluating the current approximation and progressively adding more operations, thus increasing circuit depth. DeepQPrep, in contrast, tries to find a circuit without ever re-evaluating it during search, only evaluating the candidate circuits the search returns. It therefore attempts to find a good solution in one go, making use of what it learned during training to ideally find an exact circuit.
Future work should aim at improving the ability of NN heuristics to scale to examples with higher numbers of qubits. Possible approaches could make use of techniques like automatic library discovery or hierarchical search. At the same time, it is important to explore how the results change when the training dataset contains ground truth circuits that are more optimal with regard to depth. Conversely, training without ground truth examples, using a continuously improving dataset of solved examples, should be explored as well, thereby also reducing the requirement of having datasets with high quality ground truth circuits. Additionally, while current approaches focus mainly on quantum state preparation, the same technique can be applied to more complex tasks where a circuit should represent a quantum function to be computed. Applications to state preparation tasks requiring the use of parameterized gates are important as well. We used simulated quantum programs for this work, and as such there are assumptions concerning circuit error which on a real QC could affect fidelity in different ways; on a real QC the state is also not directly accessible. However, using simulation also allows for the representation of error models and different ansätze. Moving forwards, it would be prudent to investigate how error affects the produced circuits. Creating circuits in an iterative fashion, passing the neural network the quantum state currently generated by the circuit, should also be investigated, as it would allow the neural network to more easily correct errors made in previous steps of the search.