Graph kernels encoding features of all subgraphs by quantum superposition

Graph kernels are often used in bioinformatics and network applications to measure the similarity between graphs and can therefore be used to construct efficient graph classifiers. Many graph kernels have been developed thus far, but to the best of our knowledge there is no existing graph kernel that considers all subgraphs to measure similarity. We propose a novel graph kernel that uses a quantum computer to measure graph similarity with all subgraphs taken into account, by fully exploiting the power of quantum superposition to encode every subgraph into a feature. For the construction of the quantum kernel, we develop an efficient protocol that removes the index information of subgraphs encoded in the quantum state. We also prove that the quantum computer requires less query complexity to construct the feature vector than the classical sampler used to approximate the same vector. A detailed numerical simulation of a bioinformatics problem is presented to demonstrate that, in many cases, the proposed quantum kernel achieves better classification accuracy than existing graph kernels.


Introduction
An effective measure of the similarity between graphs is necessary in several science and engineering fields, such as bioinformatics, chemistry, and social networking [1]. In machine learning, this measure is called the graph kernel, and it can be used to construct a classifier for graph data [2]. In particular, a kernel in which all subgraphs are fully encoded is desirable, because it can access the complete structural information of the graph. However, constructing such a kernel is known to be a nondeterministic polynomial time (NP)-hard problem [3]. Alternatively, a kernel that encodes partial features of all subgraphs may be used, but to the best of our knowledge, this type of method has not been developed thus far. Previous studies have instead focused on using different features originating from the target graphs, such as random walks [3,4,5], graphlet sampling [6], and shortest paths [7].
A quantum computer may be applied to construct a graph kernel that covers all subgraphs, because of its strong expressive power, which has been demonstrated in the quantum machine learning scenario [8,9]. More specifically, the exponentially large Hilbert space of quantum states may serve as an appropriate feature space where the kernel is induced [10,11,12]. This kernel can then be further utilized for machine learning. We now have two approaches: the implicit (hybrid) approach, in which the quantum computer computes the kernel and the classical computer uses it for machine learning, and the explicit approach, in which both the kernel computation and the machine learning part are executed by the quantum computer.
In fact, there exist a few proposals for quantum computational approaches to construct graph kernels. For example, Ref. [13] proposed using a Gaussian boson sampler, which estimates feature vectors by sampling the number of perfect matchings in the set of subgraphs. Other methods use quantum walks [14,15]. However, no existing graph kernel operating on a quantum circuit designs its features from all subgraphs.
In this study, we propose a quantum computing method to generate a graph kernel that extracts important features from all subgraphs. The point of this method is that the features of an exponential number of subgraphs can be effectively embedded into a quantum state in a Hilbert space using quantum computing. Note that a naive procedure immediately runs into a difficulty: the corresponding quantum state contains the index component, which may severely decrease the value of the kernel. As argued in [16], it is generally difficult to forget the index component, an operation we simply refer to as removing the index component; nonetheless, we propose a protocol to achieve this goal using a polynomial number of operations (i.e., query complexity) under a valid condition, which is fortunately satisfied by the features used in typical problems in bioinformatics. We then provide some concrete protocols to further compute the target kernel and discuss their query complexity. Also, we prove that they require fewer operations to generate the feature vector than a classical sampler used to approximate the same vector.
Hence, up to the difference in the senses of these complexities, the proposed protocols have a quantum advantage. Lastly, we use the above-mentioned typical bioinformatics problems to investigate whether a classifier based on the proposed quantum kernel achieves higher classification accuracy than existing graph classifiers.

Algorithm for graph kernel computation
We consider a graph characterized by the pair G = (V, E), where V = {v_1, v_2, ..., v_n} is an ordered set of n vertices, and E ⊆ V × V is a set of undirected edges. Hereafter, we use the notation n = |V|. Also, let d be the maximum degree of the graph G. In this study, G is assumed to be a simple undirected graph that does not contain self-loops or multiple edges.
The first step of our algorithm is to encode the graph information of G onto a quantum state defined on the composite Hilbert space H_index ⊗ H_feature. The index space H_index is composed of n qubits and identifies a subgraph by a set of vertices represented by a binary sequence x of length n. The feature space H_feature is composed of m qubits, whose states represent the feature information of the chosen subgraph x. The value of m depends on what feature is used. In this study, we consider the case m = O(log n), where the qubits represent the numbers of vertices, edges, and vertices with degree 1, 2, and 3; refer to the Toy Example section for a concrete example. Now, we assume an oracle operator that encodes the feature information of the subgraph identified by the index x into the function E(G, x) and then generates the feature state |E(G, x)⟩. Then, using the superposition principle of quantum mechanics, we can generate the quantum state containing the features of all subgraphs of G as follows:

|Ḡ⟩ = Σ_{x∈{0,1}^n} |x⟩ |E(G, x)⟩,   (1)

where the normalization coefficient is omitted to simplify the notation. Next, we aim to compute the similarity of two graphs G and G′. For this purpose, it seems that the inner product of |Ḡ⟩ and |Ḡ′⟩ may be used. Note, however, that when the two graphs have the same feature E(G, x_1) = E(G′, x_2) with different indices x_1 ≠ x_2, they should still contribute to the similarity of G and G′, whereas the inner product of |x_1⟩|E(G, x_1)⟩ and |x_2⟩|E(G′, x_2)⟩ is zero. Therefore, what we require is the state

|G⟩ = Σ_{x∈{0,1}^n} |E(G, x)⟩   (2)

instead of Eq. (1). However, an exponential number of operations is generally necessary to remove the index state [16]. The first contribution of our study is that our algorithm needs only a polynomial number of operations to obtain Eq. (2) from Eq. (1), under a condition that is satisfied by features useful for many graph classification problems.
The algorithm to remove the index state is described in terms of a general discrete function f as follows (see also Fig. 1). Let Y = {y_1, ..., y_a} be the range of f, and let X_k = {x | f(x) = y_k} be the set of indices mapped to y_k. The state generated via the oracle operation is rewritten as

2^{−n/2} Σ_{k=1}^{a} Σ_{x∈X_k} |x⟩ |y_k⟩.   (3)

Then, we apply H^{⊗n} ⊗ I to obtain

2^{−n} Σ_{k=1}^{a} |X_k| |0^n⟩ |y_k⟩ + (other terms),

where the "other terms" contain all quantum states with index states other than |0^n⟩. We now make a measurement in the computational basis of the index state; the post-selected feature state obtained when the measurement result is 0^n is

Σ_{k=1}^{a} |X_k| |y_k⟩   (4)

up to normalization, which is our target state. The probability of successfully obtaining this state is

P = 2^{−2n} Σ_{k=1}^{a} |X_k|².   (5)

To evaluate the efficiency of the proposed algorithm, we derive the minimum of the probability (5) with respect to the family of sets X = {X_1, ..., X_a} under the condition Σ_{k=1}^{a} |X_k| − 2^n = 0. Hence the problem is to find the stationary point of

L(X, λ) = 2^{−2n} Σ_{k=1}^{a} |X_k|² − λ (Σ_{k=1}^{a} |X_k| − 2^n),

where λ is the Lagrange multiplier. The stationary point |X_k| = 2^n/a minimizes Eq. (5); thus, Eq. (5) has the following lower bound:

P ≥ a^{−1}.   (6)

The above result is summarized as follows: Theorem 1. The quantum circuit depicted in Fig. 1 generates the quantum state (4) with a probability of at least a^{−1}.
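As a sanity check of this protocol and of Theorem 1, the post-selection step can be simulated classically for small n. The sketch below is a minimal simulation, with the Hamming weight of x as a hypothetical stand-in for the encoding function:

```python
import numpy as np
from itertools import product

def remove_index(f, n):
    """Simulate the index-removal protocol for a function f on n-bit strings.

    Builds |psi> = 2^{-n/2} sum_x |x>|f(x)>, applies H^{tensor n} to the
    index register, and post-selects the index on |0^n>.  Returns the range
    of f, the post-selected feature state, and the success probability."""
    xs = list(product([0, 1], repeat=n))
    ys = sorted({f(x) for x in xs})
    # After H^{tensor n}, the amplitude of |0^n>|y_k> is |X_k| / 2^n,
    # where X_k = {x : f(x) = y_k}.
    counts = np.array([sum(1 for x in xs if f(x) == y) for y in ys], float)
    amps = counts / 2**n
    p_success = float(np.sum(amps**2))
    feature_state = amps / np.linalg.norm(amps)
    return ys, feature_state, p_success

# Example: f is the Hamming weight of x with n = 4, so a = |Y| = 5.
ys, state, p = remove_index(lambda x: sum(x), 4)
assert p >= 1 / len(ys)   # Theorem 1: success probability at least 1/a
```

For n = 4 and the Hamming-weight feature, there are a = 5 distinct values and the success probability is 70/256 ≈ 0.27, above the 1/a = 0.2 bound of Theorem 1.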
Therefore, the quantum state (4) can be efficiently generated if a = |Y| is of the order of a polynomial in n. This desirable property holds for the encoding function E(G, x) that is used in the bioinformatics problem analyzed later in this paper. For a general statement of this fact, we introduce the following constraint on the range of the function: the sets Y_v, defined as the sets of feature values taken over subgraphs with v vertices, are assumed to satisfy

|Y_v| ≤ Pol(v^{c−1}),   (7)

where Pol(v^{c−1}) is a polynomial function with maximum degree c − 1. This condition leads to

|Y_v| ≤ βn^{c−1}, and hence a = |Y| ≤ Σ_{v=0}^{n} |Y_v| = O(n^c),   (8)

with some constant β. Then, Theorem 1 can be further refined as follows (the proof is given in the Supplementary Information): Theorem 2. Given the condition (7), the algorithm depicted in Fig. 1 generates the state (4) with a probability of at least Ω(√n/n^c).
In what follows, we assume that the encoding function E(G, x) satisfies the condition (7); then the desired state transformation (1) → (2) requires only an O(n^c/√n) mean query complexity. As a consequence, we are now able to efficiently compute the inner product

⟨G|G′⟩   (9)

as the similarity measure of the two graphs G and G′ (note that both |G⟩ and |G′⟩ are real vectors). The task of computing (9) is typically done via the swap test [17]. A diagram of the post-selection process from |0⟩|Ḡ⟩|Ḡ′⟩ to |0⟩|G⟩|G′⟩ is depicted in Fig. 2. The total mean query complexity required to prepare both |G⟩ and |G′⟩ and subsequently apply the swap test to compute the inner product (9) is

O(n^c/√n + n′^c/√n′) = O(n^c/√n),   (10)

where n and n′ are assumed to be of the same order. Note that Eq. (10) contains the constant overhead required to repeat the swap test circuit to compute the inner product with a fixed approximation error. The resulting inner product computed by the swap test is

K_BH(G, G′) = k_BH(G, G′) f_G^T f_{G′},   (11)

where f_G = [|X_1|, ..., |X_a|]^T is the column vector of feature counts and the coefficient k_BH(G, G′) is given by

k_BH(G, G′) = (f_G^T f_G · f_{G′}^T f_{G′})^{−1/2}.

Importantly, Eq. (11) is exactly the Bhattacharyya (BH) kernel [18,19], which has been successfully applied to image recognition [19] and text classification [20].
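Given the count vectors f_G, the inner product (11) reduces to a one-line classical computation. The sketch below takes k_BH to be the product of the inverse norms of f_G and f_G′, which is how the normalization of the post-selected states enters:

```python
import numpy as np

def bh_kernel(f_g, f_gp):
    """Inner product <G|G'> of the normalized feature states, Eq. (11).

    f_g[k] = |X_k| counts how many subgraphs of G map to the k-th feature
    value y_k (a common ordering of feature values is assumed)."""
    f_g, f_gp = np.asarray(f_g, float), np.asarray(f_gp, float)
    # k_BH normalizes by the norms of the unnormalized feature states.
    return float(f_g @ f_gp / (np.linalg.norm(f_g) * np.linalg.norm(f_gp)))

k = bh_kernel([1, 4, 6, 4, 1], [1, 4, 6, 4, 1])
assert abs(k - 1.0) < 1e-12   # identical count vectors give similarity 1
```

Graphs with disjoint feature ranges get kernel value 0, and a graph compared with itself gets 1, as expected of a normalized similarity measure.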
Figure 2: Quantum circuit to compute the inner product (9), which is composed of the oracles U and U′ to produce |Ḡ⟩ and |Ḡ′⟩; this is followed by the post-selection operation to obtain |G⟩|G′⟩ and the swap test. The probability of obtaining 0 as a result of the measurement on the first qubit is Pr(0) = (1 + |⟨G|G′⟩|²)/2, which enables us to estimate the inner product (9).
Alternative algorithm with switch test

Here, we show that the superposition

|0⟩|G⟩ + |1⟩|G′⟩   (12)

can also be used to compute the inner product (9); this eventually yields a kernel different from K_BH(G, G′), using the same order of queries as in the previous case. Similar to the previous case, the superposition (12) can be efficiently generated by the post-selection operation on

|0⟩|Ḡ⟩ + |1⟩|Ḡ′⟩   (13)

using the circuit depicted in Fig. 3. We first apply the controlled H^{⊗n} and H^{⊗n′} on the index state of Eq. (13) and then post-select the state when the measurement result on the index is 0^n. The formal statement of the result in terms of the general discrete functions f and g is given as follows (the proof is given in the Supplementary Information): Theorem 3. Let Y and Y′ be the ranges of f and g, respectively, and let a = |Y| and a′ = |Y′|. Thus, Y = {y_k}_{k=1}^{a} and Y′ = {y′_k}_{k=1}^{a′} with y_k and y′_k elements of Y and Y′, respectively. Also, let n ≥ n′. The quantum circuit depicted in Fig. 3 uses the oracle to prepare the state (13). Then, the final feature state of the circuit, post-selected when the measurement result on the index state is 0^n, is

|0⟩ Σ_{k=1}^{a} |X_k| |y_k⟩ + |1⟩ Σ_{k=1}^{a′} |X′_k| |y′_k⟩   (14)

up to normalization. The probability of obtaining the state (14) is at least (a^{−1} + a′^{−1})/2. As in the previous case, by imposing the encoding functions E(G, x) and E(G′, x) to satisfy the condition (7), the quantum circuit depicted in Fig. 3 transforms Eq. (13) into Eq. (12) with a probability of Ω(√n/n^c + √n′/n′^c) = Ω(√n/n^c), where n and n′ are assumed to be of the same order. The proof of this result is the same as that of Theorem 2.

Figure 4: Quantum circuit to compute the inner product (9), which is composed of the oracles U and U′ to produce |0⟩|Ḡ⟩ + |1⟩|Ḡ′⟩. This is followed by the post-selection operation to obtain |0⟩|G⟩ + |1⟩|G′⟩ and the switch test. Additionally, n ≥ n′ is assumed. The probability of obtaining 0 as a result of a measurement on the first qubit is Pr(0) = (1 + ⟨G|G′⟩)/2, which enables us to estimate the inner product (9).
We have now obtained the state |0⟩|G⟩ + |1⟩|G′⟩, which enables the application of the switch test [21,22] to compute the inner product (9). The circuit diagram, which contains the post-selection operation, is shown in Fig. 4. The total query complexity to compute the inner product (9) is O(n^c/√n). The resulting inner product computed through the switch test, which we call the SH kernel, is

K_SH(G, G′) = k_SH(G, G′) f_G^T f_{G′},   (15)

where, again, f_G = [|X_1|, ..., |X_a|]^T, and the coefficient k_SH(G, G′) is given by

k_SH(G, G′) = 2 (f_G^T f_G + f_{G′}^T f_{G′})^{−1}.

In the Supplementary Information, we prove that this is a positive semidefinite kernel. We also show there that K_SH(G, G′) ≤ K_BH(G, G′) holds, implying that the SH kernel may give a more conservative classification performance than the BH kernel. Note that the generalized T-Student kernel [23] has a similar form.
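For comparison, here is a minimal classical sketch of both kernels; the explicit form of k_SH below, 2/(f_G·f_G + f_G′·f_G′), is the reading consistent with Lemma 2 in the Supplementary Information, where the AM-GM inequality gives K_SH ≤ K_BH:

```python
import numpy as np

def bh_kernel(f_g, f_gp):
    """BH kernel of Eq. (11): normalized by the product of norms."""
    f_g, f_gp = np.asarray(f_g, float), np.asarray(f_gp, float)
    return float(f_g @ f_gp / (np.linalg.norm(f_g) * np.linalg.norm(f_gp)))

def sh_kernel(f_g, f_gp):
    """SH kernel of Eq. (15) with k_SH = 2/(f_G.f_G + f_G'.f_G');
    AM-GM on the two squared norms implies sh <= bh."""
    f_g, f_gp = np.asarray(f_g, float), np.asarray(f_gp, float)
    return float(2 * (f_g @ f_gp) / (f_g @ f_g + f_gp @ f_gp))

f1, f2 = [3, 2, 1, 0], [2, 2, 2, 1]
assert sh_kernel(f1, f2) <= bh_kernel(f1, f2) <= 1.0
```

The two values coincide exactly when the two count vectors have equal norm, in particular when G = G′.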
Improved algorithm with amplitude amplification

Figure 5: Quantum circuit with amplitude amplification to prepare the state (2) (upper) and (12) (lower). The circuit that is iterated to realize the amplitude amplification is enclosed in the dashed line.
Recall that we used post-selection on the state (3), rewritten as 2^{−n} Σ_k |X_k| |0^n⟩|y_k⟩ + (other terms), to probabilistically produce the first term, i.e., the target state (16). We can instead enhance the first term using the amplitude amplification (AA) operation [24,25] to deterministically produce the same state (16). The clear advantage of AA is that it requires only the square root of the number of operations needed by the previous repeat-until-success strategy. This means that the query complexity to transform Eq. (1) to Eq. (2) via AA is O(√a/n^{1/4}).
Similarly, transforming Eq. (13) to Eq. (12) via AA requires a query complexity of O( √ a/n 1/4 ). These circuits are depicted in Fig. 5; note that the measurement on the index state is not necessary.
The circuit that includes the swap test and AA to compute the inner product ⟨G|G′⟩ = K_BH(G, G′) is depicted in Fig. 6 (upper). The circuit length is O(√a/n^{1/4}). Also, the circuit containing the switch test and AA is depicted in Fig. 6 (lower); as in the case of the swap test, the circuit length is O(√a/n^{1/4}). In particular, if the encoding functions E(G, x) and E(G′, x) satisfy the condition (7), then we can specify a = O(n^c).
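The quadratic saving of AA can be illustrated numerically: if post-selection succeeds with probability p, the success amplitude sin θ = √p is rotated to nearly 1 after about π/(4√p) Grover iterations, instead of the roughly 1/p repetitions of repeat-until-success:

```python
import math

def aa_iterations(p):
    """Number of amplitude-amplification iterations needed to rotate the
    initial success amplitude sin(theta) = sqrt(p) close to 1:
    choose j so that (2j + 1) * theta is close to pi/2."""
    theta = math.asin(math.sqrt(p))
    return max(0, round(math.pi / (4 * theta) - 0.5))

p = 1e-4                      # e.g. a small index-removal success probability
j = aa_iterations(p)
amplified = math.sin((2 * j + 1) * math.asin(math.sqrt(p)))**2
assert amplified > 0.99       # near-deterministic after O(1/sqrt(p)) steps
assert j < 1 / p              # quadratically fewer steps than repetition
```

For p = 10^{-4}, 78 iterations already bring the success probability above 0.99, whereas the repeat-until-success strategy needs on the order of 10^4 repetitions.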

Figure 6: Quantum circuit to compute the inner product (9); it is composed of the oracles enhanced via AA, followed by the swap test (upper) and the switch test (lower).

Time complexity for specific encoding functions
Here, we discuss the time complexity, which takes into account the number of elementary operations contained in the oracle. We particularly investigate the following two encoding functions:

E_ve(G, x) = (#v, #e),   (17)
E_ved(G, x) = (#v, #e, #d1, #d2, #d3),   (18)

where #v is the number of vertices of the subgraph specified by x, and #e is the number of edges of x. Additionally, #dD (D ∈ {1, 2, 3}) denotes the number of vertices that have degree D. Then, from Lemmas 4, 5, and 6 given in the Supplementary Information, the time complexities required to calculate Eq. (17) and Eq. (18) are

O(n(log n)² + |E|(log |E|)²)

and

O(n(log n)² + |E|(log |E|)² + n((log n)² + d(log d)²)),

respectively. Recall that d is the maximum degree of the graph G. The cardinality a = |Y| of the range Y can be evaluated as a = O(n³) in the case of Eq. (17), and a = O(n⁶) in the case of Eq. (18). Note that Eqs. (17) and (18) satisfy the condition (7). Hence, the time complexity of the quantum algorithm, assisted by AA, is obtained by multiplying the oracle time complexity by the O(√a/n^{1/4}) AA iterations; with the worst-case values |E| = n² and d = n, this gives O(n^{3.25}(log n)²) in the case of Eq. (17) and O(n^{4.75}(log n)²) in the case of Eq. (18). In contrast, the time complexity of typical existing graph kernels is O(n³) for the random walk method [5] and O(n⁴) for the shortest path method [7]. Hence, our quantum computing approach for kernel computation has a time complexity comparable to that of the typical classical approach. However, note again that our kernel reflects features from all subgraphs, which are not covered by existing methods.

Toy example

We consider two simple toy graphs, depicted in Fig. 7, to demonstrate how to construct the corresponding quantum feature states |G⟩ and |G′⟩. The encoding function is chosen as E_ved(G, x), given in Eq. (18). Note that both graphs have 2³ = 8 induced subgraphs; thus, we need n = 3 index qubits to cover all subgraphs.
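The classical side of this construction, i.e., evaluating the encoding function over all 2^n induced subgraphs and collecting the count vector f_G, can be sketched as follows; the triangle graph used here is a hypothetical stand-in for the graphs of Fig. 7:

```python
from collections import Counter

def e_ved(edges, n, x):
    """E_ved(G, x) of Eq. (18): (#v, #e, #d1, #d2, #d3) for the induced
    subgraph whose vertex set is given by the bitmask x over n vertices."""
    verts = [i for i in range(n) if (x >> i) & 1]
    sub = [(u, v) for (u, v) in edges if (x >> u) & 1 and (x >> v) & 1]
    deg = Counter(u for e in sub for u in e)
    return (len(verts), len(sub),
            sum(deg[v] == 1 for v in verts),
            sum(deg[v] == 2 for v in verts),
            sum(deg[v] == 3 for v in verts))

def feature_counts(edges, n):
    """f_G: multiplicity |X_k| of each feature value y_k over all 2^n
    induced subgraphs."""
    return Counter(e_ved(edges, n, x) for x in range(2 ** n))

# A triangle on n = 3 vertices (an assumed stand-in for a toy graph).
f_G = feature_counts([(0, 1), (1, 2), (0, 2)], 3)
assert sum(f_G.values()) == 8   # all 2^3 induced subgraphs are covered
```

For the triangle, the eight subgraphs collapse to only four distinct feature values: the empty graph, three singletons, three single edges, and the triangle itself.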
Hence, by taking the normalization factors into account, the inner product (i.e., the similarity of the graphs) is calculated for both the swap test (the BH kernel) and the switch test (the SH kernel); both values are certainly bigger than the value ⟨Ḡ|Ḡ′⟩ = 0.75 computed via the naive method that keeps the index states.

Advantage over classical sampling method
In the classical case, an exponentially large resource is necessary to compute a feature vector corresponding to Eq. (2), which considers all subgraphs; however, we can utilize an efficient classical sampling method, used in [6], to approximate this feature vector. More specifically, we first sample S subgraphs from the entire graph G and then apply the encoding function E(G, x), which satisfies Eq. (7), on those subgraphs to approximate the feature vector. The kth component of this vector is given by

P̂(y_k) = (1/S) Σ_{i=1}^{S} 1(E(G, x_i) = y_k),

where x_S = {x_1, ..., x_S} is the set of indices that identify the sampled subgraphs. Also, 1(A) represents the indicator function, which takes 1 when the condition A is satisfied and 0 otherwise. Now, the kth component of the true feature vector is represented as

P(y_k) = |X_k| / 2^n.   (19)

We then have the following theorem (the proof is given in the Supplementary Information): Theorem 4. For given ε > 0 and δ > 0,

S = O((a − log δ) / (ε² log n))

samples suffice to ensure that

Pr(||P̂ − P||_1 ≤ ε) ≥ 1 − δ,   (20)

where ||·||_1 denotes the L1 norm. In particular, for constant ε and δ, we have S = O(a/log n).
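A minimal sketch of this classical baseline follows, with the popcount of the index as a hypothetical stand-in for E(G, x); the empirical L1 error shrinks with S as Theorem 4 predicts:

```python
import random
from collections import Counter
from math import comb

def sampled_feature_dist(feature, n, S, rng=random):
    """Empirical distribution P_hat over feature values, estimated from
    S uniformly sampled subgraph indices (the classical baseline)."""
    hist = Counter(feature(rng.getrandbits(n)) for _ in range(S))
    return {y: c / S for y, c in hist.items()}

# Toy feature: popcount of the index (a stand-in for E(G, x)).
n = 10
p_hat = sampled_feature_dist(lambda x: bin(x).count("1"), n, S=20000,
                             rng=random.Random(0))
# True distribution of Eq. (19): P(y_k) = |X_k| / 2^n, here binomial(n, 1/2).
p_true = {k: comb(n, k) / 2**n for k in range(n + 1)}
l1 = sum(abs(p_hat.get(k, 0.0) - p_true[k]) for k in p_true)
assert l1 < 0.1   # small L1 error with S much smaller than 2^n
```

With a = 11 feature values and S = 20000 samples, the L1 error is already at the percent level, far below what exhaustive enumeration of all 2^n subgraphs would require.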
Therefore, the sample complexity of this classical method is O(a/log n), whereas the query complexity of the proposed quantum algorithm is O(a/√n). Hence, up to the difference in the senses of these complexities, the proposed method has a clear computational advantage. Note that the inner product ⟨G|G′⟩ computed using the above classical sampling method differs from that computed in the quantum case, Eq. (11) or Eq. (15).

Numerical experiment
Here, we study the performance of classifiers constructed from the proposed quantum kernel, in comparison with some typical classical classifiers. The quantum kernel was calculated not by running the quantum algorithm but via direct calculation of the inner product (9), in an ideal noise-free environment on a GPU; for details, see the Code Availability section in the Supplementary Information. We calculate both the BH kernel (11) and the SH kernel (15) for the two encoding functions E_ve given in Eq. (17) and E_ved given in Eq. (18). We compare the quantum graph kernels to the following classical graph kernels: the random walk kernel (RW) [5], the graphlet sampling kernel (GS) [6], and the shortest path kernel (SP) [7]. These three classical kernels are computed using Python's GraKeL library [26].
We use the following benchmark datasets obtained from the repository of the Technical University of Dortmund [27]. For the binary classification problems, we used: AIDS (chemical compounds with or without evidence of anti-HIV activity [28]); BZR MD (dataset BZR of active or inactive benzodiazepine receptors [29], converted to complete graphs [30]); ER MD (dataset ER of active or inactive estrogen receptors [29], converted to complete graphs [30]); IMDB-BINARY (the movie genre, action or romance, based on co-starring relationships [31]); MUTAG (chemical compounds with or without mutagenicity [32]); and PTC FM (chemical compounds in the PTC dataset [33] that are carcinogenic or noncarcinogenic to female mice [30]). For the multi-class classification problems, we used the 15-class Fingerprint (fingerprint images converted to graphs and divided by type [34]) and the 3-class IMDB-MULTI (the movie genre, comedy, romance, or sci-fi, based on co-starring relationships [31]). Due to the limitation of the GPU memory, we took graphs with at most 28 vertices. As a result, (the number of used graphs)/(the total number of graphs) is 1774/2000 for AIDS, 296/306 for BZR MD, 398/446 for ER MD, 860/1000 for IMDB-BINARY, 188/188 for MUTAG, 331/349 for PTC FM, 2148/2148 (excluding graphs with fewer than 1 edge) for Fingerprint, and 1406/1500 for IMDB-MULTI. The necessary number of qubits is 28 + log 28 + log(28 × 27/2) ∼ 41 for the case of E_ve and 28 + log 28 + log(28 × 27/2) + 3 log 28 ∼ 56 for the case of E_ved.
We apply the C-support vector machine (SVM), implemented via Scikit-learn [35], to classify the datasets. To evaluate the classification performance, we calculate the mean test accuracy over 10 repeats of a double 10-fold cross-validation. In addition, we calculate the F-measure, which is used when the numbers of data in different classes are unbalanced; in fact, the numbers of data in the two classes are 63 and 125 for MUTAG and 400 and 1600 for AIDS. The SVM parameter C is taken from the discrete set {10^{−4}, 10^{−3}, ..., 10^{3}}, and the best model with respect to C is used to compute the classification performance. The results are summarized in Table 1. The table shows that, in many cases, the proposed quantum kernel achieves better classification accuracy than the classical kernels. Note that the AIDS dataset is sparse and the ER MD dataset is dense; the quantum kernels show better performance in both cases, implying that they are not significantly affected by the density of the graph dataset. In many cases, the two kernels BH and SH show similar performance, but there are some visible differences depending on the dataset; this may be due to the property K_SH(G, G′) ≤ K_BH(G, G′), which is proven in Lemma 2 in the Supplementary Information. Also, although E_ved has more features than E_ve, we do not observe from the table a clear superiority of the former over the latter in classification accuracy; we discuss this point further in the next section.
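The evaluation pipeline can be sketched with Scikit-learn's precomputed-kernel SVM. The Gram matrix below is a synthetic stand-in (randomly generated, not from actual graph data), and C is fixed rather than grid-searched, to keep the sketch minimal:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for the quantum Gram matrix: unit-norm feature vectors per graph,
# so K[i, j] = <G_i|G_j> is a valid precomputed kernel with unit diagonal.
F = np.abs(rng.normal(size=(60, 5))) + 2 * np.repeat(np.eye(5), 12, axis=0)
F /= np.linalg.norm(F, axis=1, keepdims=True)
K = F @ F.T
y = np.repeat(np.arange(5), 12)

# C-SVM on the precomputed kernel; Scikit-learn slices K consistently
# (rows and columns) for each train/test split of the cross-validation.
scores = cross_val_score(SVC(kernel="precomputed", C=1.0), K, y, cv=5)
assert scores.mean() > 0.5   # well above the 0.2 chance level
```

In the actual experiment, the kernel values K[i, j] come from the quantum inner products and the best C is selected by an inner cross-validation loop.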
Lastly, let us check the probability of successfully removing the index register; recall from Theorem 2 that the lower bound is Ω(√n/a). Figure 8 shows the success probabilities when Eq. (18) is used as the encoding function, in which case the lower bound is Ω(√n/a) = Ω(n^{−5.5}). As shown in the figure, the actual success probability is much higher than the lower bound, indicating that the quantum algorithm for removing the index state is more efficient in practice than the theoretical bound suggests.

Discussion
In this paper, as the main result, we provided the condition and the protocol for removing the index state with polynomial query complexity. The encoding function E(G, x) that extracts features from a subgraph x, given by Eq. (17) or (18), satisfies this condition, which allows us to construct a graph kernel that correctly reflects features of all subgraphs. We gave a proof-of-principle numerical demonstration on the problem of classifying various types of graph datasets containing graphs with at most 28 vertices, via a quantum simulator composed of 41 or 56 qubits. The proposed algorithm that efficiently removes the index states will be useful in various other problems, such as the task of counting the same words for text classification problems using the Bhattacharyya kernel [20].
We here give a remark on the choice of E(G, x). One might consider that an E(G, x) with more features, including, e.g., a cycle structure [36], and thus a bigger range, may lead to better classification performance, although it needs more query complexity for removing the index state and thereby constructing the kernel. (In particular, as is well known, if E(G, x) and x are in one-to-one correspondence, an exponential order of query complexity is needed for this task.) However, the important fact revealed by the numerical demonstration is that such a bigger-range encoding may not be required; actually, we found that E_ve(G, x) and E_ved(G, x) lead to almost the same classification performance. This is presumably because the proposed kernel covers all subgraphs. Hence, a relatively simple E(G, x) might be sufficient, which is an advantage of our quantum algorithm. In other words, existing kernels that do not cover all subgraphs may need to contain more features.
A disadvantage of the kernel method is that it incurs a heavy computational cost for calculating the inner products ⟨G_i|G_j⟩ for all pairs (G_i, G_j) contained in the training dataset, in order to construct the Gram matrix. The generalization of the switch test protocol given in Theorem 3 may give a solution to this issue. That is, we could have an algorithm that generates a superposition |1⟩|G_1⟩ + |2⟩|G_2⟩ + |3⟩|G_3⟩ + · · · and thereby efficiently constructs the Gram matrix by some means. This direction is worth investigating, as it is the scheme demanded in the field of kernel-based quantum machine learning.
The algorithms proposed in this paper are all difficult to implement on a near-term quantum device. A key approach may be to develop a valid relaxation of the protocols, because a very precise computation of the kernel value may not be necessary.

Supplementary Information
This Supplementary Information contains proofs of the theorems in the main text, properties of the SH kernel, some lemmas related to constructing E(G, x), and additional information on our numerical simulations.

Proof of theorems
Proof of Theorem 2. Suppose that the sets X_{v,h} (the sets of indices whose subgraphs have v vertices and feature value y_{v,h}) partition the index set, and rewrite Eq. (5) accordingly. To calculate the lower bound, we define the cost function L(|X_{v,1}|, ..., |X_{v,|Y_v|}|, λ), where λ denotes the Lagrange multiplier. Clearly, the uniform partition |X_{v,h}| = (n choose v)/|Y_v| maximizes L(|X_{v,1}|, ..., |X_{v,|Y_v|}|, λ). Thus, the probability that we obtain 0^n when measuring the index state is evaluated as Ω(√n/n^c), where we used Eq. (8) to have |Y_v| = Pol(v^{c−1}) ≤ βn^{c−1} with some constant β. Note that Ω(·) is defined as follows: p(n) = Ω(q(n)) if there exist n_0 and M > 0 such that n ≥ n_0 implies p(n) ≥ M q(n), for real-valued functions p(n) and q(n).
Proof of Theorem 3. According to Fig. 3, we perform the controlled-H^{⊗n} followed by the controlled-H^{⊗n′} on the state (13). Then, we have

|0⟩ 2^{−n} Σ_{k=1}^{a} |X_k| |0^n⟩ |y_k⟩ + |1⟩ 2^{−n′} Σ_{k=1}^{a′} |X′_k| |0^n⟩ |y′_k⟩ + (other terms).

The probability that we obtain 0^n via the index-state measurement is the squared norm of the first two terms. Owing to Eq. (6) applied to each branch, this probability is lower bounded by (a^{−1} + a′^{−1})/2. Therefore, we can remove the index state with probability at least (a^{−1} + a′^{−1})/2.
Proof of Theorem 4. We use the result given in [37]: for the empirical distribution P̂_{x_S} of a sequence of independent and identically distributed random variables and the true distribution P, the following inequality holds:

Pr(||P̂_{x_S} − P||_1 ≥ ε) ≤ 2^a exp(−S ε² φ(π_P)/4),   (21)

where φ(p) and π_P are given by

φ(p) = (1/(1 − 2p)) ln((1 − p)/p),   π_P = max_k min(P(y_k), 1 − P(y_k)).

Here we assumed P(y_k) < 1/2, which in fact holds in our case. Note that δ in Eq. (20) is defined by the right-hand side of Eq. (21). Now, because the true probability distribution P is given by Eq. (19), we obtain, from Eq. (7), the following inequality:

(1/(n/2)^{c−1}) (n choose n/2) (1/2^n) ≤ π_P ≤ (n choose n/2) (1/2^n).

Thus, φ(π_P) is of the order Ω(ln n), and δ can be evaluated as δ ∼ 2^a exp(−S ε² Ω(ln n)) from Eq. (21). As a result, we find S = O((a − log δ)/(ε² log n)); therefore, for constant ε and δ, we arrive at S = O(a/log n).

Properties of the SH kernel
Positive semidefiniteness

First, we prove that the SH kernel is positive semidefinite, which is necessary to construct a valid classifier based on the SH kernel.
Proof. We use the following general fact: if κ_1 and κ_2 are positive semidefinite kernels, then the product κ(G, G′) = κ_1(G, G′)κ_2(G, G′) is also a positive semidefinite kernel. Now, because f_G^T f_{G′} is a positive semidefinite kernel, it suffices to prove that k_SH(G, G′) is positive semidefinite. For this purpose, we define q_{x,y} = 2^{n_y}/2^{n_x} for graphs G_x and G_y with n_x and n_y vertices, respectively. Below we show that the matrix (k_SH(G_x, G_y))_{x,y=1,...,N} is represented as Eq. (22), meaning that k_SH(G, G′) is positive semidefinite. The proof is divided into the three cases (i), (ii), and (iii).
(i) The case x = y.
(ii) The case x < y. Here, we define ρ(t) and σ(t) (t ∈ [0, x − 1]). When t ≥ 1, q_{x,y} = 2^{n_y}/2^{n_x} = (2^{n_y}/2^{n_t})/(2^{n_x}/2^{n_t}) = q_{t,y}/q_{t,x} holds, and thus we obtain the corresponding representation for this case.
(iii) The case x > y. The result directly follows by exchanging x and y in the discussion given in case (ii). Summarizing, we have Eq. (22), meaning that the kernel K_SH(G, G′) is positive semidefinite.
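The general fact invoked at the start of the proof, namely the Schur product theorem (the elementwise product of two positive semidefinite matrices is again positive semidefinite), can be checked numerically on random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    """Random positive semidefinite Gram matrix A = B B^T."""
    b = rng.normal(size=(n, n))
    return b @ b.T

# Schur product theorem: the elementwise (Hadamard) product of two PSD
# Gram matrices, i.e. the Gram matrix of a product kernel, is again PSD.
k1, k2 = random_psd(8), random_psd(8)
hadamard = k1 * k2
eigs = np.linalg.eigvalsh(hadamard)
assert eigs.min() > -1e-9   # no negative eigenvalues up to round-off
```

This is exactly the property that lets the product of the count-vector kernel f_G^T f_G′ and the positive semidefinite coefficient k_SH yield a valid kernel.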

Relationship with the Bhattacharyya kernel
In this paper, we proposed two kernels: the Bhattacharyya (BH) kernel (11), arising in the case of the swap test, and the SH kernel (15), arising in the case of the switch test. The following relationship holds: Lemma 2. The BH kernel (11) is greater than or equal to the SH kernel (15).
Proof. Note that f_G^T f_{G′} is common in both kernels. From the inequality of the arithmetic and geometric means, we obtain k_SH(G, G′) ≤ k_BH(G, G′), and hence K_SH(G, G′) ≤ K_BH(G, G′). This result means that, as a similarity measure, the SH kernel is more conservative than the BH kernel (note that both kernels are upper bounded by 1, because they originate from the inner product ⟨G|G′⟩). Figure 9 depicts the relation between these kernels for the MUTAG training dataset; this result implies that, in general, the difference between the kernel values is tiny, but, as mentioned above, there can be a solid difference in the prediction accuracy in view of the conservativity of the corresponding classifiers.

Lemmas for constructing E(G, x)

Lemma 3. The time complexity of adding 1 to the log k qubit state |k⟩ is O((log k)²).

Proof. We use the multi-controlled NOT gate C^{i−1}X, which operates the NOT gate X on the ith target qubit if the values of the control qubits (i.e., the 0th, 1st, ..., (i − 1)th qubits) are all 1. The adder of 1 to k can be realized by repeatedly applying C^{i−1}X for i = 1, ..., log k on the state |k⟩. Now, C^{i−1}X is composed of O(i) Toffoli gates [38]; in other words, the time complexity of operating C^{i−1}X is O(i). Hence, the total time complexity of adding 1 to the state |k⟩ is at most Σ_{i=1}^{log k} O(i) = O((log k)²).

Lemma 4. The time complexity for preparing the feature state |#v⟩ of a subgraph x is O(n(log n)²).

Proof. The oracle is realized as a set of adders of 1 controlled by the index state |x⟩; that is, if the ith index qubit is |1⟩ (meaning that the ith vertex is contained in the subgraph x), then the corresponding controlled adder adds 1 to the feature state. The index runs over the n = |V| index qubits, and hence there are in total n controlled adders. Each controlled adder needs O((log n)²) time complexity due to Lemma 3. Therefore, the total time complexity is n × O((log n)²) = O(n(log n)²).

Lemma 5. The time complexity for preparing the feature state |#e⟩ of a subgraph x is O(|E|(log |E|)²).

Proof. The idea is the same as that of Lemma 4, except that each adder is controlled by the pair of index qubits representing an edge of the subgraph x. Because there are |E| such pairs, the total time complexity is |E| × O((log |E|)²) = O(|E|(log |E|)²). Lemma 6.
The time complexity for preparing the feature state |#dD⟩ (the number of vertices with degree D, where D = 1, 2, 3) of a subgraph x is O(n((log n)² + d(log d)²)).
Proof. Recall that d is the maximum degree of the graph G. First, we store the degree of the ith vertex within the subgraph x into an auxiliary register of log d qubits. This can be done by performing the adder controlled by each index qubit to which the target vertex is connected; this operation requires log d qubits and O(d(log d)²) time complexity. Next, if the stored value is D, we perform the adder, controlled by the auxiliary qubits, on the feature state. The required space for #dD is O(log n), because #dD ≤ n, and the time complexity of this adder is O((log n)²). Finally, we initialize the auxiliary qubits by the inverse operation. We perform this calculation recursively from the 0th vertex to the (n − 1)th vertex. As a result, the total time complexity is n × O((log n)² + d(log d)²) = O(n((log n)² + d(log d)²)).

Note on the simulation method
In the numerical experiment, we studied various types of graph datasets containing graphs with up to 28 vertices, in which case we need a quantum device composed of 56 qubits. A naive implementation of the numerical simulator would require a memory of 2^56 amplitudes, probably larger than 1 exabit depending on the accuracy, which is not realistic. Hence, we adopted parallel GPU computation; that is, we compute |E(G, x)⟩ for each subgraph x and store it in each GPU memory, which thus needs only a 2^28-element memory.

Data Availability
A complete set of the kernel values and the probabilities for removing the index state are available at https://github.com/TRSasasusu/GraphKernelEncodingAllSubgraphsQC.