Expressivity of Variational Quantum Machine Learning on the Boolean Cube

Categorical data plays an important part in machine learning research and appears in a variety of applications. Models that can express large classes of real-valued functions on the Boolean cube are useful for problems involving discrete-valued data types, including those which are not Boolean. To this date, the commonly used schemes for embedding classical data into variational quantum machine learning models encode continuous values. Here we investigate quantum embeddings for encoding Boolean-valued data into parameterized quantum circuits used for machine learning tasks. We narrow down representability conditions for functions on the $n$-dimensional Boolean cube with respect to previously known results, using two quantum embeddings: a phase embedding and an embedding based on quantum random access codes. We show that for any real-valued function on the $n$-dimensional Boolean cube, there exists a variational linear quantum model based on a phase embedding using $n$ qubits that can represent it and an ensemble of such models using $d<n$ qubits that can express any function with degree at most $d$. Additionally, we prove that variational linear quantum models that use the quantum random access code embedding can express functions on the Boolean cube with degree $ d\leq \lceil\frac{n}{3}\rceil$ using $\lceil\frac{n}{3}\rceil$ qubits, and that an ensemble of such models can represent any function on the Boolean cube with degree $ d\leq \lceil\frac{n}{3}\rceil$. Furthermore, we discuss the potential benefits of each embedding and the impact of serial repetitions. Lastly, we demonstrate the use of the embeddings presented by performing numerical simulations and experiments on IBM quantum processors using the Qiskit machine learning framework.


I. INTRODUCTION
Machine learning problems involving categorical data are prevalent across many domains. The range of a categorical variable lies in a finite set, and each element in this set can be associated with an integer. Thus one can map a single categorical variable to multiple binary variables. If our goal is to perform supervised learning, then this converts the problem into learning a real-valued function on the $n$-dimensional Boolean (hyper)cube $\mathbb{B}^n = \{0, 1\}^n$. This implies that models that can express large classes of functions on the Boolean cube are useful for problems involving categorical data.
In this article, we consider using variational quantum machine learning (VQML) [1] to fit real-valued functions of multiple binary variables. Thus, such models can be applied to regression or classification tasks. Beyond machine learning, variational quantum algorithms [2] have been applied to chemistry [3]-[5], combinatorial optimization [6], [7], quantum linear systems [8], [9], and the simulation of quantum dynamics [10], [11]. When applied to supervised learning, VQML consists of using parameterized quantum circuits (PQCs) built from two types of circuit blocks: embedding blocks, which encode the inputs into a quantum system, and trainable blocks, where learnable parameters are adjusted in order to optimize the output results. There are various methods for optimizing the learnable parameters in a hybrid classical-quantum iterative manner, including analytical gradients [12]-[16]. This paradigm has been used to construct various analogues to classical machine learning models applicable to supervised learning tasks [17]-[23].
VOLUME 4, 2016
As an example of the potential applicability of VQML, it has been observed that these models can be used to solve a variety of financial problems, such as fraud detection and creditworthiness determination [24]-[26].
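The reduction from categorical variables to binary variables described at the start of this section can be sketched as follows. This is a minimal illustration, not the encoding used in the experiments: the schema, the field names, and the helper functions `encode_category` and `encode_record` are hypothetical, and a plain binary index encoding is assumed.

```python
from math import ceil, log2

def encode_category(value, categories):
    """Map one categorical value to a tuple of bits via its integer index."""
    index = categories.index(value)              # associate the category with an integer
    width = max(1, ceil(log2(len(categories))))  # bits needed to index every category
    # Most-significant bit first.
    return tuple((index >> shift) & 1 for shift in reversed(range(width)))

def encode_record(record, schema):
    """Concatenate the bit encodings of all categorical fields of a record."""
    bits = []
    for field, categories in schema:
        bits.extend(encode_category(record[field], categories))
    return tuple(bits)

# Hypothetical schema: two categorical variables with 3 and 4 categories.
schema = [("color", ["red", "green", "blue"]),
          ("size", ["S", "M", "L", "XL"])]
b = encode_record({"color": "blue", "size": "M"}, schema)  # a point of the 4-bit Boolean cube
```

Under this assumed encoding, a supervised learning task over the two categorical variables becomes the task of fitting a real-valued function on $\{0, 1\}^4$.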
There has been an active line of studies to characterize the expressivity [27]-[29], the generalizability [30]-[38], and the trainability [39]-[41] of VQML models. However, the specific case of quantum models with discrete-valued inputs has not been investigated as extensively [42], [43].
A recent study by Schuld et al. [44] showed that the output of a VQML model, when it is represented by the expected value of an observable, can be expressed as a partial sum of a multidimensional Fourier series. The connection between VQML models and Fourier series was also observed in [45]. Recently, Caro et al. [35] derived generalization bounds for such models. The range of attainable frequencies is related to the quantum embedding used, and this range can be broadened by repeating the embedding sequentially, a process called data re-uploading [46], or by introducing additional sets of qubits and repeating the embedding in parallel for each set. The observable and trainable blocks control the coefficients of the Fourier basis elements in the partial sum. Since every function in $L^2([0, 2\pi]^n)$ can be represented by the limit of a Fourier series [47], VQML models can approximate any function in this space to arbitrarily small error, in $L^2$ norm, by using an embedding scheme that produces the required Fourier spectrum. This also assumes that the observable and trainable blocks can fit the Fourier coefficients to the desired error, which Schuld et al. assumed when deriving their results. Along similar lines, Goto et al. [48] demonstrated that models built from a linear combination of basis functions derived from quantum-enhanced feature spaces are universal for continuous functions. Similar to Schuld et al., the embeddings that Goto et al. used consisted of serial and parallel repetitions of simple encoding schemes. We show that variational linear quantum models, which do not make use of serial or parallel repetitions of a quantum embedding, are sufficient for representing functions on the Boolean cube. Variational linear quantum models use PQCs that consist of one embedding block and one trainable block.
For two quantum embeddings, we use Fourier analysis to derive the classes of real-valued functions on the Boolean cube that can be represented by variational linear quantum models. The number of qubits used only depends on the dimension of the input.
In this paper, we explore two research directions.
1) First, we consider a phase embedding, which encodes each input bit into the relative phase of a single-qubit state, so the number of qubits used equals the number of input bits. We show that any real-valued function on the Boolean cube, $\mathbb{B}^n$, can be represented by a variational linear quantum model that uses the phase embedding and that a classical ensemble, formed by summing the outputs of multiple models each using $d$ qubits, can express any function with degree $\leq d$. The degree of a function on $\mathbb{B}^n$ is the maximal Hamming weight over all $s \in \mathbb{B}^n$ where the Fourier transform is nonzero.
2) Then, we further consider a QRAC embedding, a quantum embedding that makes use of quantum random access codes (QRACs) [49], [50]. This embedding was introduced by Yano et al. [43] for encoding categorical data into variational quantum classifiers. We investigate the classes of functions expressible by variational linear quantum models using this embedding. Specifically, we show that any function of degree $d \leq \lceil\frac{n}{3}\rceil$ can be represented by a classical ensemble formed by summing the outputs of multiple QRAC-embedding-based variational linear quantum models each using $\lceil\frac{n}{3}\rceil$ qubits.
We note that the above results imply that for functions with degree $\leq \lceil\frac{n}{3}\rceil$ an ensemble of phase embedding models requires only $\lceil\frac{n}{3}\rceil$ qubits, which is the same number of qubits used with the QRAC embedding. However, it can still be beneficial to use the QRAC embedding in certain cases as discussed later.
Juntas form an important class of functions on Boolean domains. A $k$-junta is a function, Boolean or real-valued, that depends on at most $k$ out of the $n$ input bits. These functions are useful in computational learning theory [51] for modeling learning tasks where the data can be explained using a subset of the available features [52]. Such scenarios typically occur when applying supervised learning to real-world data sets [53]. There has been considerable progress in developing quantum computational learning theory [54]. For example, there exist algorithms in the query model [55] for both learning and testing $k$-juntas [56]-[59], some of which make use of both quantum and classical queries. By definition, if $k \leq \lceil\frac{n}{3}\rceil$, then the degree of the junta is guaranteed to be at most $\lceil\frac{n}{3}\rceil$, so the junta can be represented by an ensemble of linear quantum models that use the QRAC or phase embedding.
With regard to classical neural networks, there has been recent work investigating the learnability of parity functions [60] and real-valued functions on the Boolean cube [61]. There are also neural networks for Lattice Regression [62] and the recent Hierarchical Lattice Layer [63] for partially monotone regression.
We perform experiments on simulators and on IBM Quantum hardware to study the expressiveness of variational linear quantum models for low-degree functions. These experiments demonstrate the efficacy of the phase and QRAC embeddings for representing functions on the Boolean cube.

A. MAIN RESULTS
Summarizing, we list here the main contributions of this paper.
1) We show that for any function on the Boolean cube $\mathbb{B}^n$, there exists a variational linear quantum model with $n$ qubits based on a phase embedding such that the output of the quantum model agrees with the target function for all inputs. Additionally, we show that for any function with degree $\leq d$ there exists an ensemble of variational linear quantum models using the phase embedding and $d$ qubits such that the output of the ensemble agrees with the output of the target function for all inputs.
2) We then present sufficient conditions for variational linear quantum models using a QRAC embedding to be able to express functions of degree $d \leq \lceil\frac{n}{3}\rceil$ using $\lceil\frac{n}{3}\rceil$ qubits. Moreover, we then demonstrate that for any function of degree $d \leq \lceil\frac{n}{3}\rceil$ on the Boolean cube $\mathbb{B}^n$, there exists an ensemble of QRAC-based variational linear quantum models with $\lceil\frac{n}{3}\rceil$ qubits each such that the output of the ensemble agrees with the output of the target function for all inputs.
3) We test these two embeddings on low-degree functions on the Boolean cube via numerical experiments and on IBM superconducting quantum processors.
We note that results derived for the phase and QRAC embeddings were proven under the assumption of universal trainable gates and arbitrary observables that are diagonal in the computational basis.

B. PAPER ORGANIZATION
Section II reviews the Fourier analysis of functions with Boolean inputs and the use of PQCs for machine learning. Section III introduces embeddings for representing functions on the Boolean cube with PQCs. Then, we use tools from Fourier analysis to study the expressivity of variational quantum machine learning models that make use of these embeddings. In Section IV we apply variational quantum models, using either the phase or QRAC embeddings, to supervised learning problems involving low-degree functions on the Boolean cube. These experiments were run in simulation and on IBM Quantum hardware. Lastly, the appendices contain further computational elaborations of the topics discussed in the main text.

II. PRELIMINARIES
This section introduces the concepts necessary to understand the novel contributions of this paper. Particularly, it focuses on the Fourier analysis on the Boolean cube and gives an overview of the state of the art of variational quantum machine learning.

A. FOURIER ANALYSIS ON THE BOOLEAN CUBE
First, we briefly review the Fourier analysis of real-valued functions with Boolean inputs. This short review is based on the introduction by de Wolf [64]. We consider the $2^n$-dimensional real vector space $G := \{f : \mathbb{B}^n \to \mathbb{R}\}$. This space can be equipped with the following inner product:
$$ \langle f, g \rangle := \frac{1}{2^n} \sum_{b \in \mathbb{B}^n} f(b)\, g(b). \tag{1} $$
To every tuple $s \in \mathbb{B}^n$, we associate a function $\chi_s : \mathbb{B}^n \to \{\pm 1\}$ that is defined as follows:
$$ \chi_s(b) := (-1)^{s \cdot b}, \tag{2} $$
where $s \cdot b$ is given by the scalar product:
$$ s \cdot b := \sum_{i=1}^{n} s_i b_i. \tag{3} $$
The function $\chi_s$ depends on the parity of a subset, indicated by $s$, of the input bits. With respect to the inner product defined in (1), the set containing all $\chi_s$ forms an orthonormal basis for $G$, called the Fourier basis. The Fourier transform of a given $f \in G$, denoted by $\hat{f}$, is defined as follows:
$$ \hat{f}(s) := \langle f, \chi_s \rangle = \frac{1}{2^n} \sum_{b \in \mathbb{B}^n} f(b)\, \chi_s(b). \tag{4} $$
Because the set of all $\chi_s$ forms an orthonormal basis, it holds that any $f$ can be expressed as
$$ f = \sum_{s \in \mathbb{B}^n} \hat{f}(s)\, \chi_s, \tag{5} $$
where the value $\hat{f}(s)$ is the Fourier coefficient associated with $\chi_s$ and the set of all $\hat{f}(s)$ is called the Fourier spectrum of $f$. The degree of $f$ is the maximal Hamming weight over all $s \in \mathbb{B}^n$ such that $\hat{f}(s) \neq 0$.
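The definitions above can be checked by brute force on a small cube. The following is a minimal sketch, assuming the normalization $\hat{f}(s) = 2^{-n} \sum_b f(b) \chi_s(b)$; the target function (an AND of two bits) and all helper names are illustrative choices rather than part of the paper.

```python
from itertools import product

def cube(n):
    """All points of the Boolean cube {0,1}^n as tuples."""
    return list(product((0, 1), repeat=n))

def chi(s, b):
    """Parity character chi_s(b) = (-1)^(s.b)."""
    return -1 if sum(si * bi for si, bi in zip(s, b)) % 2 else 1

def fourier_transform(f, n):
    """Return {s: f_hat(s)} with f_hat(s) = 2^-n * sum_b f(b) chi_s(b)."""
    pts = cube(n)
    return {s: sum(f(b) * chi(s, b) for b in pts) / 2 ** n for s in pts}

def degree(f_hat, tol=1e-12):
    """Largest Hamming weight of any s with a nonzero Fourier coefficient."""
    return max((sum(s) for s, c in f_hat.items() if abs(c) > tol), default=0)

# Illustrative target: f(b) = b1 AND b2 on n = 3 bits (the third bit is ignored).
n = 3
f = lambda b: float(b[0] & b[1])
f_hat = fourier_transform(f, n)

# Inversion: f(b) = sum_s f_hat(s) chi_s(b) at every point of the cube.
reconstructed = {b: sum(c * chi(s, b) for s, c in f_hat.items()) for b in cube(n)}
```

For this example the only nonzero coefficients sit on $s \in \{000, 100, 010, 110\}$, so the function has degree 2 even though it is defined on three bits.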
In Section I, we introduced a $k$-junta as a function on $\mathbb{B}^n$ whose output only depends on $k$ of the $n$ input variables $b_1, b_2, \ldots, b_n$. For a given $k$-junta, suppose $C \subseteq [n] := \{1, \ldots, n\}$ contains the $k$ indices corresponding to the input variables that the junta depends on. It can be easily shown that the Fourier transform of a $k$-junta can only be nonzero on elements from the set
$$ \{s \in \mathbb{B}^n \mid s_i = 1 \implies i \in C\}. \tag{6} $$
Thus, the degree of the junta is bounded by $|C| = k$, so when $k \ll n$, a $k$-junta is guaranteed to also be a low-degree function. In the following sections, we will use these definitions to analyze the classes of functions on $\mathbb{B}^n$ that can be expressed by VQML models.
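The spectral support of a junta can be verified numerically. The sketch below assumes a hypothetical 2-junta on four bits given by an arbitrary lookup table, and uses 0-based indices for the relevant set $C$ (the text uses 1-based indices).

```python
from itertools import product

def chi(s, b):
    """Parity character chi_s(b) = (-1)^(s.b)."""
    return -1 if sum(si * bi for si, bi in zip(s, b)) % 2 else 1

def fourier_transform(f, n):
    """Return {s: f_hat(s)} with f_hat(s) = 2^-n * sum_b f(b) chi_s(b)."""
    pts = list(product((0, 1), repeat=n))
    return {s: sum(f(b) * chi(s, b) for b in pts) / 2 ** n for s in pts}

n = 4
C = {1, 3}  # 0-based indices of the relevant variables (hypothetical choice)
table = {(0, 0): 0.3, (0, 1): -1.2, (1, 0): 2.0, (1, 1): 0.7}  # arbitrary values
junta = lambda b: table[(b[1], b[3])]

f_hat = fourier_transform(junta, n)
support = [s for s, c in f_hat.items() if abs(c) > 1e-12]

# Every s in the support has its 1-entries inside C, so its weight is at most |C|.
inside_C = all(all(i in C for i, si in enumerate(s) if si) for s in support)
max_weight = max(sum(s) for s in support)
```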

B. VARIATIONAL QUANTUM MACHINE LEARNING
This section reviews relevant concepts of VQML [1]. Before moving to functions defined on the Boolean cube, we consider the task of fitting a real-valued function that is defined on an arbitrary set $A \subseteq \mathbb{R}^n$. For a continuous-valued range, this task is called regression, and for a discrete-valued range, it is called classification. The input data $x \in A$, stored on a classical memory, can be embedded into a quantum state by utilizing an $m$-qubit parameterized unitary operator $U(x)$, which is a unitary-operator-valued function of the $n$-dimensional vectors in $A$. The operator $U$ is called a quantum embedding. Let us consider a parameter-independent Hermitian observable $D$, defined to be diagonal with respect to the computational basis, and thus:
$$ D \in \mathrm{span}_{\mathbb{R}} \{I, Z\}^{\otimes m}. \tag{7} $$
We further define a parameterized observable $O_\theta$ with variational parameters $\theta$ as follows:
$$ O_\theta := W^\dagger(\theta)\, D\, W(\theta), \tag{8} $$
where $W(\theta)$ is a unitary operator implemented by a PQC.
The VQML model that we focus on in this work is the variational linear quantum model:
$$ f_\theta(x) := \mathrm{Tr}(O_\theta\, \rho(x)), \tag{9} $$
where $\rho(x) := U(x) |0^m\rangle\langle 0^m| U^\dagger(x)$ is the state of the system after the action of the unitary $U(x)$ and $\mathrm{Tr}$ is the trace operator. Essentially, $f_\theta(x)$ maps $x$ to a real number by taking the expectation of $O_\theta$ with respect to $\rho(x)$. This model is also called a quantum neural network [30], and in the context of classification, it has been called the explicit linear quantum classifier [65] or variational quantum classifier [19]. Since the expectation of an observable is continuous valued, for classification, some post-processing of the output is required to map it to the finite set of possible classes. We can implement the parameterized measurement by evolving $\rho(x)$ by the parameterized unitary operator $W(\theta)$ and then measuring $D$. We relate this sequence of operations to Equation (9) as follows:
$$ \mathrm{Tr}(O_\theta\, \rho(x)) = \mathrm{Tr}(W^\dagger(\theta) D W(\theta)\, \rho(x)) = \mathrm{Tr}(D\, W(\theta) \rho(x) W^\dagger(\theta)), \tag{10} $$
where we used the cyclic property of the trace.
The model in Equation (9) is linear in the sense of being a quantum analogue to the linear models [66] of classical machine learning [19], [67]. Explicitly, a linear model in classical machine learning is of the form:
$$ h(x) := \langle w, \phi(x) \rangle_F, \tag{11} $$
where $\phi : A \to F$ is called a feature map, and $F$ is the associated feature space. In addition, $w \in F$ is fixed for all inputs $x \in A$, but it is chosen to minimize some cost function by using an optimization procedure. For the model defined in Equation (9), $\phi(x) = \rho(x)$ maps $x$ into a quantum feature space, which contains $2^m \times 2^m$ density matrices representing quantum states called feature states. In addition, the observable $O_\theta$ represents a $\theta$-parameterized family of $w$'s. In this case, each $w$ in the family is a Hermitian matrix. The inner product in Equation (11) is now the Hilbert-Schmidt inner product. Lastly, a VQML model that interleaves embedding and trainable layers, which by (9) implies it is not a linear model, is known in the literature as a data re-uploading model [46].
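The two equivalent ways of evaluating the model, as the expectation of $O_\theta$ in the embedded state or as a measurement of $D$ after applying $W(\theta)$, can be compared on a single qubit. The following is a minimal sketch under assumed choices that are not taken from the paper: the embedding $U(x) = R_Z(x) H$, the trainable block $W(\theta) = R_Y(\theta)$, and the observable $D = Z$. The small matrix helpers are written out so the block needs only the standard library.

```python
import cmath
import math

def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def dagger(A):
    """Conjugate transpose."""
    return [[A[j][i].conjugate() for j in range(len(A))] for i in range(len(A[0]))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)], [1 / math.sqrt(2), -1 / math.sqrt(2)]]
Z = [[1, 0], [0, -1]]

def rz(x):
    """R_Z(x) = exp(-i x Z / 2)."""
    return [[cmath.exp(-0.5j * x), 0], [0, cmath.exp(0.5j * x)]]

def ry(t):
    """R_Y(t) = exp(-i t Y / 2), standing in for the trainable block W(theta)."""
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]

def rho(x):
    """Feature state rho(x) = U(x)|0><0|U(x)^dagger with U(x) = R_Z(x) H."""
    U = matmul(rz(x), H)
    psi = [[U[0][0]], [U[1][0]]]          # first column of U, i.e. U|0>
    return matmul(psi, dagger(psi))

def model_via_observable(x, theta):
    """f_theta(x) = Tr(O_theta rho(x)) with O_theta = W^dagger D W."""
    W = ry(theta)
    O = matmul(dagger(W), matmul(Z, W))
    return trace(matmul(O, rho(x))).real

def model_via_circuit(x, theta):
    """Same quantity as Tr(D W rho(x) W^dagger): evolve the state, then measure D."""
    W = ry(theta)
    evolved = matmul(W, matmul(rho(x), dagger(W)))
    return trace(matmul(Z, evolved)).real
```

For these particular choices both functions evaluate to $-\sin(\theta)\cos(x)$, which makes the roles of the embedding (the dependence on $x$) and the trainable block (the dependence on $\theta$) easy to see.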
The expressivity of both classical and quantum linear models solely depends on the feature map used to encode $x$, as both $w$ and $O_\theta$ only define linear functions in the feature space. The feature maps can be used to make $h$ or $f_\theta$ nonlinear as functions on the domain $A$. In supervised learning, the goal is to minimize the regularized empirical risk:
$$ \frac{1}{|T|} \sum_{i=1}^{|T|} \ell(h(x_i), y_i) + J(h) \tag{12} $$
over the labeled training set
$$ T := \{(x_i, y_i)\}_{i=1}^{|T|}, \tag{13} $$
where each $x_i \in A$ and each $y_i \in \mathbb{R}$. The functions $\ell$ and $J$ are called the loss and regularizer, respectively. Classically, when $\phi(x)$ is difficult to compute explicitly or to operate on, we instead utilize kernel methods. The kernel function induced by the feature map $\phi$ is defined as:
$$ k(x_i, x_j) := \langle \phi(x_i), \phi(x_j) \rangle_F. \tag{14} $$
Kernel methods consider $h$, in Equation (11), as a function in the reproducing-kernel Hilbert space (RKHS) generated by $k$. This kernel trick is effective when the functional form of $k$ is easier to evaluate than it is to explicitly compute the inner product between $\phi(x_i)$ and $\phi(x_j)$ as done in Equation (14). A common classical example is the Gaussian kernel, which efficiently computes inner products in an infinite-dimensional feature space [66]. Suppose $J(h) := z(\|h\|_k)$, where $z : [0, \infty) \to \mathbb{R}$ is strictly increasing, and $\|h\|_k$ is the norm of $h$ in the RKHS. Then, according to the representer theorem [68], any minimizer $h_{\min}$ of the regularized empirical risk (12) lies in the RKHS and is of the form:
$$ h_{\min}(\cdot) = \sum_{(x, y) \in T} \alpha_x\, k(\cdot, x). \tag{15} $$
If $\ell$ is convex and $J(h) = \beta \|h\|_k^2$, where $\beta \geq 0$, then the $\alpha_x$'s can be found by solving a convex optimization problem called kernel ridge regression. This requires computing the kernel matrix $K$, with entries $K_{i,j} = k(x_i, x_j)$. In addition, $K$ is a $|T| \times |T|$ real-valued symmetric matrix.
The quantum-kernel method, as originally stated [19], uses the fidelity kernel:
$$ k_Q(x_i, x_j) := |\langle \psi_i | \psi_j \rangle|^2, \tag{16} $$
where $|\psi_i\rangle = U(x_i) |0^m\rangle$, and $U$ is a quantum embedding. This kernel computes the Hilbert-Schmidt inner product between quantum feature states. Liu et al. [69] demonstrated a quantum speedup using such kernel methods for solving a discrete-log-inspired supervised learning problem. It has been observed that generalization can be difficult with the fidelity kernel; however, there exist heuristics [70], [71] and hyperparameter optimization techniques [72] to enable generalization. The kernel defined in Equation (16) is evaluated on a quantum device for all pairs of training data elements, which avoids performing classical operations on the $2^m$-dimensional statevectors $|\psi_i\rangle$. The entries of the corresponding kernel matrix are given to a classical computer to find the $\alpha_x$'s, which involves solving a convex optimization problem. In contrast, the training procedure for VQML models does not consist of computing the kernel. Instead of finding the $\alpha_x$'s, we optimize the variational parameters $\theta$, which can be a non-convex problem even when the loss and regularization functions are convex [73], [74]. Furthermore, the quantum-kernel method has access to all minimizers, $h_{\min}$, of the regularized empirical risk (12), which lie in the RKHS generated by $k_Q$. In contrast, the choices one makes for the PQC $W(\theta)$ and the observable $D$ restrict the set of functions that Equation (9) can represent to a subset of the RKHS, which may not contain $h_{\min}$. Even if $W(\theta)$ can enact arbitrary global unitaries on $m$ qubits, which requires the number of primitive gates to be exponential in $m$, the fact that $D$ is fixed prior to training still restricts the set of functions that can be learned. These observations have led the community to consider whether there is any benefit in using variational linear quantum models instead of quantum-kernel methods [67].
One potential benefit of variational models is that the number of circuit runs used to train the model with parameter-shift methods [12] scales as $O(\dim(\theta) \times |T|)$. For quantum-kernel methods, computing the kernel matrix requires $O(|T|^2)$ evaluations of $k_Q$. Thus, if the variational optimization of $\theta$ converges to an acceptable empirical risk value quickly enough, and $\dim(\theta) \ll |T|$, then the variational model can have an advantage over the quantum-kernel method. However, the overparameterization of quantum models, i.e. the case where $\dim(\theta) \gg |T|$, has also been investigated [27], [75]-[77]. Additionally, there are forms of regularization applicable to variational linear quantum models for which there is, currently, no analogue for quantum-kernel methods [65].
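The scaling comparison above can be made concrete with a rough count of circuit evaluations. This is only a back-of-the-envelope sketch under stated assumptions: a two-term parameter-shift rule (two shifted evaluations per parameter and data point), a fixed number of optimization steps, a symmetric kernel whose diagonal is not evaluated, and no accounting for measurement shots. The function names and the numbers used are hypothetical.

```python
def variational_circuit_runs(num_params, train_size, num_steps, shifts_per_param=2):
    """Evaluations for gradient-based training: steps * shifts * dim(theta) * |T|."""
    return num_steps * shifts_per_param * num_params * train_size

def kernel_circuit_runs(train_size):
    """Distinct off-diagonal entries of a symmetric |T| x |T| kernel matrix."""
    return train_size * (train_size - 1) // 2

# Hypothetical setting with dim(theta) << |T|.
variational = variational_circuit_runs(num_params=20, train_size=10_000, num_steps=50)
kernel = kernel_circuit_runs(train_size=10_000)
```

In this hypothetical setting the variational count is the smaller of the two, but the comparison flips if the optimizer needs many more steps or if $\dim(\theta)$ is comparable to $|T|$.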
Jerbi et al. [78] proved that VQML models, including data re-uploading ones, can be approximately reduced to variational linear quantum models that use additional ancillas and a quantum embedding whose kernel is the identity. This kernel is classically computable, and furthermore, a quantum-kernel method using the identity matrix as the kernel would simply overfit the data. The authors performed experiments demonstrating that VQML models, including variational linear quantum models, can still generalize better than quantum-kernel methods. This includes cases where regularization, $J$, was applied to the quantum-kernel method used. Thus, it appears that the connection between quantum-kernel methods and variational linear quantum models through the RKHS framework is limited. The authors of [78] noted that the generalization advantage that VQML models can have is based on the fact that, as mentioned, $O_\theta$ restricts the space of functions that the variational model can represent. Thus, $O_\theta$ not being able to realize arbitrary observables may actually be advantageous and act as additional regularization. Identifying these benefits is important because we will be heavily focusing on variational linear quantum models throughout our work. In Section III, we discuss quantum embeddings for encoding Boolean inputs, i.e. $A := \mathbb{B}^n$, into variational linear quantum models.

III. QUANTUM EMBEDDINGS FOR THE BOOLEAN CUBE
This section presents the main theoretical contributions of this article. We discuss two quantum embeddings for real-valued functions on the Boolean cube: the phase embedding (Section III-A) and the QRAC embedding (Section III-B). As mentioned in Section II-B, the nonlinearity in the input $x$ for models like Equation (11), both classical and quantum, comes from the feature map used. Thus, when analyzing the expressivity of a variational linear quantum model (9), we will fix the embedding scheme $\rho(b)$, and determine the class of functions on the Boolean cube that the model can represent. When proving theorems, we will assume $W(\theta)$ is universal so that it can enact arbitrary global unitaries on $m$ qubits, which implies that for any $m$-qubit unitary $V$ there exists $\theta$ such that $W(\theta) = V$. In addition, this means $W(\theta)$ may decompose into a number of primitive gates that is exponential in $m$. Universality implies that for any $m$-qubit observable $M$, there exists a setting of the parameters $\theta$ and a diagonal observable $D$ such that
$$ O_\theta = W^\dagger(\theta)\, D\, W(\theta) = M. \tag{17} $$
Based on the assumptions just mentioned, if for $g \in G := \{g : \mathbb{B}^n \to \mathbb{R}\}$ we can show the existence of an observable $M^{(g)}$ such that $\forall b \in \mathbb{B}^n : \mathrm{Tr}(M^{(g)} \rho(b)) = g(b)$, then there exists a variational linear quantum model using $O^{(g)}_\theta = M^{(g)}$, for some $D^{(g)}$ and parameter setting $\theta$. In Theorem 1, which applies to the phase embedding, we show the existence of such an observable $M^{(g)}$ for all functions $g$ on the Boolean cube. The phase embedding produces an $n$-qubit product state for an $n$-bit input. For the QRAC embedding, the observable $M^{(g)}$ exists for a subclass of functions $g$ with degree $\leq \lceil\frac{n}{3}\rceil$. Unfortunately, the Fourier transform of a function in this subclass cannot be nonzero on all elements of $\mathbb{B}^n$ with Hamming weight $\leq \lceil\frac{n}{3}\rceil$. The exact conditions are presented in Theorems 3 and 4. The product state $\rho_{\mathrm{QE}}(b)$ consists of $\lceil\frac{n}{3}\rceil$ qubits for an $n$-bit input vector.
We also present sufficient conditions under which a classical ensemble of VQML models can express functions on the Boolean cube. More specifically, we call the summation of the outputs of multiple variational linear quantum models a classical ensemble of quantum models, i.e.
$$ \sum_{i \in \mathcal{D}} f_{\theta_i}(x), \tag{18} $$
where each $f_{\theta_i}$, indexed over a set $\mathcal{D}$, uses the same embedding scheme. Prior work has dealt with quantum ensembles, i.e. superpositions, of models [79]-[82]. Our results on ensembles show the existence of a collection of observables $\{M^{(g,i)}\}_{i \in \mathcal{D}}$ indexed from some set $\mathcal{D}$ such that $\forall b \in \mathbb{B}^n$ one has
$$ \sum_{i \in \mathcal{D}} \mathrm{Tr}\big(M^{(g,i)} \rho(\phi_i(b))\big) = g(b), \tag{19} $$
where each $\phi_i$ classically preprocesses the input bits. The potential impact that classical preprocessing of the input data can have on VQML models was acknowledged in [19], [42], [44], [46]. Because the ensemble is the sum of the outputs of the multiple linear quantum models, if analytic-gradient learning is used, then the parameters of the models can be updated in parallel. For the phase embedding, an ensemble of $\binom{n}{d}$ models is sufficient to express any function of $n$ input Boolean variables and degree at most $d$; see Theorem 2. The preprocessing functions select subsets of the input variables. In the case of the QRAC embedding, according to Theorem 5, the class of expressible functions is all of those with degree $\leq \lceil\frac{n}{3}\rceil$, with the preprocessing functions being permutations, i.e. elements of the symmetric group on $n$ elements.
As mentioned earlier, data re-uploading models can be converted into variational linear quantum models [78]. This can be done approximately by introducing additional quantum registers to encode the gate parameters and additional controlled-rotation gates. There exist transformations that are exact, but they require either gate teleportation, which introduces additional classically-controlled rotation gates that are dependent on the input, or post selection. However, for the Boolean cube we show that standard variational linear quantum models are sufficient, and thus the mentioned transformations are not required. That said, repeating the phase or QRAC embeddings sequentially, even without inserting trainable gates between repetitions, can provide some benefits; see Appendix A.
In practice, to construct a model that can be efficiently implemented, we need to select $W(\theta)$ such that it decomposes into $O(\mathrm{poly}(m))$ primitive gates and select $D$ to be a linear combination of $O(\mathrm{poly}(m))$ elements from $\{I, Z\}^{\otimes m}$. Such choices will introduce regularization, as mentioned in Section II-B, and restrict the class of functions in the RKHS that can be represented. The goal of variational optimization will be to find a setting of the parameters, $\theta$, if it exists, such that $O^{(g)}_\theta = M^{(g)}$. In addition, even if such a choice of parameters exists, the ability to find $\theta$ through variational optimization will also depend on the loss function landscape. The loss landscape for variational models has been observed to be difficult to navigate in practice when the PQC used is highly expressive [39], [83]. The goal of the experiments in Section IV is to demonstrate two cases in which the optimization is possible.
Lastly, we comment on the related work of Thumwanit et al. [42]. The authors showed that Pauli rotations can be used to encode discrete-valued inputs using fewer qubits than the total number of input bits by making use of a classical preprocessing function that maps tuples of input bits to trainable rotation angles. This introduces additional trainable parameters that are not present in the phase and QRAC embeddings. Also, it is not guaranteed that there always exists a mapping of multiple input bits to rotation angles that is sufficient for expressing the target real-valued function.

A. PHASE EMBEDDING
The phase embedding that we investigate was considered in Schuld et al. [44] for continuous-valued inputs. They showed that if the phase embedding is repeated $r$ times sequentially with trainable blocks in between repetitions or repeated $r$ times in parallel, then the output of the variational model is expressible as the $r$-th cubic partial sum [47] of a Fourier series:
$$ f_\theta(x) = \sum_{\omega \in \{-r, \ldots, r\}^n} c_\omega\, e^{i \omega \cdot x}, \tag{20} $$
where $x$ is the $n$-dimensional input. The trainable blocks and observable control the $c_\omega$'s. These models can arbitrarily approximate functions in $L^2([0, 2\pi]^n)$ if we assume the freedom to choose arbitrarily deep trainable blocks and an arbitrary observable, and utilize arbitrarily many repetitions of the embedding layer. More specifically, $\forall \epsilon > 0$ and any function in $L^2([0, 2\pi]^n)$ there exists a model using some number $r$ of repetitions of the phase embedding that approximates that function to error $\epsilon$ in $L^2$ norm. In this section, we show that any function on the Boolean cube can be represented by a variational linear quantum model, i.e. the case where $r = 1$, that uses the phase embedding.
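One way to see why Boolean inputs behave differently from continuous ones is to restrict the Fourier sum to $x = \pi b$ with $b \in \{0, 1\}^n$: then $e^{i \omega \cdot x} = (-1)^{\omega \cdot b}$, so every integer frequency $\omega$ acts like the parity character $\chi_s$ with $s = \omega \bmod 2$. The following sketch checks this numerically; the coefficients $c_\omega$ are arbitrary illustrative values (chosen with $c_{-\omega} = \overline{c_\omega}$ so the sum is real), and the identification $x = \pi b$ is an assumption about how bits are mapped to angles.

```python
import cmath
from itertools import product
from math import pi

n, r = 2, 2
frequencies = list(product(range(-r, r + 1), repeat=n))

def coefficient(w):
    """Arbitrary illustrative c_w with c_{-w} = conj(c_w), so the sum is real."""
    return complex(1 + sum(w) ** 2, w[0] - 2 * w[1]) / 10

def fourier_sum(x):
    """Cubic partial sum: sum over w of c_w * exp(i w.x)."""
    return sum(coefficient(w) * cmath.exp(1j * sum(wi * xi for wi, xi in zip(w, x)))
               for w in frequencies)

def chi(s, b):
    """Parity character chi_s(b) = (-1)^(s.b)."""
    return -1 if sum(si * bi for si, bi in zip(s, b)) % 2 else 1

# Merge all frequencies that share the same parity pattern s = w mod 2.
collapsed = {s: 0 for s in product((0, 1), repeat=n)}
for w in frequencies:
    collapsed[tuple(wi % 2 for wi in w)] += coefficient(w)

# On Boolean inputs the trigonometric sum equals a sum over the 2^n parity characters.
max_error = max(
    abs(fourier_sum(tuple(pi * bi for bi in b))
        - sum(c * chi(s, b) for s, c in collapsed.items()))
    for b in product((0, 1), repeat=n))
```

Frequencies beyond $\{0, 1\}$ per input bit therefore add no new basis functions on the Boolean cube; they only change how the coefficient of each $\chi_s$ is assembled.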
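Let $H$ be the Hadamard operator, and
$$ X^{(b)} := \bigotimes_{i=1}^{n} X^{b_i}, \tag{21} $$
where $X$ is the Pauli-X operator. For $b \in \mathbb{B}^n$, we define the following to be the phase embedding unitary:
$$ U_{\mathrm{PE}}(b) := H^{\otimes n} X^{(b)}. \tag{22} $$
Moreover,
$$ \rho_{\mathrm{PE}}(b) := U_{\mathrm{PE}}(b)\, |0^n\rangle\langle 0^n|\, U_{\mathrm{PE}}^\dagger(b) \tag{23} $$
is the associated feature state. The following theorem summarizes the expressiveness of variational linear quantum models that make use of this embedding.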
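Theorem 1. For any $g : \mathbb{B}^n \to \mathbb{R}$, there exists an $n$-qubit observable $O$ such that $\mathrm{Tr}(O \rho_{\mathrm{PE}}(b)) = g(b)$ for all $b \in \mathbb{B}^n$.
Proof. Suppose $O$ is an observable such that in the computational basis its entries are $O_{k,j} = \hat{g}(k \oplus j)$. The phase-embedded state has amplitude $2^{-n/2} (-1)^{b \cdot k}$ on the basis state $|k\rangle$, and each $s \in \mathbb{B}^n$ arises as $k \oplus j$ for exactly $2^n$ pairs $(k, j)$. Then,
$$ \mathrm{Tr}(O \rho_{\mathrm{PE}}(b)) = \frac{1}{2^n} \sum_{k, j \in \mathbb{B}^n} (-1)^{b \cdot k} (-1)^{b \cdot j}\, \hat{g}(k \oplus j) = \frac{1}{2^n} \sum_{k, j \in \mathbb{B}^n} \chi_{k \oplus j}(b)\, \hat{g}(k \oplus j) = \sum_{s \in \mathbb{B}^n} \hat{g}(s)\, \chi_s(b) = g(b), \tag{24} $$
as required.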
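The construction in the proof can be checked by brute force for small $n$. The sketch below builds the observable with entries $\hat{g}(k \oplus j)$ and compares its expectation in the phase-embedded state with $g(b)$ on every input; the target function and the helper names are illustrative, and the state is represented directly by its amplitudes $2^{-n/2} (-1)^{b \cdot k}$ rather than by simulating a circuit.

```python
from itertools import product
from math import sqrt

def cube(n):
    """All points of the Boolean cube {0,1}^n as tuples."""
    return list(product((0, 1), repeat=n))

def chi(s, b):
    """Parity character chi_s(b) = (-1)^(s.b)."""
    return -1 if sum(si * bi for si, bi in zip(s, b)) % 2 else 1

def fourier_transform(g, n):
    """Return {s: g_hat(s)} with g_hat(s) = 2^-n * sum_b g(b) chi_s(b)."""
    pts = cube(n)
    return {s: sum(g(b) * chi(s, b) for b in pts) / 2 ** n for s in pts}

def xor(k, j):
    """Bitwise XOR of two bit tuples."""
    return tuple(a ^ b for a, b in zip(k, j))

def phase_state(b):
    """Amplitudes of the phase-embedded state: 2^(-n/2) * (-1)^(b.k) on basis state k."""
    n = len(b)
    return {k: chi(b, k) / sqrt(2 ** n) for k in cube(n)}

def expectation(O, psi):
    """<psi|O|psi> for a real state vector psi and an observable stored as O[(k, j)]."""
    return sum(psi[k] * O[(k, j)] * psi[j] for k in psi for j in psi)

# Arbitrary illustrative target on n = 3 bits.
n = 3
g = lambda b: 1.5 * b[0] - 2.0 * b[1] * b[2] + 0.25 * (b[0] ^ b[1] ^ b[2])
g_hat = fourier_transform(g, n)

# Observable from the proof: O[k][j] = g_hat(k XOR j), a real symmetric matrix.
O = {(k, j): g_hat[xor(k, j)] for k in cube(n) for j in cube(n)}

max_error = max(abs(expectation(O, phase_state(b)) - g(b)) for b in cube(n))
```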
Additionally, $O$ can always be diagonalized, $O = V D V^\dagger$, and implemented by a variational linear quantum model with universal trainable gates, i.e. $W(\theta) = V^\dagger$. The model uses a measurement $D \in \mathrm{span}_{\mathbb{R}} \{I, Z\}^{\otimes m}$. Each diagonal Pauli tensor in the linear combination that represents $D$ can be measured using separate circuit runs, and the expectation value is then scaled by the corresponding coefficient of that diagonal Pauli tensor. In practice, $D$ is typically chosen to be a simple and easy-to-evaluate observable, such as $Z^{\otimes m}$, instead of being chosen arbitrarily from $\mathrm{span}_{\mathbb{R}} \{I, Z\}^{\otimes m}$ as done in the proof of Theorem 1. In that case, even if $W(\theta)$ can implement arbitrary unitary operators, the model can only implement observables that have the same spectrum as $D$, i.e. $\|g\|_\infty$ remains bounded. We leave as an open question whether interesting classes of functions exist that can be expressed when the spectrum of $O_\theta$ is fixed, i.e. fixed $D$ but varying $W(\theta)$.
If we have prior knowledge that the function is a $k$-junta, we can potentially reduce the number of qubits that the trainable portion of the model acts on by utilizing a variational SWAP network. This is discussed in more detail in Appendix B.
The name "phase embedding" comes from the fact that it maps the input bits to $|\pm\rangle$, i.e. X-basis states. Equivalently, the phase embedding unitary can be replaced by $Z^{(b)} H^{\otimes n}$, where $Z$ is the Pauli-Z operator. The phase embedding for continuous inputs, described in [44], was of the form:
$$ \bigotimes_{i=1}^{n} R_Z(x_i) H, \tag{25} $$
where $R_Z$ is a rotation generated by $Z$, and $x \in [0, 2\pi]^n$. More specifically, this is the operator $R_Z(\theta) := e^{-i \frac{\theta}{2} Z}$. The connection to Equation (22) is made by using the restriction of Equation (25) to the set $\{0, \pi\}^n$, with $x = \pi b$, which can be seen to be equal to
$$ Z^{(b)} H^{\otimes n} \tag{26} $$
up to a global phase. Based on the proof given above, we define the Fourier coefficients of a variational linear quantum model, Equation (9), that uses the phase embedding to be
$$ \hat{f}_\theta(s) := \frac{1}{2^n} \sum_{\substack{k \in \mathbb{B}^n,\ j \in \mathbb{B}^n \\ k \oplus j = s}} (O_\theta)_{k,j}. \tag{27} $$
This definition will be used in Section IV to see how well the model was able to fit the Fourier coefficients of the target function using supervised learning.
We now present sufficient conditions under which an ensemble of models utilizing the phase embedding can represent functions on $n$ bits utilizing fewer than $n$ qubits. Let $\mathrm{wt}(\cdot)$ compute the Hamming weight of a binary vector. Additionally, for each $d \in [n]$, consider the set
$$ C_{n,d} := \{(w_1, \ldots, w_d) \in [n]^d \mid w_1 < w_2 < \cdots < w_d\}, \tag{28} $$
which contains an ordered $d$-tuple for each subset of $[n]$ of size $d$. Thus, it follows that $|C_{n,d}| = \binom{n}{d}$. For each $w \in C_{n,d}$, we define a function $\nu_w : \mathbb{B}^n \to \mathbb{B}^d$ by $\nu_w(b) := (b_{w_1}, \ldots, b_{w_d})$, i.e. $\nu_w$ selects a subset of the entries of $b$ indicated by $w$.
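The Fourier coefficients of a phase-embedding model and the index set $C_{n,d}$ with its selection maps $\nu_w$ can both be written down in a few lines. In the sketch below, the matrix `O` is an arbitrary real symmetric stand-in for $O_\theta$ with illustrative values, and the helper names (`model_coefficient`, `model_output`, `index_tuples`, `nu`) are hypothetical.

```python
from itertools import combinations, product
from math import comb

def cube(n):
    """All points of the Boolean cube {0,1}^n as tuples."""
    return list(product((0, 1), repeat=n))

def chi(s, b):
    """Parity character chi_s(b) = (-1)^(s.b)."""
    return -1 if sum(si * bi for si, bi in zip(s, b)) % 2 else 1

def xor(k, j):
    """Bitwise XOR of two bit tuples."""
    return tuple(a ^ b for a, b in zip(k, j))

def model_coefficient(O, s):
    """2^-n times the sum of O[(k, j)] over all pairs with k XOR j = s."""
    n = len(s)
    return sum(O[(k, xor(k, s))] for k in cube(n)) / 2 ** n

def model_output(O, b):
    """<psi(b)|O|psi(b)> for the state with amplitudes 2^(-n/2) * (-1)^(b.k)."""
    pts = cube(len(b))
    return sum(chi(b, k) * O[(k, j)] * chi(b, j) for k in pts for j in pts) / 2 ** len(b)

def index_tuples(n, d):
    """C_{n,d}: increasing d-tuples of 1-based indices, one per size-d subset of [n]."""
    return list(combinations(range(1, n + 1), d))

def nu(w, b):
    """nu_w(b): select the entries of b indexed (1-based) by w."""
    return tuple(b[i - 1] for i in w)

# Arbitrary real symmetric matrix standing in for O_theta (illustrative values only).
n = 2
pts = cube(n)
O = {(k, j): float((pts.index(k) + 1) * (pts.index(j) + 1) % 5) for k in pts for j in pts}

# The model output on each input is the Fourier sum built from the model coefficients.
max_error = max(
    abs(model_output(O, b) - sum(model_coefficient(O, s) * chi(s, b) for s in pts))
    for b in pts)
```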
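Theorem 2. Let $g : \mathbb{B}^n \to \mathbb{R}$ have degree at most $d$. Then there exist $d$-qubit observables $\{O^{(w)}\}_{w \in C_{n,d}}$ such that $\sum_{w \in C_{n,d}} \mathrm{Tr}(O^{(w)} \rho_{\mathrm{PE}}(\nu_w(b))) = g(b)$ for all $b \in \mathbb{B}^n$.
Proof. Let $\eta_w : \mathbb{B}^d \to \mathbb{B}^n$ be the right inverse of $\nu_w$ that sets all entries with indices not in $w$ to zero, i.e. $\nu_w \circ \eta_w$ is the identity function on $\mathbb{B}^d$. For all $s \in \{b \in \mathbb{B}^n \mid \mathrm{wt}(b) \leq d\}$, let $k_s := \binom{n - \mathrm{wt}(s)}{d - \mathrm{wt}(s)}$, which is the number of $w \in C_{n,d}$ that contain every index $i$ with $s_i = 1$. For each $w \in C_{n,d}$, let $\bar{w} := \{s \in \mathbb{B}^n \mid s_i = 1 \implies i \in w\}$ and define the function
$$ g^{(w)} := \sum_{s \in \bar{w}} \frac{\hat{g}(s)}{k_s}\, \chi_s \circ \eta_w. \tag{29} $$
Note that for all $\chi_s$ in (29), $\chi_s(\eta_w \circ \nu_w(b)) = \chi_s(b)$, since $\chi_s$ with $s \in \bar{w}$ only depends on the input bits indexed by $w$. Thus,
$$ \sum_{w \in C_{n,d}} g^{(w)}(\nu_w(b)) = \sum_{w \in C_{n,d}} \sum_{s \in \bar{w}} \frac{\hat{g}(s)}{k_s}\, \chi_s(b) = \sum_{s : \mathrm{wt}(s) \leq d} \hat{g}(s)\, \chi_s(b) = g(b), $$
since $\frac{\hat{g}(s)}{k_s} \chi_s$ appears $k_s$ times in the sum. By Theorem 1, for each $w$ there exists a $d$-qubit observable $O^{(w)}$ that represents $g^{(w)}$, so
$$ \sum_{w \in C_{n,d}} \mathrm{Tr}\big(O^{(w)} \rho_{\mathrm{PE}}(\nu_w(b))\big) = g(b), $$
as required.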
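The splitting used in the proof can be tested numerically without simulating any circuits, since each component is itself a function on $d$ bits. The sketch below assumes the weight $k_s = \binom{n - \mathrm{wt}(s)}{d - \mathrm{wt}(s)}$, i.e. the number of size-$d$ index sets that cover the support of $s$; it uses 0-based indices, and the degree-2 target on four bits is an arbitrary illustrative choice.

```python
from itertools import combinations, product
from math import comb

def cube(n):
    """All points of the Boolean cube {0,1}^n as tuples."""
    return list(product((0, 1), repeat=n))

def chi(s, b):
    """Parity character chi_s(b) = (-1)^(s.b)."""
    return -1 if sum(si * bi for si, bi in zip(s, b)) % 2 else 1

def fourier_transform(g, n):
    """Return {s: g_hat(s)} with g_hat(s) = 2^-n * sum_b g(b) chi_s(b)."""
    pts = cube(n)
    return {s: sum(g(b) * chi(s, b) for b in pts) / 2 ** n for s in pts}

def ensemble_components(g, n, d):
    """Split a function of degree <= d into one d-bit function per index tuple w."""
    g_hat = fourier_transform(g, n)
    components = {}
    for w in combinations(range(n), d):            # 0-based version of C_{n,d}
        def g_w(c, w=w):
            # eta_w: put the d selected bits back at positions w, zeros elsewhere.
            b = [0] * n
            for pos, bit in zip(w, c):
                b[pos] = bit
            total = 0.0
            for s, coeff in g_hat.items():
                if all(i in w for i, si in enumerate(s) if si):   # support of s inside w
                    k_s = comb(n - sum(s), d - sum(s))            # number of w covering s
                    total += coeff / k_s * chi(s, b)
            return total
        components[w] = g_w
    return components

def ensemble_output(components, b):
    """Sum of the component outputs, each evaluated on its own selected bits nu_w(b)."""
    return sum(g_w(tuple(b[i] for i in w)) for w, g_w in components.items())

# Arbitrary illustrative degree-2 target on n = 4 bits, split into d = 2 bit components.
n, d = 4, 2
g = lambda b: 1.0 + 2.0 * b[0] * b[1] - 3.0 * b[2] + 0.5 * b[1] * b[3]
components = ensemble_components(g, n, d)
max_error = max(abs(ensemble_output(components, b) - g(b)) for b in cube(n))
```

Each component depends on only $d$ bits, so by Theorem 1 it could in principle be realized by a $d$-qubit phase-embedding model; the sketch only verifies the classical bookkeeping of the decomposition.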
This proof shows that the upper bound on the number of models in the ensemble, using the phase embedding, is $\binom{n}{d}$ for a degree $d$ function. When training, we can utilize a validation data set to determine if the size of the ensemble is sufficient. However, restricting the size of the ensemble also acts as a regularizer. Alternatively, we can make use of repetitions of the embedding to increase the expressivity of a single model that uses the phase embedding. Consider the following $m$-qubit embedding with $r$ data-encoding steps: $$Z^{(\nu_{w_r}(b))} V_{w_r} \cdots Z^{(\nu_{w_1}(b))} V_{w_1},$$ where $\{w_j\} \subset C_{n,m}$ for $m < n$. Note we have used the equivalent representation of the phase embedding presented in (26). The unitary operators $V_{w_j}$ could be fixed or trainable. If they are fixed, then the overall model is still a linear quantum model, i.e. not a data re-uploading model, as all embedding steps come before the trainable components. In either case, we ensure that $V_{w_1} = H^{\otimes m}$. In Appendix A-A, we demonstrate that a model using this embedding can be expressed as a non-trivial linear combination of all Fourier basis terms when $r = n/m$ and $m$ divides $n$. Thus it can express degree-$n$ functions on $B^n$ with fewer than $n$ qubits.
In what follows, we discuss a different scheme that encodes multiple bits into a single qubit.

B. QRAC EMBEDDING
In this section, we analyze a quantum embedding, first described by Yano et al. [43], based on QRACs. However, the authors did not perform any analysis of the expressivity of models that make use of it. QRACs have also been used to encode MaxCut problems solved by variational quantum optimization algorithms [84]. We develop sufficient conditions for variational linear quantum models using the QRAC embedding to be able to express functions on the Boolean cube. Moreover, we show that an ensemble of models each using this embedding and $\lceil\frac{n}{3}\rceil$ qubits is sufficient to represent any function on $B^n$ with degree $d \le \lceil\frac{n}{3}\rceil$. While this embedding can only provide a constant-factor saving in terms of the number of qubits, it could be impactful during the era of small and noisy quantum hardware provided that efficient and useful PQCs can be constructed. As with the phase embedding in Theorem 2, the qubit reduction provided by the QRAC embedding comes from a set of classical preprocessing functions, described below. We use the (3,1)-QRAC, which is a three-bits-to-one-qubit probabilistic encoding scheme. It was introduced by Ike Chuang and first mentioned in [49]. Thus, the number of (3,1)-QRACs used to encode an $n$-bit input is $\lceil\frac{n}{3}\rceil$, which requires padding with passive variables if the input $b$ is a tuple whose length is not divisible by 3. Note that this only increases the input length by at most two more zero bits and does not change the number of qubits used. Due to this, we assume, without loss of generality, $n \equiv 0 \bmod 3$.
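The padding step can be illustrated with a short helper. This is a minimal sketch; the function name `to_triplets` and the choice of appending the zero bits at the end of the input are assumptions made for illustration.

```python
from math import ceil


def to_triplets(bits):
    """Pad a bit sequence with zeros to a multiple of 3 and split it into triplets."""
    padded = list(bits) + [0] * (-len(bits) % 3)  # adds 0, 1, or 2 zero bits
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]


# A 4-bit input needs ceil(4 / 3) = 2 triplets, hence 2 qubits.
example = to_triplets([1, 0, 1, 1])
```

Here `example` holds the two triplets `(1, 0, 1)` and `(1, 0, 0)`, and the number of triplets always equals `ceil(n / 3)` for an `n`-bit input.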
We start by dividing the input $b$ into $\frac{n}{3}$ triplets $B_1, B_2, \ldots, B_{n/3}$, each of which is an element of $B^3$. The entries of the $i$-th triplet, $B_i$, are indexed by the symbols for Pauli operators: $B_i = (B_i^{(X)}, B_i^{(Y)}, B_i^{(Z)})$. In addition, we define $B_i^{(I)} := 0$. For some angles $\alpha_1$ and $\alpha_2$, the $i$-th triplet can be encoded in the single-qubit state $\sigma(B_i)_{\alpha_1,\alpha_2}$ of Equation (33). The unitary that takes $|0\rangle\langle 0|$ to $\sigma(B_i)_{\alpha_1,\alpha_2}$ can be expressed as a composition of two rotations, an $R_Y$ and an $R_Z$ rotation whose angles $\phi_Y$ and $\phi_Z$ depend on $B_i$ (Equation (34)). Specifically, when $\alpha_1 = \frac{\pi}{4}$ and $\alpha_2 = 2\cos^{-1}\sqrt{\frac{1}{2} + \frac{\sqrt{3}}{6}}$ we obtain the (3,1)-QRAC state [85]: $$\sigma(B_i) = U_{3,1}(B_i)\,|0\rangle\langle 0|\,U_{3,1}^{\dagger}(B_i) = \frac{1}{2}\Big(I + \frac{1}{\sqrt{3}}\big((-1)^{B_i^{(X)}} X + (-1)^{B_i^{(Y)}} Y + (-1)^{B_i^{(Z)}} Z\big)\Big), \tag{36}$$ where $U_{3,1}(B_i)$ is the operation in Equation (34) for the specific assignments of $\alpha_1$ and $\alpha_2$ mentioned above. These choices for $\alpha_1$ and $\alpha_2$ maximize the probability of recovering a single bit when measuring along one of the three Bloch-sphere axes. While we will be making use of the state in Equation (36), all of the results that follow would still hold if we had utilized any state with the form presented in Equation (33) and different $\alpha_1$ and $\alpha_2$, as long as all coefficients of the Pauli operators are nonzero. The reason for this is that the proofs that follow only depend on the relationships between the bits $B_i^{(X)}, B_i^{(Y)}, B_i^{(Z)}$ and the powers of $(-1)$ that appear in the coefficients of the Pauli terms.
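To make the single-qubit encoding concrete, the sketch below builds the (3,1)-QRAC state directly from its Bloch vector rather than from the two rotations. It assumes the standard form in which the Bloch vector is $\frac{1}{\sqrt{3}}\big((-1)^{B^{(X)}}, (-1)^{B^{(Y)}}, (-1)^{B^{(Z)}}\big)$ and in which bit value 0 corresponds to the $+1$ eigenstate of the matching Pauli operator; the function names are hypothetical.

```python
import itertools

import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def qrac_state(bx: int, by: int, bz: int) -> np.ndarray:
    """Density matrix with Bloch vector ((-1)^bx, (-1)^by, (-1)^bz) / sqrt(3)."""
    r = np.array([(-1) ** bx, (-1) ** by, (-1) ** bz]) / np.sqrt(3)
    return 0.5 * (I2 + r[0] * X + r[1] * Y + r[2] * Z)


def recovery_probability(rho: np.ndarray, pauli: np.ndarray, bit: int) -> float:
    """Probability that measuring `pauli` yields the eigenvalue (-1)^bit."""
    projector = 0.5 * (I2 + (-1) ** bit * pauli)
    return float(np.real(np.trace(rho @ projector)))


all_triplets = list(itertools.product([0, 1], repeat=3))
probabilities = [
    recovery_probability(qrac_state(*t), pauli, bit)
    for t in all_triplets
    for pauli, bit in zip((X, Y, Z), t)
]
```

Under these assumptions every entry of `probabilities` equals $\frac{1}{2}\big(1 + \frac{1}{\sqrt{3}}\big) \approx 0.789$, so each of the three bits is recovered with the same probability regardless of the encoded triplet.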
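Generalizing to $n > 3$, we can encode $b \in B^n$ by using $\frac{n}{3}$ qubits in the following product state: $$\rho(b) := \bigotimes_{i=1}^{n/3} \sigma(B_i). \tag{37}$$ Thus, the QRAC embedding unitary is $$U_{QE}(b) := \bigotimes_{i=1}^{n/3} U_{3,1}(B_i). \tag{38}$$ Lastly, $\forall m \in \mathbb{N}$, we define the sets $K^{QE}_m \subseteq B^{3m}$ as follows, where $\mathrm{wt}(\cdot)$ computes the Hamming weight of a binary vector. For a fixed $m$, $K^{QE}_m$ contains all elements, $s$, of $B^{3m}$ with Hamming weight at most $m$ such that $\forall i \in [m]$, the $i$-th triplet, $B_i$, of $s$ has Hamming weight at most one. These sets will play a role in the results that follow.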
The first result of Section III-B is Theorem 3. This result presents sufficient conditions for a function on the Boolean cube to be expressible by variational linear quantum models that use the QRAC embedding.

Proof. Let $m = \frac{n}{3}$. We start by expanding the tensor product in the definition of the quantum state in Equation (37) into a linear combination of simple tensors $P \in \{I, X, Y, Z\}^{\otimes m}$ whose coefficients are proportional to functions $\chi_P$ of the input (Equation (40)), where $|P|$ is the number of non-identity Pauli operators in the simple tensor $P$. Consider $\Phi : \{I, X, Y, Z\}^{\otimes m} \to K^{QE}_m$ defined as follows. For each $i \in [m]$, the bit triplet $(\Phi(P)_{3i-2}, \Phi(P)_{3i-1}, \Phi(P)_{3i})$ has a 1 in the first, second, or third position if and only if $P_i$ is $X$, $Y$, or $Z$, respectively, and it is a triplet of zeroes otherwise. We choose $O$ in the following way: $$O := \sum_{P \in \{I,X,Y,Z\}^{\otimes m}} 2^m\, 3^{|P|/2}\, \hat{g}(\Phi(P))\, P. \tag{42}$$
Since $\Phi$ is a bijection by construction, for each $s \in K^{QE}_m$ we can associate the Fourier basis element $\chi_s$ with a $\chi_P$ in Equation (40). It follows that the resulting model equals $g$ on all of $B^n$, as required.
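The structure behind this proof can be checked numerically for a small case. The sketch below assumes the standard (3,1)-QRAC state with Bloch vector $\frac{1}{\sqrt{3}}\big((-1)^{B^{(X)}}, (-1)^{B^{(Y)}}, (-1)^{B^{(Z)}}\big)$, uses a random Hermitian matrix as a stand-in for the observable, and computes the Fourier spectrum of the resulting two-qubit model on $B^6$ by brute force. All names are hypothetical.

```python
import itertools

import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def qrac_qubit(bx: int, by: int, bz: int) -> np.ndarray:
    """Single-qubit (3,1)-QRAC density matrix for one bit triplet."""
    r = np.array([(-1) ** bx, (-1) ** by, (-1) ** bz]) / np.sqrt(3)
    return 0.5 * (I2 + r[0] * X + r[1] * Y + r[2] * Z)


def qrac_embedding(bits) -> np.ndarray:
    """Product state over consecutive triplets of the input bits."""
    rho = np.array([[1.0 + 0j]])
    for i in range(0, len(bits), 3):
        rho = np.kron(rho, qrac_qubit(*bits[i:i + 3]))
    return rho


def fourier_support(O: np.ndarray, n: int, tol: float = 1e-9):
    """All s in B^n whose Fourier coefficient in b -> Tr[rho(b) O] exceeds tol."""
    inputs = list(itertools.product([0, 1], repeat=n))
    values = {b: np.real(np.trace(qrac_embedding(b) @ O)) for b in inputs}
    found = []
    for s in inputs:
        coeff = sum(
            values[b] * (-1) ** sum(si * bi for si, bi in zip(s, b)) for b in inputs
        ) / len(inputs)
        if abs(coeff) > tol:
            found.append(s)
    return found


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
O = (A + A.conj().T) / 2  # random Hermitian observable on m = 2 qubits
support = fourier_support(O, n=6)
```

Every element of `support` has at most one nonzero bit in each triplet, so the spectrum of a single unpermuted model is confined to at most $4^m = 16$ of the $2^6 = 64$ Fourier basis elements.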
Similar to Section III-A, we define the Fourier coefficients of a variational linear quantum model that uses the QRAC embedding in Equation (44). As mentioned in Section III-A, this definition will be used in Section IV to see how well the variational model was able to fit the Fourier coefficients of the target function. Lastly, similar to the phase embedding case, $O$ can be diagonalized into $V D V^{\dagger}$, where $D \in \mathrm{span}_{\mathbb{R}}\{I, Z\}^{\otimes m}$. We again leave as an open question if interesting classes of functions can be expressed when the spectrum of $O_\theta$ is fixed.
Before moving to our next result, we introduce another concept. With respect to an initial ordering of the input variables $(b_1, b_2, \ldots, b_n) = b$, we define the $\tau$-permuted model to be the model evaluated on the permuted input $\tau(b)$, where $\tau$ is any element of the symmetric group on $n$ elements, $S_n$. An element $\tau \in S_n$ acts on the tuples $b \in B^n$ by permuting the order of the entries, where $\tau(b)$ denotes the permuted tuple. The action of $\tau$ naturally extends to sets, and so it follows that $\tau$ maps $K^{QE}_m$ to the set $\tau(K^{QE}_m)$. Thus, permuting the input bits expands the class of functions expressible by a single VQML model using the QRAC embedding. Next, we present Theorem 4, which will be useful for extending the class of functions we can represent with the QRAC embedding and applies to $\tau$-permuted models.
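Theorem 4. For any $g \in G$, if $\exists \tau \in S_n$ such that $\forall s \in B^n$ the condition $\hat{g}(s) \neq 0 \implies s \in \tau(K^{QE}_m)$ holds, then $g$ can be represented by a $\tau$-permuted variational linear quantum model that uses the QRAC embedding.

Proof. The main argument is based on the simple fact stated in Equation (47), which relates the Fourier basis elements evaluated on a permuted input to those evaluated on the original input. Let $\Phi$ be as defined in the proof of Theorem 3, and suppose for $P \in \{I, X, Y, Z\}^{\otimes m}$, $s = \Phi(P)$ and $P'$ is such that $\Phi(P') = \tau(\Phi(P))$. This $P'$ exists because $\Phi$ is bijective. Then, using Equation (47), it follows that the $\chi_P$ of the permuted model can be matched with the Fourier basis elements indexed by $\tau(K^{QE}_m)$. If we replace $\hat{g}(\Phi(P))P$ with $\hat{g}(\tau(\Phi(P)))P$ in Equation (42), then the rest follows by using the same arguments made when proving Theorem 3.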
The class of functions that can be represented by ensembles of models, Equation (18), using the QRAC embedding is summarized in the following result. The proof makes use of techniques that are similar to those used in proving Theorem 2.
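Proof. Let $m = \frac{n}{3}$. By hypothesis, $g$ has degree at most $m$, i.e. $\hat{g}(s) \neq 0 \implies \mathrm{wt}(s) \le m$. For any $s$ such that $\mathrm{wt}(s) \le m$, let $k_s$ be the number of $\tau \in S_n$ such that $s \in \tau(K^{QE}_m)$. It can be easily seen that $k_s \neq 0$ for all such $s$ because $$\bigcup_{\tau \in S_n} \tau(K^{QE}_m) = \{s \in B^n \mid \mathrm{wt}(s) \le m\}.$$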

It is possible that for two different $\tau, \lambda \in S_n$, $\tau(K^{QE}_m) \cap \lambda(K^{QE}_m) \neq \{\mathbf{0}\}$, where $\mathbf{0}$ is the $n$-tuple with all zero entries. One reason is that $K^{QE}_m$ has a nontrivial stabilizer group under the action of $S_n$. For example, a permutation that just changes the order of $s_{3i-2}, s_{3i-1}, s_{3i}$ for some $i \in [m]$ and all $s \in K^{QE}_m$ is a nontrivial stabilizer. Thus multiple permuted models can effectively be identical, i.e. $K^{QE}_m$ can equal $\tau(K^{QE}_m)$. However, the proof still works if we do not exclude such cases.
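Next, for every $\tau \in S_n$, we define a new function $g^{(\tau)} : B^n \to \mathbb{R}$ as follows: $$g^{(\tau)} := \sum_{s \in \tau(K^{QE}_m)} \frac{\hat{g}(s)}{k_s} \chi_s.$$ By invoking Theorem 4, for each $\tau \in S_n$, there exists an observable $O^{(\tau)}$ such that the corresponding $\tau$-permuted model equals $g^{(\tau)}(b)$ for all $b \in B^n$. Thus, we will make use of the ensemble formed by these models over all $\tau \in S_n$, Equation (53). Since each $\frac{\hat{g}(s)}{k_s} \chi_s(b)$ appears $k_s$ times in the sum in Equation (53), the result follows.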
Similar to the ensemble of phase-embedding-based models, we can utilize a validation data set to determine if the size of the ensemble is sufficient. In addition, we note that a model that makes use of $U_{QE}$ may be less susceptible to overfitting due to higher-order Fourier basis elements not being accessible. We note that Theorem 2 implies that for any function with degree $d \le \frac{n}{3}$, there exists an ensemble of $\binom{n}{n/3}$ phase-embedding-based models using $\frac{n}{3}$ qubits that can express the function. Since $n!$, i.e. the cardinality of $S_n$, is larger than $\binom{n}{n/3}$, an ensemble of phase embedding models would be more desirable in this case. However, both sufficient conditions still require at least exponentially many models (factorially many in the QRAC case), which can become intractable. We leave as an open question if a smaller ensemble of QRAC-based models is sufficient for expressing interesting functions with $d \le \frac{n}{3}$. We note that a single QRAC-embedding-based model still has some beneficial properties. For example, a single phase-embedding-based model using $m < n$ qubits can only contain Fourier terms that involve $m$ out of the $n$ input variables, i.e. is an $m$-junta. However, a single QRAC-based model can express functions that are dependent on every input variable.
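The gap between the two sufficient ensemble sizes can be tabulated directly. The helper below is a small illustrative sketch (the function name and dictionary keys are hypothetical) that evaluates $n!$ and $\binom{n}{n/3}$ for inputs whose length is divisible by 3.

```python
from math import comb, factorial


def ensemble_bounds(n: int) -> dict:
    """Sufficient ensemble sizes for degree <= n/3 functions using n/3 qubits."""
    if n % 3:
        raise ValueError("n is assumed to be divisible by 3")
    return {
        "qrac_permutations": factorial(n),  # one model per element of S_n
        "phase_subsets": comb(n, n // 3),   # one model per (n/3)-subset of [n]
    }


bounds_6 = ensemble_bounds(6)
bounds_9 = ensemble_bounds(9)
```

For $n = 6$ this gives 720 permuted QRAC models versus 15 phase-embedding models, and for $n = 9$ it gives 362880 versus 84.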
Lastly, a single linear quantum model using multiple consecutive QRAC embeddings can express a larger class of functions than what was mentioned in Theorem 3. Consider replacing $U_{QE}$ with the operator in Equation (54), where the $V_k$ are arbitrary unitary operators that may or may not be trainable, and there are $r$ data-encoding steps. In Appendix A-B, we present a concrete example of the unitary operators in Equation (54) that produces a linear quantum model on a single qubit whose output is expressible as a nontrivial linear combination of all Fourier basis elements for $B^3$. This alternative operator, in the case where the $V_k$ are not trainable, could be used in place of $U_{3,1}$ in Equation (38) in the multi-qubit case. However, the degree of freedom that the trainable part of the model, $O_\theta$, has in choosing the coefficients of the $\chi_P$ is limited when compared to the ensemble approach.

IV. EXPERIMENTS
We present some experiments, in simulation and on hardware, to demonstrate scenarios in which it is possible to use the phase/QRAC embeddings in a variational linear quantum model to fit low-degree functions on the Boolean cube. All experiments were performed utilizing the Qiskit [86] machine learning framework. The code for executing the experiments in simulation is available online at https://doi.org/10.5281/zenodo.7805753. The goal is to show the expressivity of the models, i.e. demonstrating the theory in action, rather than assessing their ability to generalize to unseen data. Thus we provide the models access to all of the data to train on. More explicitly, the training set is $T = \{(b, g(b)) \mid b \in B^n\}$ for fitting the target function $g : B^n \to \mathbb{R}$. As discussed in Section II-B, the goal of such a supervised learning task is to minimize Equation (12). The loss function utilized for each experiment below is the square error between $f_\theta(b)$ and $g(b)$ over the training set, where $f_\theta$ is the model and $g$ is the target function. We did not utilize regularization in any experiment, and thus the regularization term in Equation (12) is zero. In the QRAC embedding case, for simplicity, we only make use of a single linear quantum model instead of an ensemble. Employing the notation from the previous sections, for all experiments, $W(\theta)$ is an $m$-qubit PQC consisting of single-qubit rotation gates, $R_Y$ and $R_Z$, and two-qubit controlled-Z gates using nearest-neighbor connections. Lastly, $O_\theta = W^{\dagger}(\theta) Z^{\otimes m} W(\theta)$. Here we have chosen $D$ from Section II-B to be $Z^{\otimes m}$. Such a selection of $D$ happens to be sufficient for the functions we consider in our experiments. As mentioned in Section III-A, this is not sufficient in general for either embedding. The functions were chosen this way so that the number of circuit runs on hardware could be reduced.
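The ingredients of this setup can be sketched with plain NumPy for a toy two-qubit case. The example below is not the circuit used in the experiments: it assumes a single layer of $R_Y$ and $R_Z$ rotations followed by one controlled-Z gate, takes the phase embedding to be $H^{\otimes 2} X^{(b)} |00\rangle$, and takes the loss to be the sum of squared errors over all of $B^2$ (whether the original loss is summed or averaged is an assumption here). All function names are hypothetical.

```python
import itertools

import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
CZ = np.diag([1.0, 1.0, 1.0, -1.0]).astype(complex)


def ry(t: float) -> np.ndarray:
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def rz(t: float) -> np.ndarray:
    return np.diag([np.exp(-0.5j * t), np.exp(0.5j * t)])


def observable(theta) -> np.ndarray:
    """O_theta = W^dagger (Z x Z) W for a one-layer, two-qubit W(theta)."""
    layer = np.kron(rz(theta[1]) @ ry(theta[0]), rz(theta[3]) @ ry(theta[2]))
    W = CZ @ layer
    return W.conj().T @ np.kron(Z, Z) @ W


def phase_state(b) -> np.ndarray:
    """|psi(b)> = (H x H) X^(b) |00>, i.e. a product of |+> and |-> states."""
    psi = np.array([1.0, 0.0, 0.0, 0.0], dtype=complex)
    flips = np.kron(np.linalg.matrix_power(X, b[0]), np.linalg.matrix_power(X, b[1]))
    return np.kron(H, H) @ flips @ psi


def model(theta, b) -> float:
    psi = phase_state(b)
    return float(np.real(psi.conj() @ observable(theta) @ psi))


def squared_error(theta, target) -> float:
    """Sum of squared errors over the full training set B^2."""
    return sum(
        (model(theta, b) - target(b)) ** 2
        for b in itertools.product([0, 1], repeat=2)
    )


def parity(b) -> float:
    """Target function (-1)^(b_1 XOR b_2)."""
    return (-1.0) ** (b[0] ^ b[1])


loss_at_zero = squared_error([0.0, 0.0, 0.0, 0.0], parity)
loss_at_fit = squared_error([np.pi / 2, 0.0, np.pi / 2, 0.0], parity)
```

In this toy case, setting both $R_Y$ angles to $\frac{\pi}{2}$ turns $O_\theta$ into $X \otimes X$, which reproduces the two-bit parity exactly, whereas at $\theta = 0$ the model outputs zero for every input. An optimizer such as COBYLA would be applied to `squared_error` as a black-box objective.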
The goal is to find a parameter setting for $\theta$ such that $O_\theta$ implements an observable $O^{(g)}$ for which the model output equals $g(b)$ for every $b \in B^n$. For each simulated and experimental result we display the function's values for different Boolean inputs as well as the Fourier coefficients of the learned quantum model in order to show the alignment between the predicted values and the experimental results. For both embeddings, using the final values of the parameters obtained at the end of training, we classically computed the matrix for $W(\theta)$, which corresponds to the trainable part of our model. Subsequently, we computed the matrix for $W^{\dagger}(\theta) Z^{\otimes m} W(\theta)$, which equals $O_\theta$. The Fourier coefficients of the linear quantum models were computed using Equations (27) and (44) and the matrix for $O_\theta$. Because the number of circuits (2 × number of parameters × size of training set × number of iterations) scales quickly for implementing optimization with the parameter-shift rule, we utilized the COBYLA [87] optimizer instead of standard parameter-shift rules and minibatch learning for both simulation and hardware experiments. The figures that follow show that both embeddings were able to fit the target function of $n$ bits, with the QRAC embedding using only one-third of the qubits compared to the $n$-qubit phase embedding.
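The circuit-count estimate quoted above is simple to evaluate. The snippet below is a minimal sketch; the function name is hypothetical, and the parameter count of 12 is an illustrative value rather than the number used in the experiments.

```python
def parameter_shift_circuits(num_params: int, train_size: int, iterations: int) -> int:
    """Circuits needed for full-batch training with two shifted circuits per parameter."""
    return 2 * num_params * train_size * iterations


# Hypothetical 12-parameter model, all 2^3 inputs as training data, 300 iterations.
circuits = parameter_shift_circuits(num_params=12, train_size=2 ** 3, iterations=300)
```

With these illustrative numbers the estimate is 57600 circuits, each of which would still need to be repeated for many shots on hardware.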
In Figure 2 we show experiments utilizing both the phase and QRAC embeddings to fit the function $g_3$ on $B^3$. Its functional form was chosen because, as shown in Section III-B, a single variational linear quantum model using QRAC without permuting the input can represent at most a degree 1 function using a single qubit. The values of the coefficients, $a_i$, were chosen so that setting $D = Z^{\otimes m}$ would be sufficient to express $g_3$. Three qubits were used in the phase embedding case and one qubit was used in the QRAC embedding case. The circuits that we used are displayed in Figure 1. For the hardware experiments, we applied readout-error mitigation and dynamical decoupling [88] implemented within Qiskit. Simulation was performed utilizing the statevector simulator. The hardware experiments were performed on the 16-qubit ibmq_guadalupe device. The phase embedding circuit used qubits 5, 8 and 9 and 300 iterations of the COBYLA optimizer, and the QRAC embedding circuit used qubit 8 and 150 iterations of the COBYLA optimizer. We executed 10,000 shots for each experiment so that readout-error mitigation could be applied.
Similar to the experiments shown above for the function $g_3(b)$ with 3-bit inputs, in Figure 4, we present experimental results for learning a function $g_6$ that depends on 6-bit inputs. The functional form of $g_6$ was chosen for a reason similar to the one behind the choice of $g_3$ in the previous experiment. A variational linear quantum model using QRAC on two qubits without permuting the input can express at most degree 2 functions. The coefficients were again chosen so that $Z^{\otimes m}$ would be sufficient as an observable. The circuits used are presented in Figure 3. Here six qubits were used for the phase embedding case while two qubits were used for the QRAC embedding case. Simulation was performed utilizing the statevector simulator. The hardware experiment for the QRAC embedding case was performed on the 7-qubit ibmq_casablanca device. The circuit used qubits 1 and 2 and 200 iterations of the COBYLA optimizer. Again, 10,000 shots were executed for each experiment so that readout-error mitigation could be applied. In this experiment we again observed close agreement between the predictions based on the theory and the experimental results.

V. CONCLUSION
We summarize the results obtained here and give a few remarks on the implications of our findings. First, we have used Fourier analysis to provide sufficient conditions for a function on the Boolean cube to be expressible via variational linear quantum models or ensembles of variational linear quantum models utilizing the phase and QRAC embeddings. We showed that for any function on the Boolean cube there exists a variational linear quantum model based on the phase embedding that can represent it (Theorem 1) and an ensemble of such models that can represent any degree $d$ function with $d$ qubits (Theorem 2). These results narrow down sufficiency conditions for the representability of functions on the Boolean cube. Previously known results were proven for functions in $L^2([0, 2\pi]^n)$, where representability sufficiency was achieved outside of the linear model framework. This was done by showing that repeating the phase embedding $r$ times sequentially (data re-uploading) or in parallel approximates the $r$-th cubic partial sum of a function's Fourier series (Equation (20)). We then showed, via Theorem 3 and Theorem 4, that a single linear quantum model using the QRAC embedding can express low-degree functions on $B^n$, if the function satisfies the property that the Fourier coefficient of $\chi_s$ being nonzero implies that $s \in K^{QE}_m$ (or $s \in \tau(K^{QE}_m)$ if we permute the input by $\tau \in S_n$). Lastly, we demonstrated that ensembles of linear quantum models that use quantum random access codes can represent functions on the Boolean cube with degree $d \le \lceil\frac{n}{3}\rceil$ (Theorem 5). The variational linear quantum models presented for learning functions on the Boolean cube can be easily applied to problems involving other discrete domains by converting integer representations to binary. Machine learning problems involving discrete-valued inputs appear frequently in industrial settings.
For example, categorical features are known to be essential for machine learning tasks in financial [89] and healthcare applications [90].
The results presented can be expanded in different directions. Future research can benchmark model ensembles that use the phase or QRAC embeddings. It would be interesting to further study the impact of classical preprocessing on VQML models, which we showed to be beneficial for both embeddings. Potentially, similar expressivity theorems, like those in Section III, can be demonstrated for linear quantum models that operate on discrete domains beyond the Boolean cube. For example, quantum computation is already known to provide significant computational speedups for problems involving finite Abelian groups [91].

FIGURE 2: Simulator and experimental results obtained from using (a) the phase embedding with three qubits and (b) the QRAC embedding with one qubit to fit the function $g_3$ with $a_1 = \frac{1}{2}$, $a_2 = -\frac{1}{10}$, $a_3 = \frac{1}{4}$. "Target" represents the exact outputs and Fourier spectrum of $g_3$. Both methods successfully fit the target function with high accuracy.
Furthermore, subsequent work could also compare the expressivities of classical neural networks to VQML models.
One could obtain an upper bound on the required size of the neural network using the fact that an arbitrary real-valued function on the Boolean cube is a linear combination of parities. It is folklore that a single hidden layer of size two suffices to express XOR on two input bits. Thus if one uses a divide-and-conquer approach, then a $d$-bit parity can be computed by combining $d - 1$ such two-bit XOR blocks, i.e. with $O(d)$ neurons. This appears to be comparable to the VQML case. For a function with an exponentially large set of nonzero Fourier coefficients, the neural network may require exponentially many neurons. In this case, VQML may require a diagonal observable that decomposes into exponentially many elements of $\mathrm{span}_{\mathbb{R}}\{I, Z\}^{\otimes n}$ and require an exponentially deep PQC. However, the Fourier space representation of the function may not be the most computationally efficient form, and thus the classical neural network could use fewer resources. Nevertheless, we note that, uniquely in the quantum setting, one can have a trivial learner from Fourier sampling that may give a quantum advantage [92], provided that one has access to uniform quantum examples (see also Appendix A of [93]). We leave a detailed comparison of these models as the topic of future work.
We performed proof-of-principle numerical experiments and executed the algorithms on IBM quantum processors. These experiments demonstrated that it is possible for a variational linear quantum model using the embeddings presented to learn sufficient parameters to express low-degree functions. In future developments, one could study the ability of such models to generalize to unseen data, i.e. truly learn, and quantify the number of training samples needed to learn functions, such as low-degree $k$-juntas. For simplicity, we set up the problem scenarios so that $D = Z^{\otimes m}$ was sufficient for all learning tasks. However, it would be interesting to experiment with more complicated problems where such a simple observable does not suffice. Potentially, there exist interesting classes of functions that can be expressed with a fixed $D$ that is a linear combination of only polynomially (in the number of qubits) many Pauli terms. Lastly, there might be cases where we can exploit the structure of the problem to design efficient PQCs, particularly for near-term quantum hardware, for learning functions on the Boolean cube.
(62) Suppose that $m = \frac{n}{r}$ and that the $\nu_{w_j}$ partition the $n$ input bits into $r$ tuples of size $m$. Then it follows that the above reduces to where Note that the set contains all elements of $B^n$. Thus, it is possible for the Fourier spectrum of this model to have support on any of the Fourier basis elements, which implies an increase in expressivity.

B. QRAC EMBEDDING
In this appendix, we present an example that shows that using multiple consecutive QRAC embeddings does enrich the class of functions that a single linear quantum model using this embedding can represent. Let $R_{\mathbf{n}}(\theta) = e^{-i\frac{\theta}{2}(n_1 X + n_2 Y + n_3 Z)}$, where $\mathbf{n} \in \mathbb{R}^3$ and $\|\mathbf{n}\|_2 = 1$. We will consider replacing $U_{3,1}$ with an instance of Equation (54) whose intermediate unitaries are rotations $R_{\mathbf{n}}$, where $\mathbf{n} = \frac{1}{\sqrt{3}}(1, 1, 1)$. Then, $$\cdots + \big(a_8 + a_9(-1)^{b_3} + a_{10}(-1)^{b_1+b_2} + a_{11}(-1)^{b_1+b_2+b_3} + a_{12}(-1)^{b_2+b_3} + a_{13}(-1)^{b_2} + a_{14}(-1)^{b_1+b_3}\big)\,\mathrm{Tr}[O_\theta Y] + \big(a_{15} + a_{16}(-1)^{b_1+b_3} + a_{17}(-1)^{b_2+b_3} + \cdots$$ where the $a_i$ are fixed, but we have abstracted them out because our focus is on the number of Fourier basis terms, $(-1)^{s \cdot b}$. This model is a linear combination of all Fourier basis elements for functions on $B^3$. The trainable component of the model determines the values for $\mathrm{Tr}[O_\theta P]$, where $P$ is a Pauli operator. For a given Pauli operator this value is shared by more than one Fourier basis element. For obtaining the expansion above, it is helpful to express the $R_Z$ and $R_Y$ rotations involved in $U_{3,1}$ in terms of $b_1, b_2, b_3$ with coefficients $c^{(Z)}$ and $c^{(Y)}$. The functions $\phi_Z$ and $\phi_Y$ are defined in Section III-B with $\alpha_1 = \frac{\pi}{4}$ and $\alpha_2 = 2\cos^{-1}\sqrt{\frac{1}{2} + \frac{\sqrt{3}}{6}}$. This formulation introduces dependence on terms of the form $(-1)^{s \cdot b}$ when expanding the expectation.

APPENDIX B USING VARIATIONAL SWAP NETWORKS IN THE PHASE EMBEDDING
When using the phase embedding, after loading the input bits onto a register with $X$ gates, we can apply a layer of variational SWAP gates, i.e. $e^{-i\frac{\beta}{2}\mathrm{SWAP}}$, with learnable parameters $\beta$. The layer consists of one variational SWAP between every pair of qubits, i.e. $\binom{n}{2} = \frac{n(n-1)}{2}$ gates. This allows for testing multiple combinations of the $k$ out of $n$ input bits in superposition. Specifically, setting all $\beta$'s to $\frac{\pi}{2}$ produces a uniform superposition containing all possible subsets of $k$ bits that can be swapped into the first $k$ bits. One motivation behind adding the SWAP network is due to the following lemma.

Proof. We can find a bijective mapping between the $k$ relevant bits and the first $k$ qubits, potentially acting as the identity on some qubits. This map can be expressed as a product of transpositions of the $n$ input elements that do not act on the same qubit. Thus, any all-to-all variational SWAP network can implement this map by enabling/disabling the relevant SWAPs.
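A single variational SWAP gate is easy to write down explicitly because $\mathrm{SWAP}^2 = I$, which gives $e^{-i\frac{\beta}{2}\mathrm{SWAP}} = \cos\frac{\beta}{2}\, I - i \sin\frac{\beta}{2}\, \mathrm{SWAP}$. The NumPy sketch below uses this closed form; the function name is a hypothetical choice.

```python
import numpy as np

SWAP = np.array(
    [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]],
    dtype=complex,
)


def variational_swap(beta: float) -> np.ndarray:
    """exp(-i * beta / 2 * SWAP), using SWAP @ SWAP = I for the closed form."""
    return np.cos(beta / 2) * np.eye(4) - 1j * np.sin(beta / 2) * SWAP


# Act on |01>: beta = 0 leaves it alone, beta = pi swaps it (up to a phase),
# and beta = pi / 2 gives an equal-weight superposition of |01> and |10>.
ket_01 = np.array([0, 1, 0, 0], dtype=complex)
half_swapped = variational_swap(np.pi / 2) @ ket_01
fully_swapped = variational_swap(np.pi) @ ket_01
```

Here `half_swapped` has amplitudes of magnitude $\frac{1}{\sqrt{2}}$ on both $|01\rangle$ and $|10\rangle$, which is the sense in which $\beta = \frac{\pi}{2}$ places both bit orderings in superposition, while `fully_swapped` equals $-i|10\rangle$.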
As a proxy for variational SWAPs, one could use the particle-preserving XY gate [7]. Establishing the practical benefit of a variational SWAP network would require further experimentation. For QRAC, it appears that we would need to encode the $n$ input bits into an additional quantum register, destroying the constant-factor reduction in qubits. Also, the bits-to-angle mapping for QRAC would need to be implemented coherently and the rotation gates controlled on additional ancillas.

APPENDIX C GENERALIZATION BOUNDS
While our focus is on expressivity, we can almost trivially apply one of the generalization bounds obtained by [35] to obtain one for the phase embedding and QRAC embedding. The following is the definition of a variational linear quantum model, $f_\theta$, that was presented in Section II-B: $f_\theta(b) := \mathrm{Tr}[\rho(b) O_\theta]$, where $O_\theta := W^{\dagger}(\theta) D W(\theta)$.
The operator D is an observable that is diagonal in the computational basis, and W (θ) is a parameterized-unitary operator. The unitary used to prepare the feature state ρ(b) can be the phase or QRAC embedding.
Theorem 6. Let $n, m \in \mathbb{N}$, and $\ell : \mathbb{R} \times \mathbb{R} \to [0, c]$ be a loss function that is $\beta$-Lipschitz in the second coordinate. In addition, consider an arbitrary $\delta \in (0, 1)$ and arbitrary probability measure $\mu$ on $B^n \times \mathbb{R}$. Furthermore, suppose $f_\theta$ is a variational linear quantum model using either the phase or QRAC embedding, then, with probability $\ge 1 - \delta$ over the choice of an i.i.d. training set $T$, the model $f_\theta$ satisfies: where $R$ and $\hat{R}_T$ are the generalization error and training error respectively. Additionally, $\tilde{O}$ suppresses polylogarithmic factors.
Proof. By Equation (25) the phase embedding can be viewed as a Pauli encoding with restricted domain. Thus Corollary 14 result (a) from [35] applies, and in the phase embedding case $n$ encoding gates are used for an $n$-dimensional input. Since quantum models based on the QRAC embedding can be viewed as PQCs with Pauli encodings on a $\frac{2n}{3}$-dimensional domain defined by $\phi_Z$ and $\phi_Y$ using $\frac{2n}{3}$ encoding gates, the same bound applies to QRAC-based models. Since $W$ is unitary, $\|O_\theta\|_2 = \|D\|_2$, and so the bound applies to $O_\theta$ too.

APPENDIX D EXPERIMENTAL DEVICE PARAMETERS
Here we report the experimental device parameters for each of the hardware experiments presented in Section IV. The experiments to fit functions on B 3 were carried out on the ibmq_guadalupe device, where qubits 5, 8 and 9 were used for the phase embedding experiment and qubit 8 was used for the QRAC embedding experiment. The experiment to fit a function on B 6 was carried out on the ibmq_casablanca device using qubits 1 and 2.