The Quantum Path Kernel: A Generalized Neural Tangent Kernel for Deep Quantum Machine Learning

Building a quantum analog of classical deep neural networks represents a fundamental challenge in quantum computing. A key issue is how to address the inherent nonlinearity of classical deep learning, a problem in the quantum domain because the composition of an arbitrary number of quantum gates, consisting of a series of sequential unitary transformations, is intrinsically linear. This problem has been variously approached in the literature, principally via the introduction of measurements between layers of unitary transformations. In this article, we introduce the quantum path kernel (QPK), a formulation of quantum machine learning capable of replicating those aspects of deep machine learning typically associated with superior generalization performance in the classical domain, specifically, hierarchical feature learning. Our approach generalizes the notion of the quantum neural tangent kernel, which has been used to study the dynamics of classical and quantum machine learning models. The QPK exploits the parameter trajectory, i.e., the curve delineated by model parameters as they evolve during training, enabling the representation of differential layerwise convergence behaviors, or the formation of hierarchical parametric dependencies, in terms of their manifestation in the gradient space of the predictor function. We evaluate our approach on variants of the Gaussian XOR mixture classification problem: an artificial but emblematic problem that intrinsically requires multilevel learning in order to achieve optimal class separation.


Introduction
Bridging classical deep neural networks and quantum computing represents a key research challenge in the field of quantum machine learning [1,2]. The potential for improvement offered by quantum computing in the machine learning domain may be characterized in terms of its impact on algorithmic efficiency, generalization error, or else its capacity for treating quantum data [3].
A notable recent result in the field has been the introduction of the concept of variational quantum algorithms and the related neural network analog referred to as the quantum neural network (QNN) [4]. This, in essence, consists of a feature map encoding data into a quantum Hilbert space, upon which certain parameterized unitary rotations are applied prior to final measurement in order to obtain a classification or regression output. The system as a whole is then optimized by classical methods. Such models provably lead to a computational advantage over classical models on certain artificial tasks [5], and with respect to the analysis of specific physical systems [6]. It has been quantitatively shown that QNNs can be trained faster than their classical analogues [4]. However, QNNs remain problematic in various respects. One limitation arises from the so-called barren plateau problem [7], in which the variance of the gradient vanishes exponentially with the system size as the parameterized transformation becomes increasingly expressive [8]. A number of approaches, including layerwise training of quantum neural networks [9], have been proposed to mitigate the issue.
A second problematic aspect of QNNs, and the one that constitutes our principal focus here, is the linearity of the dynamics of quantum systems. Concatenations of linear unitary transformations remain unitary, and thus 'stacked' quantum transformations, in effect, collapse to a single linear transformation, appearing to rule out de facto the hierarchical feature learning of classical deep neural networks, which relies on nonlinearities to separate feature layers. This property makes the QNN essentially a kernel machine [10]. In terms of the predictor function, however, the QNN is composed of multiplications of rotation operators parameterized by both the features and the model weights. The nonlinearity of projections of rotation operators can be exploited to replicate a very constrained form of nonlinearity for feature learning [11]. Another strategy is to introduce nonlinearity via the measurement operation, i.e. a dissipative QNN [12]. Both approaches involve the projection of the quantum state into a subspace of the original Hilbert space.
Much of the recent study of the dynamics of deep neural networks in the classical realm has focused on the Neural Tangent Kernel (NTK) [13], which represents the network in terms of the corresponding training gradients in the model parameter space. The NTK hence approximates the behavior of predictors via a linear model. It is therefore often applied to study neural networks in their asymptotic, infinite-width, limit. In this regime, the network exhibits lazy training [14], i.e. parameter gradients remain at their initial values during the entirety of training. The NTK thus accurately characterizes the dynamics of such infinite-width neural networks, but is otherwise only an approximation [15]. The difference in test error between the predictor and its linearized version depends on the problem structure [16], with hierarchical feature learning capability being crucial to obtaining superior performance [17]. However, the kernel nature of the NTK means that it shares with quantum computing a ready interpretation within a Hilbert space, and it is thus of considerable interest within quantum machine learning. The first explicit application of the NTK to quantum neural networks, the quantum neural tangent kernel (QNTK), was given in [18].
In this paper, we propose a method for overcoming the de facto lack of hierarchical feature learning capability in QNNs. We propose the application of Path Kernels [19] to QNNs, which we call the Quantum Path Kernel (QPK). Such an approach generalizes the QNTK so that the resulting kernel is representative of the ensemble of NTKs calculated over the full parameter path trajectory, i.e. the function describing the evolution of model parameters over time, including implicitly any parametric evolutions corresponding to hierarchical feature learning. We show experimentally an increased expressivity of the resulting model relative to linearized equivalents, evaluating our method on the Gaussian XOR mixture classification problem. For this problem, finite-width neural networks have been shown, both theoretically and empirically, to achieve close-to-optimal performance whereas linear NTK models fail [20], suggesting that it cannot be effectively resolved without implicating multilevel learning behavior. Furthermore, we discuss possible improvements to the proposed approach, which can be obtained by considering only the contributions of the parameter gradient path that give rise to the most decorrelated feature representations. These specifically correspond to the contributions associated with the maximally nonlinear points of the parameter path, corresponding to the largest (positive or negative) eigenvalues of the Hessian of the predictor function [21]. We further enhance the decorrelation between feature representations via a stochastic, noisy, or non-gradient-descent-based training algorithm in which the averaging operation between decorrelated representations allows us to interpret the model as an ensemble technique.
The paper is structured as follows. In Section 2 we briefly review the necessary conceptual background. In Section 3 we present the Quantum Path Kernel and discuss the hierarchical feature learning of the induced model. In Section 4 we demonstrate how this leads to superior performance in solving the Gaussian XOR mixture classification problem. In Section 5 we draw our conclusions and present directions for further work.

Contributions
• We propose the Quantum Path Kernel as a mechanism for building hybrid classical/quantum machine learning models which are able to emulate the hierarchical feature learning structure of deep neural networks without violating the underlying linearity of the quantum dynamics.
• We provide numerical evidence of the superior performance of the Quantum Path Kernel compared to the QNTK on the Gaussian XOR Mixture problem, which is Bayes optimally soluble only through implicating layerwise nonlinear separability.
• We consider the importance of the extraction of non-correlated feature representations corresponding to maximally varying portions of the parameter gradient path.

Related works
The introduction of the NTK by [13] has marked a significant step in the theory of machine learning, shedding new light on discussions regarding the relative performance of linear and nonlinear models. For example, [16] suggests that tasks in which kernel methods (including the NTK) perform worse than neural networks are those in which the kernel suffers from the curse of dimensionality whereas neural networks, in learning some useful lower-dimensional representation, do not. One example of such a problem is the Gaussian XOR Mixture classification task [20]. Furthermore, linearized models have been shown to perform slightly worse than wide (i.e. large, but non-infinite) neural networks on the CIFAR-10 benchmark [22], with the gap between the approaches increasing for finite-width networks [23].
In relation to quantum computation, researchers have spent substantial effort on the limitations imposed by the linear dynamics of quantum systems. The authors in [24] review early approaches to the formulation of nonlinear quantum machine learning models: some have focused on developing a quantum perceptron equivalent or quantum neuron, i.e. a candidate building block for the quantum analogue of neural networks; [25] uses phase estimation to implement the functioning of a step function; [26,27] propose to exploit the RUS (repeat until success) policy to mimic the behaviour of tangent and sigmoid activation functions, while [28] uses RUS to construct a Born machine; [29] emulates the nonlinearity of perceptrons using measurements. In relation to QNNs, [30] propose dissipative QNNs in which the nonlinearity is obtained via intertwining measurements between unitary gates; [31,32] propose the use of a larger Hilbert space to implement the nonlinear transformation, while [33] exploits the exponential form of the unitary gate to achieve periodic activation functions. Finally, nonlinear models of quantum mechanics have been conjectured by [34], although these violate some computational complexity assumptions [35].

Background
This section briefly introduces the key concepts and notations, in relation to deep learning and quantum machine learning, through which we develop our results. We denote by D = {(x_i, y_i)}_{i=1}^n ⊆ X × Y a labelled dataset of pairs that are i.i.d. sampled from an unknown probability distribution. We indicate the data vector space with X = R^d, and the target space with either Y = R or Y ⊆ Z, |Y| < ∞, for regression or classification tasks respectively. We indicate uniform sampling from a discrete distribution with ∼ {v_i}_{i=1}^n, and sampling from a normal distribution of mean µ and variance σ² with ∼ N(µ, σ²).

A primer on quantum machine learning models
Here we fix the notation for our quantum machine learning models. The state of a quantum system of m qubits is described by a density matrix ρ ∈ H ≡ C^{2^m × 2^m}. The initial state of a quantum computation is denoted by ρ_0 = |0⟩⟨0|, and (possibly parametric) unitary transformations by U, V, W. Any parametric unitary can be written as

U(θ) = exp(−iθ σ_{α_1,...,α_k}^{(q_1,...,q_k)}),

where α_i ∈ {x, y, z, 1} for i = 1, ..., k, and σ_{α_1,...,α_k} is a tensor product of one or more corresponding Pauli matrices applied to qubits q_1, ..., q_k. The same transformation may be interpreted as a rotation and be equivalently denoted by R_{α_1,...,α_k}^{(q_1,...,q_k)}(θ), where θ ∈ R^P are rotational angles. A quantum neural network is a function of the form:

f(x; θ) = Tr[O V(θ) U_φ(x) ρ_0 U_φ(x)† V(θ)†],

where U_φ is the feature map, V the variational form, and O indicates any measurement operator. Both the matrices U and V are decomposed into single- and two-qubit parametric rotations interspersed with non-parametric gates (e.g. CNOT).
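As an illustration of this notation, the following sketch (ours, not drawn from the paper's implementation; plain numpy) builds the Pauli-string rotation R_{α_1,...,α_k}(θ) using the identity exp(−iθP) = cos(θ)I − i sin(θ)P, which holds because any tensor product P of Pauli matrices satisfies P² = I:

```python
import numpy as np

# Single-qubit Pauli matrices, indexed by the labels used in the text
PAULI = {
    "1": np.eye(2, dtype=complex),
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(labels):
    """Tensor product sigma_{a1} x ... x sigma_{ak}."""
    out = PAULI[labels[0]]
    for a in labels[1:]:
        out = np.kron(out, PAULI[a])
    return out

def rotation(labels, theta):
    """exp(-i*theta*P): since P^2 = I for any Pauli string P,
    the matrix exponential reduces to cos(theta) I - i sin(theta) P."""
    P = pauli_string(labels)
    return np.cos(theta) * np.eye(P.shape[0]) - 1j * np.sin(theta) * P

U = rotation(["x", "z"], 0.4)                    # two-qubit rotation R_{x,z}(0.4)
assert np.allclose(U @ U.conj().T, np.eye(4))    # the result is unitary
```

Note also that rotations generated by the same Pauli string compose additively, `rotation(ls, a) @ rotation(ls, b) == rotation(ls, a + b)`, a concrete instance of concatenated unitaries collapsing to a single one, as discussed below.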

Notions of nonlinearity in classical and quantum learning models
With respect to both kernel machines and layerwise deep learning, the concepts of linear model, nonlinear model, and feature learning that we utilize here are as formalized in [37]. A linear model is thus a function of the form:

f(x; θ) = Σ_{j=0}^{p} θ_j φ_j(x),    (3)

where {φ_j : X → R}_{j=0}^p are the feature functions, whose values correspond with the model features. We might consider an additional feature φ_0 ≡ 1 that incorporates the bias. The formula in Equation 3 is linear with respect to the space of the parameters H ≡ R^p; in fact, we can interpret the function as an inner product in that space, i.e.

f(x; θ) = ⟨θ, φ(x)⟩,
with θ = (θ_1, ..., θ_p) and φ(x) = (φ_1(x), ..., φ_p(x)). The optimal parameters of such a model can be found analytically by solving the linear regression problem over the mean squared error loss, which is a convex, quadratic function of the parameters. The representer theorem guarantees that the optimal solution is a span of the n data points of the training set, which is independent of the dimensionality p of the space H. Obviously, a model which is linear in the parameters may well behave nonlinearly with respect to the original feature space X, due to the feature functions.
A nonlinear model is a function of the form:

f(x; θ) = Σ_{j} θ_j φ_j(x) + (ε/2!) Σ_{j,k} θ_j θ_k ψ_{j,k}(x) + ...    (5)

The higher-order terms of the expansion are characterized by their own set of features, e.g. {ψ_{j,k} : X → R}_{j,k=1}^p for the second-order term. The elements of such sets are unique up to a permutation of their variables; thus the terms 1/2!, 1/3!, ... compensate for the multiple counting of such elements in Equation 5. The term ε adjusts the contribution of the nonlinear terms. If the model is truncated at the second term it is denoted a quadratic model. In such a case, the loss function is quartic, so we cannot find the optimal parameters analytically as in linear regression. The dynamics of such a model are described by

f(x; θ) = Σ_j θ_j φ_j^E(x; θ),

where the φ^E are effective feature functions, i.e. features that depend on, and evolve with, the model parameters, which are learnt during the optimization phase. This behaviour can be generalized to terms of even higher order: the presence of order-n terms makes the feature functions of order n − 1 effective, which may further influence the lower-order terms. Models having effective feature functions have feature learning capabilities. A deep learning model is both capable of feature learning and composed of several nonlinear modules arranged in a hierarchical fashion [38], such that differing layers can follow differing (albeit hierarchically conditioned) gradient paths. Turning to QNNs, the quantum model

f(x; θ) = Tr[O V(θ) U_φ(x) ρ_0 U_φ(x)† V(θ)†],

where ρ_0 is the initial state and O a Hermitian observable, is a linear model in the space H of density matrices of the quantum system: the trace operation Tr[A†B] is an inner product for the space of matrices C^{k×k}. Such a property implies that the construction of a layer-wise architecture for V, i.e. V(θ) = Π_i V_i(θ), effectively collapses to a single operation: this may add more degrees of freedom to the linear transformation but cannot make the model nonlinear in H.
However, in terms of the predictor function f(x; θ), the quantum model does not necessarily fit the form set out in Equation 3, since the parameters of the QNN model, namely the angles of the rotation operations (in the form of imaginary exponential functions), are subject to the trace operation. Thus, for example, consider a single-qubit quantum model acting on a single input x ∈ R, depending on a single parameter θ ∈ R, with feature map U_φ(x) = exp(−ixσ_x), variational form V(θ) = exp(−iθσ_x) and measurement operator O = σ_z, in which case f(x; θ) has the form:

f(x; θ) = Tr[σ_z V(θ) U_φ(x) |0⟩⟨0| U_φ(x)† V(θ)†] = cos(2(x + θ)),

which is nonlinear in its weights. Clearly, if we were to consider a model other than a QNN then the predictor function would change, for example as in [29]; however, this does not alter our argument here.
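For the single-qubit model just described, the predictor evaluates to cos(2(x + θ)). The following sketch (our illustration, plain numpy) verifies this by direct matrix algebra:

```python
import numpy as np

# Pauli matrices and the initial state |0>
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
ket0 = np.array([1, 0], dtype=complex)

def rx(angle):
    """exp(-i * angle * sigma_x), via the Euler identity for Pauli rotations."""
    return np.cos(angle) * np.eye(2) - 1j * np.sin(angle) * sx

def f(x, theta):
    """Predictor <0| U(x)^dag V(theta)^dag sigma_z V(theta) U(x) |0>."""
    psi = rx(theta) @ rx(x) @ ket0     # V(theta) U_phi(x) |0>
    return (psi.conj() @ sz @ psi).real

# The predictor collapses to cos(2(x + theta)): nonlinear in the weight theta
x, theta = 0.3, 0.7
assert np.isclose(f(x, theta), np.cos(2 * (x + theta)))
```

The rotation angles x and θ enter the output only through a trigonometric composition, which is the source of the (constrained) nonlinearity discussed in the text.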
To recap, a QNN is a linear model in the Hilbert space of the density matrices due to the linearity of the evolution of closed quantum systems.However, its predictor is nonlinear in the parameter θ since its structure results in a composition of trigonometric functions.This potentially allows a limited degree of representational learning capability if aggregated layer-wise (limited in the sense of applying only to a highly constrained set of activation functions).However, due to the Lie algebraic equivalence of any given sequence of quantum transformations to some single unitary operation in the absence of the trace operation, we are still not able to characterise truly deep models in the quantum domain.

Characterization of model dynamics through the Neural Tangent Kernel
The output f(x; θ) of a machine learning model trained via (possibly stochastic) gradient descent can be approximated by a first-order Taylor expansion around the initialization θ_0:

f(x; θ) ≈ f(x; θ_0) + ∇_θ f(x; θ_0) · (θ − θ_0).

Such an approximation allows the representation of machine learners as linear (kernel) models via the Neural Tangent Kernel (NTK, [13]):

k_ntk(x, x′; θ) = ∇_θ f(x; θ) · ∇_θ f(x′; θ).

Such a tool has been used in [14] to characterize the dynamics of infinite-width neural networks, in which the NTK is independent of the random initialization and constant in time. On a coarse level of detail, we can assert that model training in the lazy-training regime, i.e. when the evolution of θ(t) during the training of the model f(x; θ) closely follows the tangent path, can be well approximated by the NTK. A more detailed analysis in [15] has revealed that the NTK is constant if and only if the model is linear (in its parameters). Such a result allows us to quantify the nonlinearity of a model through the Hessian norm of the predictor function: if ‖H_f‖ ≪ ‖∇_θ f‖ then the model is nearly linear. This has been used in [11] to analyze the behaviour of QNNs in the lazy training regime.
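Concretely, the empirical NTK can be computed as an inner product of parameter gradients. The sketch below (ours; the one-parameter predictor and finite-difference gradient are illustrative stand-ins, not the paper's implementation) evaluates k_ntk for the toy model f(x; θ) = cos(2(x + θ)) and shows that the kernel varies with θ, as expected for a model that is nonlinear in its parameters:

```python
import numpy as np

def f(x, theta):
    # toy predictor, nonlinear in its parameter (cf. the single-qubit QNN)
    return np.cos(2 * (x + theta[0]))

def grad_theta(x, theta, eps=1e-6):
    """Central-difference gradient of f with respect to the parameter vector."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += eps
        tm[j] -= eps
        g[j] = (f(x, tp) - f(x, tm)) / (2 * eps)
    return g

def k_ntk(x1, x2, theta):
    """Empirical NTK: inner product of parameter gradients at theta."""
    return grad_theta(x1, theta) @ grad_theta(x2, theta)

# Analytic check: grad = -2 sin(2(x+t)), so k = 4 sin(2(x1+t)) sin(2(x2+t))
t = np.array([0.5])
assert np.isclose(k_ntk(0.1, 0.2, t), 4 * np.sin(1.2) * np.sin(1.4), atol=1e-4)
# The kernel changes along the parameter path: the model is not linear
assert not np.isclose(k_ntk(0.1, 0.2, t), k_ntk(0.1, 0.2, t + 1.0))
```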

The Quantum Path Kernel Framework
No extant quantum method is thus able to fully capture the deviations from gradient path linearity manifested by empirically optimal learners in the classical domain. Hence, in order to encompass the concepts of hierarchicality and feature learning in (implicitly kernel-based) quantum machine learning models, we here introduce for the first time in the quantum realm a key idea of Domingos [19], namely Path Kernelization.
Within this paradigm, for any machine learning model f_θ(x) whose parameters θ are learned from a set D = {(x_i, y_i)}_{i=1}^n by gradient descent via a differentiable loss function, it is possible to express the resulting (i.e. trained) classifier as:

f(x) = Σ_{i=1}^n a_i k_path(x, x_i) + b,

for suitable coefficients a_i, b determined by the training procedure, where

k_path(x, x′) = (1/T) ∫_0^T k_ntk(x, x′; γ(t)) dt    (13)

is the Path Kernel, i.e. the line integral of k_ntk over the multidimensional curve representing the evolution of the parameters θ = γ(t), t ∈ [0, T ], during training, with θ = γ(T). In general, chain-rule dependencies arising from the specifics of the architecture of the network will imply hierarchical dependencies among the parameters during learning. The result holds even for stochastic gradient descent optimization, in which case Equation 13 is a stochastic integral. However, it is not immediately clear that this path integration obeys Mercer's conditions; while it is generally true that a convex sum over Mercer kernels is itself a Mercer kernel, the path over which we are integrating is here dependent on the training objects. We therefore dedicate Appendix A to proving that the Path Kernel is effectively Mercer, and set out the pseudocode for its construction.
It is thus central to our argument to examine the parameter path γ and its morphological evolution. For linear models, assuming vanilla gradient descent training over a convex loss function L, the parameter path is described by the segment {(1 − t)θ_0 + tθ_f | t ∈ [0, 1]}, where θ_0 ∈ R^p are the parameters at their initialization, and θ_f ∈ R^p are the parameters at their convergence on the (ideally global) minimum of L. In such a case, it is immediately possible to check that the derivative ∇_θ f of the linear model is independent of θ, and thus that the NTK is constant. For nonlinear models, the loss function L may become non-convex and γ is not constrained to a linear trajectory. In this latter case, both ∇_θ f and the NTK will vary in time.
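For contrast with the nonlinear case, a model that is linear in its parameters has ∇_θ f(x; θ) = φ(x), independent of θ, so the NTK is identical at every point of the parameter path. A minimal sketch (ours, with hypothetical polynomial feature functions):

```python
import numpy as np

def phi(x):
    # fixed (non-effective) feature functions of a linear model
    return np.array([1.0, x, x * x])

def f(x, theta):
    # linear in theta: f(x; theta) = <theta, phi(x)>
    return theta @ phi(x)

def ntk(x1, x2, theta):
    # the gradient of f w.r.t. theta is phi(x), independent of theta,
    # so the NTK reduces to phi(x1) . phi(x2) (theta is unused)
    return phi(x1) @ phi(x2)

# The kernel is the same at initialization and at convergence: constant NTK
theta_0, theta_f = np.zeros(3), np.array([5.0, -2.0, 1.0])
assert np.isclose(ntk(0.5, 1.5, theta_0), ntk(0.5, 1.5, theta_f))
```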
In this work, we will not focus on the possible role of Path Kernels in approximating nonlinear models. Instead, we shall exploit the intrinsically hierarchical structure of the Path Kernel to implement a hybrid deep machine learning model within a quantum neural network setting. We depict the construction of this object in Figure 1. The parameter trajectory for a nonlinear model is described by a complex, non-straight curve. Each point of the parameter path θ_t = γ(t) may be used to define a new kernel representation for the training data, namely k_ntk(x, x′; θ_t). We can then define a sequence of kernels stacked in a hierarchical way (whose structure, in passing, resembles the layers of a deep neural network, though this observation is peripheral to the argument being made here). Thus, each new "layer" is a source of representation learning: the new representation (i.e. kernel matrix) is the result of an optimization process that further adapts the previous representation to the given data discrimination problem (which resembles, though is again not equivalent to, classifier boosting).
It thus becomes possible, via explicit substitution of the corresponding Quantum NTK previously defined, to construct a Quantum Path Kernel (QPK) as follows:

k_qntk(x, x′; θ) = ∇_θ f(x; θ) · ∇_θ f(x′; θ),    (14)

k_qpk(x, x′) = (1/T) ∫_0^T k_qntk(x, x′; γ(t)) dt,    (15)

k_qpk(x, x′) ≈ (1/N) Σ_{j=1}^{N} k_qntk(x, x′; θ_{t_j}),    (16)

where Equation 14 defines the QNTK analogously to its classical counterpart, and Equation 15 defines the QPK as its integral with respect to time. Equation 16 is the discretized version of the preceding equations, with θ_{t_j} the parameters at the j-th training step, corresponding to the actual implementation in a gradient descent-trained model. The resulting Quantum Path Kernel (QPK) is consequently both a quantized version of Domingos' Path Kernel and a generalization of the Quantum NTK, one that is implicitly capable of embodying the complex parametric interactions (such as transient parametric co-evolutions) that occur during learning in order to arrive at the final trained model, including those implicated in hierarchical feature learning.
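A minimal sketch of Equation 16 (ours, using an illustrative one-parameter predictor with analytic gradients rather than an actual QNN): record the parameter path during gradient descent and average the tangent kernels along it:

```python
import numpy as np

def f(x, theta):
    return np.cos(2 * (x + theta[0]))      # toy single-parameter predictor

def grad_f(x, theta):
    return np.array([-2 * np.sin(2 * (x + theta[0]))])   # analytic df/dtheta

def qntk(x1, x2, theta):
    # Equation 14: inner product of parameter gradients at theta
    return grad_f(x1, theta) @ grad_f(x2, theta)

# Vanilla gradient descent on an MSE loss, recording the path gamma(t)
X = np.array([0.2, 0.9, -0.4])
y = np.array([1.0, -1.0, 1.0])
theta, eta, path = np.array([0.1]), 0.05, []
for _ in range(200):
    path.append(theta.copy())
    resid = np.array([f(x, theta) for x in X]) - y
    g = sum(r * grad_f(x, theta) for r, x in zip(resid, X)) / len(X)
    theta = theta - eta * g

def qpk(x1, x2):
    # Equation 16: average the QNTK over the recorded parameter path
    return np.mean([qntk(x1, x2, t) for t in path])

assert np.isclose(qpk(0.2, 0.9), qpk(0.9, 0.2))   # symmetric, as a kernel must be
```

In a real pipeline, the resulting Gram matrix over the training set would then be fed to a kernel machine (e.g. an SVM with a precomputed kernel), as described in the experimental section.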

The Quantum Path Kernel as a generalization of Quantum Neural Tangent Kernel
In interpreting the Quantum Path Kernel as a generalization of the QNTK for models exhibiting nonlinear behavior, it may be seen that the QNTK is constant only when independent of θ, in which case:

k_qpk(x, x′) = (1/T) ∫_0^T k_qntk(x, x′; γ(t)) dt = k_qntk(x, x′; θ_0).

That is, the Quantum Path Kernel becomes identical to the Quantum Neural Tangent Kernel. However, as set out in Section 2.2, the particular structure of QNNs will, of itself, give rise to a nonlinear predictor. Thus, in principle, the QNTK would not be expected to be constant in output terms in the finite-width regime [11]. However, a close-to-constant behavior can be expected for quantum machine learning models whose training is lazy (i.e. lazy training induced via overparameterization of the QNN, such that the large number of parameters results in a simplified loss landscape [39,40], leading to rapid convergence to a global minimum).

Decorrelation in feature representation
The Quantum Path Kernel clearly exhibits a dependency on the training initialization: different initial parameter values, optimization algorithms or learning rates may lead to differing QPK matrices. In particular, the utilization of 'vanilla' gradient-descent optimization algorithms, with a fixed number of training epochs, may introduce subtle biases into the QPK. For example, if training were to converge rapidly, any contributions between the instant of convergence and the end of training would be effectively identical and oversampled: these contributions would hence outweigh the others, biasing the 'stack' of aggregated kernel matrices toward its final layer, as per Figure 1.
To avoid this, more sophisticated optimization algorithms can be considered. For example, the ADAM optimizer adaptively increases the learning rate in locally convex portions of the loss landscape, leading to fewer similar contributions within the path kernel. Furthermore, it is possible to perturb parameter paths via stochastic, noisy or non-gradient-descent-based optimization techniques in order to decorrelate subsequent contributions to the QPK. Having different, highly decorrelated contributions would allow us to interpret the QPK as an ensemble technique analogous to the bootstrap aggregation (bagging) often used for tuning the bias/variance trade-off in classical machine learning. (Multiple Kernel Learning [41] might also be used to optimally weight individual contributions over the kernel, at the expense of interpretability in path terms.)
Appendix A.3 discusses implementation details for the QPK and its tested variants. We therefore now turn to an examination of the test regime.

Experiments
In particular, the Gaussian XOR Mixture classification problem is an important benchmark for highlighting layer-wise learning capabilities of a model (or the lack of them), in that it intrinsically requires a two-layer solution in order to achieve Bayes optimal class separation.Theoretical evidence has shown that kernel methods, in particular those with random features, struggle to accurately classify XOR data vector mixtures [20].In Appendix B we further analyze the problem, reproducing the results of [20], and proposing an interpretation of the success of feature learning models in tackling the Gaussian XOR Mixture problem.
Our experimental workflow is pictured in Figure 2. Firstly, we generate the dataset for the above-described problem. Secondly, we train several QNNs to best fit the generated data. Thirdly, we use the training information to create the QNTK and QPK matrices; the latter are used to train a kernel machine (specifically a Support Vector Machine) to obtain the final classifications. Then, our analysis begins with a convergence study of the QNNs with an increasing number of layers, to highlight the effect of architectural parametrization in QNNs. Finally, we compare the performances of the QNTK and QPK approaches in terms of testing and training accuracy. The simulation details are given in Appendix C.

Experimental Setup
The ground-truth Gaussian XOR Mixture dataset is specified by: d, the dimensionality of the features; d′ ≤ d, the number of non-zero features representing the multidimensional Gaussian XOR Mixture; ε̄, the variance of the Gaussian noise; and n, the number of data points. It is composed as follows:

x = (x_1 + ε_1, ..., x_{d′} + ε_{d′}, 0, ..., 0) ∈ R^d,

where x_i ∼ {±1} and ε_i ∼ N(0, ε̄) for i = 1, ..., d′, and y = Π_{i=1}^{d′} x_i. Such a dataset is optimally classified via the oracle function

f*(x) = sign(Π_{i=1}^{d′} x_i).

We generate multiple datasets D_{d,d′,ε̄,n} with feature dimensionality ranging over d = 2, 3, ..., 10, noise ranging over ε̄ = 0.1, 0.2, ..., 1.0, the number of non-zero features fixed to d′ = 2, and the number of elements fixed at n = 32. Each dataset has then been randomly partitioned into a training set D_train and a testing set D_test.
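A possible numpy sketch of this generator (ours; we assume the remaining d − d′ coordinates are zero, and for simplicity pass the noise level to the sampler as a standard deviation):

```python
import numpy as np

def gaussian_xor_mixture(n, d, d_signal=2, noise=0.1, seed=0):
    """Sample the Gaussian XOR mixture: d_signal coordinates are +/-1 signs
    plus Gaussian noise, the remaining d - d_signal coordinates are zero;
    the label is the parity (product) of the noiseless signs."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n, d_signal))
    y = np.prod(signs, axis=1)
    X = np.zeros((n, d))
    X[:, :d_signal] = signs + rng.normal(0.0, noise, size=(n, d_signal))
    return X, y

X, y = gaussian_xor_mixture(n=32, d=4, noise=0.1)
# Oracle for d_signal = 2: sign of the product of the two signal coordinates.
# At low noise it recovers the labels almost perfectly.
y_oracle = np.sign(X[:, 0] * X[:, 1])
assert np.mean(y_oracle == y) >= 0.9
```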
Each dataset is processed by a distinct quantum neural network, each sharing the same structure: a data encoding that loads one feature per qubit, followed by a trainable ansatz whose depth is set by the hyperparameter L, representing the number of layers of the model. Finally, the observable is σ_z acting on a single qubit. This data encoding has been chosen for its simplicity: the encoding of one feature for each qubit results in a constant-depth circuit. The choice of the trainable ansatz, though, is particularly important: the underlying functional transformation has the potential to be affected by barren plateau issues if it is too expressive [8], for example when the parametric transformation is able to approximate any arbitrary unitary matrix. The expressibility of a quantum transformation can be examined using Lie-algebraic tools, as shown in [45]. Among the class of unitaries that are non-maximally expressive, we have selected a specific form that has empirically demonstrated favorable trainability, as detailed in [40, Fig. 7a]. The choice of the observable is also guided by the necessity of avoiding the barren plateau issue: according to [46], global observables are likely to exhibit vanishing gradients; we thus apply the simplest possible classifier observable, acting on a single qubit. The circuit is pictured in Figure 3. In our experiment, the observed qubit is the uppermost, although any other choice of qubit would result in a similar predictor due to the symmetric structure of the circuit.
Each dataset is processed with the above-described QNN employing a number of layers ranging from L = 1 to 20. According to [47], the QNNs should be initialized at θ = 0 to avoid further trainability issues. However, we do not need to adopt such an initialization strategy for the variational unitary, since the previous expedients were sufficient to allow successful training. Thus, the parameters θ_j are sampled from a standard normal distribution. Each QNN is trained using the stochastic gradient-descent algorithm ADAM for 1000 epochs with an initial (adaptive) learning rate η = 0.1. The loss function is either BCE or MSE and, for the sake of simplification, the batch size is equal to the total cardinality of the training set. In the experimental setup described above, we study, both epoch-wise and depth-wise, the effect induced by different initialization parameters on the convergence of the loss function during training.

Results
We evaluate the depthwise convergence characteristics of the respective f(x; θ) models in terms of the corresponding accuracies of the Quantum Path Kernel and Quantum NTK under SVM final classification. Of particular interest is evaluating the closeness of models to the lazy training regime, indicative of the model being near to linear. Lazy training, in classical machine learning, typically occurs for very wide neural networks, with the loss decreasing to zero exponentially rapidly while network parameters stay close to their initialization values throughout training. In the current context, this would correspond to the Quantum Path Kernel collapsing to the Quantum Neural Tangent Kernel, and we would anticipate convergent classification performances for the two approaches.
We therefore evaluate the training loss for each of the QNN models over the respective training epochs with an increasing number of QNN layers L = 1, ..., 20. This will be used to determine proximity to the lazy training regime (i.e. identifying whether the QNN converges exponentially fast to zero loss). We additionally plot the norm difference between the parameters during training and their initialization values. These will be used to determine the extent to which parameters vary from their initialization, indicative of the training richness of models in the non-lazy training regime.
We are also interested in determining the robustness of the classifiers to stochastic noise influences during training and their corresponding resilience to overfitting (or the extent to which benign overparameterization [39] effects exist), measured in terms of generalization performance. Therefore, the above evaluations are repeated for datasets additively noise-perturbed at a range of signal-to-noise ratios.
Finally, we are interested in comparing the generalization performance of our approach to that of the QNTK. For this, we evaluate the test accuracy scores of the QPK and QNTK against the oracle. Superior performance of the QPK in solving the Gaussian XOR Mixture problem will be taken to be indicative of a superior ability to replicate the layerwise feature-learning capability of classical multilayer networks.

Depthwise convergence characteristics
We quantify the movement of the parameters during training via the magnitude of the parameter vector offset from initialization, ∆(n) = ‖θ(n) − θ(0)‖, where θ(0) is the value of the parameters at their initialization, and θ(n) is their value at the n-th epoch.
It is evident that none of the models reach the interpolation threshold [48], i.e. the point at which the training data is fitted perfectly with zero training error. To fit the training dataset we would need at least 32 parameters (2 nonzero coordinates per point, over 16 points). However, we are not able to reach the interpolation threshold even in the deepest configuration, with a total of 40 parameters. This behaviour is expected given the choice of a parametrically-constrained U, in effect acting as a form of regularisation. As in the classical DNN case, an increasing number of parameters results in a decrease in the loss (Figures 4b, 4e, 4h), and in an increase in the proximity between the parameter vectors and their initialization (Figures 4c, 4f, 4i).
We can conclude that none of the QNN models exhibits evidence of lazy training. In particular, while models with a higher number of parameters do indeed converge more rapidly, the parameters nonetheless vary substantially from their initialization. This behaviour is even more noticeable in the smaller models, whose norm difference oscillates substantially prior to convergence. Such non-trivial training suggests that the QPK differs substantially from the QNTK in its training characteristics.

While the QPK and Quantum NTK models both perform similarly at low signal-to-noise ratios, it is particularly striking to observe the outperformance of the QPK over the Quantum NTK with increasing hierarchical depth at the highest signal-to-noise setting. Figure 6 indicates the training accuracy with depth at the point of convergence. It may be observed that the QPK exhibits lower loss than the Quantum NTK across the full signal-to-noise range, with the effect becoming more marked at higher noise levels (ultimately over-fitting relative to the noise-free oracle in panel c), consistent with the expectation that the QPK has lower bias than the Quantum NTK.

In sum, the results confirm the anticipated improvement in performance of the QPK over the QNTK in the Gaussian XOR Mixture setting.

Conclusion and Further Work
We have introduced the Quantum Path Kernel as a mechanism for incorporating key complex classical multilayer network learning behaviors, in particular hierarchical feature learning, within quantum neural networks, via an appropriately expressive kernelization of the training process. We evaluated our approach on the Gaussian XOR Mixture classification problem, a straightforward benchmark of multilayer learning capacity that requires a minimum two-layer solution in order to approach Bayes optimality. Experimental results indicate superior generalization performance relative to the Quantum NTK, an advantage which is especially pronounced in high-depth, low signal-to-noise settings.
We have shown theoretically that the Quantum Path Kernel converges to the Quantum NTK only in the lazy training regime, i.e. when the training loss decreases to zero exponentially fast while the model parameters stay close to their initializations throughout training. Such behaviour is classically seen in infinitely wide neural networks, whose behaviour is then close to that of a linear model.
Our experiments, by contrast, indicate that QNNs do not operate in the linear regime.
We have discussed, though do not evaluate in the current paper, the potential for using stochastic, noisy, or non-gradient-descent-based optimization techniques to artificially perturb the parameter paths within the QPK in order to induce more decorrelated feature representations. We furthermore propose, in future work, to extend the QPK approach via weighting of the individual kernel representations in a more principled way, for example via Multiple Kernel Learning. We have also referred in passing to the interpretation of the QPK as an ensemble method, owing to the averaging operation over its kernel matrices. This will be explored more fully in future investigations.
K_path(x, x'; γ) = (1/T) ∫₀ᵀ K_tang(x, x'; γ(t)) dt

is the Path Kernel, a parametric kernel function (this parameterization has been rendered explicit in the current formulation). Here γ : [0, T] → R^P is the parameter path as detailed in Section 3, with terminal parameter value γ(T) = w. The Neural Tangent Kernel can also be expressed as a parametric kernel. Equation 24 holds under the proviso that f is differentiable in w and trained via Gradient Descent (GD) on the given training dataset {(x_i, y*_i)}_{i=1}^{m} ⊆ R^D × R using the convex differentiable loss function L(w) = Σ_{i=1}^{m} ℓ(f(x_i), y*_i). Equation 24 differs from a linear model due to the explicit dependency of the weights a_i on the data x, and it remains a matter of discussion whether the path kernel in fact represents a more general model class than that of kernel machines (although it is clearly equivalent for infinitely small learning rates [50]). This debate need not concern us for present purposes, where the intent is to obtain a class of models capable of representing the network gradient trajectory in a manner expressible on current quantum computers.
As the Path Kernel is not widely deployed in practical machine learning, we detail here some of its properties. In A.1 we prove that the Path Kernel is a Mercer kernel. In A.2 we briefly comment on the proof of [19, Theorem 1]. In A.3 we demonstrate a numerical implementation of the Path Kernel.

A.1 Path Kernel is a Mercer Kernel
Given any γ, the function K_path(x, x') = K_path(x, x'; γ) is a positive definite, or Mercer, kernel on R^D. A Mercer kernel satisfies

Σ_{i=1}^{n} Σ_{j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0

for every sequence of elements x_1, ..., x_n ∈ R^D and constants c_1, ..., c_n ∈ R.
It is straightforward to demonstrate that this condition holds for the Path Kernel. Firstly, K_tang(x, x_i) = K_tang(x, x_i; w) is a positive definite function for any w, as a consequence of the positive definiteness of the Gram matrix of inner products in its parameter space R^P. Secondly, since both positive combinations and infinitesimal limits of combinations of positive definite kernels still satisfy the Mercer condition, the property immediately holds for the Path Kernel in both its discrete and continuous formulations.
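This argument is easy to verify numerically in the discrete formulation: each tangent-kernel Gram matrix has the form J Jᵀ and is hence positive semidefinite, and so is their average. A sketch with random matrices standing in for the per-sample gradient Jacobians:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, epochs = 8, 5, 10  # samples, parameters, training epochs (illustrative)

# G_t = J_t J_t^T is PSD by construction (a Gram matrix of gradient inner
# products); the discretized Path Kernel is the average of such matrices.
grams = []
for _ in range(epochs):
    J = rng.normal(size=(n, p))  # stand-in for per-sample gradients at epoch t
    grams.append(J @ J.T)
K_path = np.mean(grams, axis=0)

# A positive combination of PSD matrices is PSD: all eigenvalues >= 0
# up to floating-point tolerance.
eigvals = np.linalg.eigvalsh(K_path)
```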

A.2 Comment on Theorem 1 in Domingos' work
In this section we comment on [19, Theorem 1] in order to highlight some of its limitations. The dynamics of any predictor f(x; w) : R^D × R^P → R under training via gradient descent may be described by a first-order non-homogeneous differential equation:

dw/dt = −∇_w L(w(t)),

where L is the convex differentiable loss function. We can describe these predictor dynamics over training in terms of the Tangent Kernel:

df(x; w(t))/dt = −Σ_{i=1}^{m} (∂L/∂f(x_i)) K_tang(x, x_i; w(t)).

In the limit ε → 0, i.e. as discrete gradient descent with learning rate ε approaches this gradient flow, we obtain:

f(x; w(T)) = f(x; w(0)) − ∫₀ᵀ Σ_{i=1}^{m} (∂L/∂f(x_i)) K_tang(x, x_i; w(t)) dt.

Such a function cannot be straightforwardly represented as a linear model. However, by multiplying and dividing by the Path Kernel itself we obtain the kernel-machine form of Equation 24, at the cost of introducing a dependency of the weights on the data x. Various works have suggested that imposing stronger assumptions on training can remove this dependency. For example, the authors of [50] achieve this by requiring that the loss derivative is of constant sign during training.

A.3 Numerical calculation of the Path Kernel
We can calculate the value of the Path Kernel by approximating the integral with a direct sum,

K_path(x, x'; γ) ≈ (1/t) Σ_{j=0}^{t−1} K_tang(x, x'; γ(j)),

over the t recorded training epochs. The implementation details are reported in the following pseudo-code listings. In Figure 7 we indicate how to calculate the Neural Tangent Kernel of the predictor f once the parameter value w is fixed. In particular, the gradient can be calculated with the finite-difference method or, if the predictor is implemented with a Quantum Neural Network, with the parameter-shift rule.
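A compact NumPy rendering of these two procedures might read as follows; finite-difference gradients stand in for the parameter-shift rule, and the function names are ours rather than those of the listings:

```python
import numpy as np

def tangent_kernel(f, X, w, eps=1e-5):
    """Neural Tangent Kernel of predictor f at parameters w. Gradients are
    approximated by central finite differences; for a quantum circuit the
    parameter-shift rule can be substituted for the inner loop."""
    n, p = len(X), len(w)
    J = np.zeros((n, p))  # per-sample gradient (Jacobian) matrix
    for j in range(p):
        shift = np.zeros(p)
        shift[j] = eps
        for i, x in enumerate(X):
            J[i, j] = (f(x, w + shift) - f(x, w - shift)) / (2 * eps)
    return J @ J.T  # Gram matrix of gradient inner products

def path_kernel(f, X, path):
    """Path Kernel: the integral over the training trajectory approximated
    by a direct sum of tangent kernels along the recorded parameter path."""
    return np.mean([tangent_kernel(f, X, w) for w in path], axis=0)

# Illustrative usage with a toy predictor and a stand-in parameter path.
rng = np.random.default_rng(0)
f = lambda x, w: float(np.tanh(w @ x))
X = [rng.normal(size=3) for _ in range(5)]
path = [rng.normal(size=3) for _ in range(4)]
K = path_kernel(f, X, path)
```

In practice `path` would be the sequence of parameter vectors γ(0), ..., γ(t−1) logged during gradient-descent training, rather than random draws.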
The procedure for calculating the Path Kernel is shown in Figure 8.

In [20] the authors demonstrate that a two-layer neural network with only a small number of neurons can easily outperform kernel methods on the Gaussian Mixture classification problem, under the assumption that the number of training data points n → ∞ grows linearly with the dimensionality of the data d → ∞.
We modify Refinetti's experiment for the current purposes to show the same result in a more straightforward way.
We define the two-layer neural network as the function

f(x) = W_3 σ(W_2 σ(W_1 x)),

where σ is the elementwise activation and h is the number of hidden neurons per layer (here fixed to h = ⌊√d⌋). In our setting, we randomly initialize the weights W_1, W_2, W_3 by sampling the matrix elements i.i.d. from a Gaussian of zero mean and unit variance. The model is then trained using the gradient-descent-based algorithm ADAM for a maximum of 1000 epochs with learning rate 0.001.
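A sketch of this model in NumPy, assuming a tanh activation and a scalar output (the specific activation is an assumption here, as is the vector form of the output weights):

```python
import numpy as np

def init_network(d, seed=0):
    """W1, W2, W3 sampled i.i.d. from N(0, 1), with h = floor(sqrt(d))
    hidden neurons per layer and a scalar output."""
    rng = np.random.default_rng(seed)
    h = int(np.sqrt(d))
    W1 = rng.normal(size=(h, d))
    W2 = rng.normal(size=(h, h))
    W3 = rng.normal(size=h)  # output weights, scalar prediction
    return W1, W2, W3

def forward(x, W1, W2, W3, act=np.tanh):
    """Predictor f(x) = W3 . act(W2 . act(W1 . x)); tanh is an assumed
    stand-in for the activation of Equation 41."""
    return float(W3 @ act(W2 @ act(W1 @ x)))
```

Training with ADAM is omitted for brevity; any autodiff framework applied to `forward` reproduces the setting above.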

Figure 1: Computation of the Path Kernel. Bottom left: a typical parameter trajectory γ is depicted, representing parametric evolution during the training phase. Top left: as θ evolves, it gives rise to differing NTK matrices, corresponding to distinct representations of the data. Such a sequence of matrices thus gives rise to a hierarchical stack of representations in the feature-learning regime. Middle: as training approaches convergence, subsequent matrices become similar to each other, and thus their corresponding representations are correlated. Right: the Path Kernel constitutes the average over these representations.

Figure 2: Gaussian XOR Mixture classification experiment workflow.

Machine-learning non-linearities such as those underpinning feature learning in empirical DNNs can thus be feasibly implemented in a quantum setting via the QPK. It remains to demonstrate that this can yield superior generalization performance on plausible quantum devices. Our evaluation therefore considers the reference case of the Gaussian XOR Mixture classification problem [42, 43].

Figure 4 indicates the respective convergence behavior of the evaluated quantum machine learning models with an increasing number of layers. Column 1 shows illustrative samples from the training distribution with row-wise decrements in the signal-to-noise ratio, column 2 gives the corresponding loss curves during training, and column 3 indicates the corresponding change in the magnitude of the parameter-vector offset from initialization.

Figure 5 indicates the corresponding test accuracies, measuring how well the respective models generalize to unseen data.

Figure 7: Pseudo-code for the Neural Tangent Kernel formulation.

Figure 8: Pseudo-code for the Path Kernel formulation.

The Path Kernel procedure uses the Neural Tangent Kernel to calculate the individual contribution of each training epoch, and thereafter calculates the average kernel matrix pointwise. In Section 3.2 we discussed the potential significance of decorrelated features; here we propose a numerical implementation of the Effective Path Kernel. In contrast to the original Path Kernel, the Effective Path Kernel seeks to avoid biasing due to multiple similar kernel contributions. This is especially important if training has converged significantly earlier than the final training epoch: any contribution after convergence has the same Neural Tangent Kernel and will bias the average toward it. The procedure CreateEffectivePathKernel takes as input the predictor function f : R^d × R^p → R, the dataset {x_i ∈ R^d}_{i=0}^{n−1}, the parameter path γ ∈ R^{p×t} obtained during the gradient-descent-based training phase, and a correlation threshold C ∈ [0, 1]; by recursing over sub-intervals of the training path (CreateEffectivePathKernelRec), it outputs a real symmetric n × n matrix representing the Effective Path Kernel of f over the given dataset.
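A simplified, non-recursive sketch of this filtering idea: retain an epoch's tangent-kernel matrix only when its correlation with the last retained matrix falls below the threshold C. The greedy scan is our simplification of the recursive sub-interval splitting of CreateEffectivePathKernelRec:

```python
import numpy as np

def effective_path_kernel(grams, corr_threshold=0.99):
    """Average the tangent-kernel matrices along the path, skipping a matrix
    whenever its (flattened) Pearson correlation with the last retained one
    reaches corr_threshold -- avoiding bias from many near-identical
    post-convergence contributions. Greedy simplification of the recursive
    procedure in Figure 9."""
    kept = [grams[0]]
    for G in grams[1:]:
        corr = np.corrcoef(kept[-1].ravel(), G.ravel())[0, 1]
        if corr < corr_threshold:
            kept.append(G)
    return np.mean(kept, axis=0), len(kept)
```

After convergence all epochs contribute (near-)identical matrices with correlation ≈ 1, so only the first of each such run survives the filter.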

Figure 9: Pseudo-code for the Effective Path Kernel formulation.

Figure 10: Comparison of the performance of the Random Feature Kernel and (two-layer) neural networks on the 3D Gaussian XOR Mixture problem with an increasing number of features set to zero. Panels 10a, 10b, 10c, 10d, 10e, 10f have respectively 4, 8, 12, 16, 20, 24 features per point, the first three features being the only non-zero ones.

Figure 11: Values of the W_1 matrix for individual neural networks of the form of Equation 41 during training on the Gaussian XOR Mixture dataset D_{24,3,0.8,384}: 11a, 11b, 11c represent the coefficients at initialization, after 250 training epochs, and after 750 training epochs of training with ADAM at learning rate 0.001.