Evaluation of Parameterized Quantum Circuits with Cross-Resonance Pulse-Driven Entanglers

Variational Quantum Algorithms (VQAs) have emerged as a powerful class of algorithms that is highly suitable for noisy quantum devices, and investigating their design has therefore become key in quantum computing research. Previous works have shown that choosing an effective parameterized quantum circuit (PQC), or ansatz, for VQAs is crucial to their overall performance, especially on near-term devices. In this paper, we utilize pulse-level access to quantum machines and our understanding of their two-qubit interactions to optimize the design of two-qubit entanglers in a manner suitable for VQAs. Our analysis results show that pulse-optimized ansatze reduce state preparation times by more than half, maintain expressibility relative to standard PQCs, and are more trainable through local cost function analysis. Our algorithm performance results, executed on IBM Quantum hardware, show that in three cases our pulse-optimized PQC configurations outperform the base implementation and are more capable of solving MaxCut and Chemistry problems than a standard configuration.


I. INTRODUCTION
Looking at various limitations in current noisy quantum hardware, one might first think that the development of such systems at this stage relies heavily on quantum hardware engineers and experimental physicists. However, algorithm designers have successfully contributed to pushing limitations such as limited numbers of qubits, limited qubit connectivity, and coherence times by designing algorithms tailored for such systems. For example, the Variational Quantum Algorithm (VQA) employs a quantum-classical approach to counter current device limitations.
The general framework of VQA begins with identifying a problem-specific cost function. Next, a trainable Parameterized Quantum Circuit (PQC) or ansatz is used to evaluate this cost. This PQC is then trained in a hybrid quantum-classical loop that tries to minimize the cost. By pushing the parameter optimization load to the classical optimizer, VQAs are able to run short-depth circuits and hence are very suitable for current machines [1].
Two of the most prominent examples of VQAs are the Variational Quantum Eigensolver (VQE) [2] and the Quantum Approximate Optimization Algorithm (QAOA) [3]. With VQAs being one of the most promising candidates for demonstrating advantage, and with various companies and institutes releasing devices with 10s-100s of qubits, VQAs have become one of the most investigated topics in quantum computing research. They are established as major quantum workloads, with researchers proposing optimizations to their implementation through all layers of the quantum computing stack [1].
This work was supported in part by the QISE-NET NSF Fellowship DMR
In this work, we demonstrate how a deeper understanding of quantum device control can impact VQAs. We optimize VQAs by exploring the lowest level of quantum control, pulse-level access [4], [5], targeting the most integral part of the algorithm: its ansatz. There exist several ansatz architectures: ones that are problem-specific, and others that are problem-agnostic, such as the Hardware Efficient Ansatz (HEA) [6]. We focus mainly on HEA in this paper.
One major problem VQAs encounter is the occurrence of barren plateaus in PQC training. It has been proven that if an ansatz is sufficiently random, the gradient of the cost function vanishes exponentially with the number of qubits [7]. Therefore, the majority of studies on PQCs [7]-[18] focus mainly on trainability and optimization procedures, and less attention is given to their device-specific performance and optimization. More recently, hardware-oriented analysis and optimization of PQCs has been explored in [19]-[28]. A core design challenge in this direction is determining the degree to which hardware should influence the algorithm implementation without sacrificing performance [7]. Thus, our goal in this paper is to identify a suitable combination of algorithm and hardware metrics that can guide pulse-level VQA optimization approaches.
From the hardware side, we utilize Hamiltonian Tomography (HT) [29], an accurate Hamiltonian calibration technique, to characterize our pulse implementations. We utilize HT to benchmark and customize the cross resonance (CR) gate [29]-[31], the entangling gate used by IBM's superconducting backends. We then use the results obtained from HT to analyze our PQCs with respect to three algorithmic descriptors: expressibility [32], entanglement entropy, and trainability [7]-[9]. Our pulse-driven PQCs achieve a speedup of up to 2.9x, with an average of 2.51x, over a base PQC design. We demonstrate VQE performance for MaxCut and Chemistry benchmarks on IBM's 27-qubit machine ibmq_montreal, accessible through the IBM Quantum cloud service. Our algorithm performance results show that in at least three cases, the pulse-driven PQC configurations outperform the base PQC in trainability and solution quality.

A. Quantum Computing Basics
One of the essential differences between classical and quantum computing algorithms is that the latter are intrinsically probabilistic models. The quantum measurement operation collapses the "definite" state of a qubit in a two-dimensional Hilbert space to one of the two computational basis states. Quantum operations or gates are used to manipulate/modify information stored in qubits. A gate is defined by a unitary operation that can be considered as a rotation over the Bloch Sphere [33] and can either act on single or multiple qubits. The physical implementation of gates depends largely on the type of quantum hardware. For example, in IBM superconducting quantum computers, microwave voltage pulses are applied to qubits [34], [35] to implement the gates. The same principle is used to implement entanglement between qubits, which results in non-classical correlated effects [34].
Today's implementations of quantum workloads are constrained by limitations in current noisy quantum hardware. In the past decade, tremendous efforts have been made to improve the fidelity of quantum hardware, along with algorithms specifically targeting current and near-term machines. A major class of such algorithms is the variational quantum algorithm (VQA).

B. Variational Quantum Algorithms
A Variational Quantum Algorithm (VQA) is a hybrid scheme of computation that allocates tasks to both quantum and classical computing resources and coordinates the execution between the two through a tight feedback loop to achieve a larger computational goal. In contrast to quantum algorithms developed for the fault-tolerant era, VQAs are highly suitable for current noisy quantum hardware. This suitability stems from utilizing classical optimizers for parameter tuning, which helps keep the quantum circuit depth shallow, hence mitigating noise.
The algorithm's modular structure and suitability to current and near-term systems have led to its widespread use. In fact, exploring various aspects of VQAs is a key part of the research on quantum systems, and identifying the conditions under which this class of algorithms will succeed is still an open question [36]. VQAs have been applied to a wide variety of applications [1] such as quantum chemistry [6], [37]- [39], combinatorial optimization [40]- [42], and machine learning [43]- [47]. A complete discussion of VQAs can be found in the review paper by Cerezo et al. [1].
A prime example of VQAs is the Variational Quantum Eigensolver (VQE) [2] shown in Fig. 1. The trial wave function $|\psi(\vec{\theta})\rangle$ is generated by applying the PQC $U(\vec{\theta})$, which is expected to explore the Hilbert space efficiently. Once the trial state is prepared, the expectation value of the problem Hamiltonian $H$ is determined. The Hamiltonian first needs to be decomposed or "mapped" from its original form (e.g., fermionic modes) to spin (Pauli) operators in a way that preserves the commutation relations [48]. Once decomposed, $H$ can be represented as $H = \sum_i a_i P_i$, where each Pauli string $P_i$ is a tensor product of Pauli operators.
VQE utilizes classical optimization to find suitable parameters for the PQC, with the goal of minimizing the expectation value of $H$. The variational principle guarantees that the expectation value $\langle H \rangle$ is always greater than or equal to the minimum eigenvalue of the system (the ground state energy $E_0$). The classical optimizer is applied iteratively to update the PQC parameter set $\vec{\theta}$, and a quantum computer is used to estimate the Hamiltonian's expectation value for the current $\vec{\theta}$ from measurements. The algorithm is repeated until convergence or until an optimizer limit is reached. Various types of optimization procedures, such as gradient descent algorithms or direct search methods, can be used to update the circuit parameters [1].
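The loop described above can be illustrated with a minimal classical simulation. The sketch below is not the implementation used in this paper: the two-qubit Hamiltonian coefficients, the one-layer RY + CNOT ansatz, the initial guess, and the learning rate are all illustrative assumptions, and the "quantum" step is replaced by exact state-vector arithmetic in NumPy. Gradients are obtained with the parameter-shift rule, which holds for rotation gates of the form exp(-i θ P / 2).

```python
import numpy as np

# Pauli matrices
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Toy Hamiltonian H = sum_i a_i P_i (coefficients are illustrative assumptions)
H = 0.5 * np.kron(Z, I2) + 0.5 * np.kron(I2, Z) + 0.25 * np.kron(X, X)

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def ry(theta):
    """Single-qubit rotation exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def ansatz_state(theta):
    """One hardware-efficient layer: RY on each qubit, then a CNOT."""
    psi0 = np.zeros(4, dtype=complex)
    psi0[0] = 1.0
    return CNOT @ np.kron(ry(theta[0]), ry(theta[1])) @ psi0

def energy(theta):
    """Expectation value <psi(theta)| H |psi(theta)>."""
    psi = ansatz_state(theta)
    return float(np.real(np.vdot(psi, H @ psi)))

def gradient(theta):
    """Parameter-shift gradient, exact for RY rotations."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (energy(theta + shift) - energy(theta - shift))
    return grad

# Classical optimizer: plain gradient descent from an arbitrary initial guess
theta = np.array([2.0, 0.5])
for _ in range(300):
    theta = theta - 0.2 * gradient(theta)

e_vqe = energy(theta)
e_exact = float(np.linalg.eigvalsh(H)[0])  # ground state energy E_0
```

By the variational principle, `e_vqe` can never fall below `e_exact`; how closely it approaches it depends on whether the chosen ansatz contains the ground state and on the optimizer settings.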

C. Parameterized Quantum Circuits
A Parameterized Quantum Circuit (PQC) or ansatz is defined as a tunable unitary operation $U(\vec{\theta})$ that is applied to a quantum state $|\psi_0\rangle$, often initialized to $|0\rangle^{\otimes n}$ [32] or a problem-influenced initial state. This results in the quantum state

$$|\psi(\vec{\theta})\rangle = U(\vec{\theta})\,|\psi_0\rangle, \tag{1}$$

where $\vec{\theta}$ is a vector of a polynomial number of circuit parameters. These parameters can represent any tunable feature of a quantum operation, but they usually correspond to angles of rotation gates. A PQC can be further decomposed into a product of $L$ sequentially applied sub-unitaries [1], usually referred to as layers:

$$U(\vec{\theta}) = U_L(\vec{\theta}_L) \cdots U_2(\vec{\theta}_2)\,U_1(\vec{\theta}_1). \tag{2}$$

This modular nature of PQCs has recently been compared to classical neural networks, with the parameters of the PQC being analogous to the weights and biases of the network [32]. Similar to the broad spectrum of neural network architectures, PQC designs can vary widely in their design goals and performance. Nonetheless, they can be classified into two main types: a problem-specific approach that utilizes knowledge of the problem to tailor the PQC architecture [19], [49], [50], and a problem-agnostic or hardware-efficient design [6] that focuses more on the suitability of the design to hardware [1]. This paper focuses mainly on the latter type, hardware-efficient PQCs, and expands on various aspects of their design.
Hardware-efficient PQCs generally aim at reducing both gate count and circuit depth. A single layer in this approach is usually composed of single-qubit operations followed by entangling two-qubit operations based on the physical connections of the hardware. A multi-layer PQC with this approach has been shown to be more suitable to current noisy machines compared to unitary coupled-cluster ansatze [6]. Fig. 2 shows examples of layer designs following this strategy. Additionally, other device-specific information such as gate decomposition, physical connections between qubits, crosstalk, and other noise characteristics can also influence the design choices for hardware-efficient PQCs.
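As a rough illustration of the layer structure described above, the following sketch builds the unitary of a multi-layer hardware-efficient PQC in NumPy. The choice of RY as the single-qubit rotation, the linear (nearest-neighbor) entangler chain, and the helper names `layer_unitary` and `pqc_unitary` are assumptions made for illustration; the `entangler` argument accepts any 4x4 two-qubit unitary, so a custom entangling gate could be substituted for the CNOT.

```python
import numpy as np

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def ry(theta):
    """Single-qubit rotation exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def kron_all(mats):
    """Tensor product of a list of matrices, first matrix = qubit 0."""
    out = np.eye(1)
    for m in mats:
        out = np.kron(out, m)
    return out

def layer_unitary(thetas, entangler=CNOT):
    """One layer: RY on every qubit, then entanglers on neighbors (q, q+1)."""
    n = len(thetas)
    u = kron_all([ry(t) for t in thetas])
    for q in range(n - 1):
        gate = kron_all([np.eye(2 ** q), entangler, np.eye(2 ** (n - q - 2))])
        u = gate @ u
    return u

def pqc_unitary(theta_layers, entangler=CNOT):
    """Product of L layers; theta_layers has shape (L, n)."""
    n = theta_layers.shape[1]
    u = np.eye(2 ** n)
    for thetas in theta_layers:
        u = layer_unitary(thetas, entangler) @ u
    return u

# Example: 3 layers on 4 qubits with random rotation angles
rng = np.random.default_rng(0)
U_example = pqc_unitary(rng.uniform(0, 2 * np.pi, size=(3, 4)))
```

Keeping the entangler as a parameter reflects the design principle of this paper: the layer structure stays fixed while the two-qubit gate implementation changes.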
Besides their structure, classifying and understanding the usefulness of different PQC designs is necessary for better VQA design. The next section (II-D) expands further on the topic.

D. Expressibility, Trainability, and Entanglement
With the wide range of PQC architectures, a fundamental question is whether a circuit can adequately prepare the target quantum state. In this regard, researchers have proposed different metrics to estimate the quality of an ansatz [32], [43], [47], [51]-[53]. In this section, we describe the three metrics we use in this paper to assess a PQC: expressibility, trainability, and entanglement.
1) Expressibility: Proposed by Sim et al. [32], Expressibility (Expr) is defined as a PQC's ability to produce quantum states that represent the Hilbert space well. The general idea is to compare the distribution of states obtained from a PQC's $U(\vec{\theta})$ to the maximally expressive uniform (Haar) random states. By sampling pairs of parameter values and their associated quantum states (e.g., using $U(\vec{\theta}_1)$ and $U(\vec{\theta}_2)$), we can compute the probability distribution of the quantum state fidelities $P_{PQC}(F; \vec{\theta})$ [32]. Expr is then estimated using the Kullback-Leibler divergence $D_{KL}$ [54] as follows:

$$\mathrm{Expr} = D_{KL}\big(P_{PQC}(F; \vec{\theta}) \,\|\, P_{Haar}(F)\big), \tag{3}$$

where $P_{Haar}(F)$ is the probability distribution of fidelities for Haar random states. Please refer to [32] for more details on this metric. In short, a smaller Expr value for a PQC indicates a closer approximation to random states and, hence, a more expressible circuit.
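The sampling procedure behind Expr can be sketched in a few lines. The example below is a simplified illustration, not the pipeline used in this paper: it uses two single-qubit circuits in the style of the examples in [32], a histogram with an assumed 75 bins, and the known Haar fidelity distribution $P_{Haar}(F) = (N-1)(1-F)^{N-2}$ for Hilbert-space dimension $N$, integrated analytically over each bin.

```python
import numpy as np

HAD = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
ZERO = np.array([1, 0], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def circuit_a(theta):
    """H followed by RZ: only reaches the equator of the Bloch sphere."""
    return rz(theta[0]) @ HAD @ ZERO

def circuit_b(theta):
    """H, RZ, then RX: covers the Bloch sphere much more evenly."""
    return rx(theta[1]) @ rz(theta[0]) @ HAD @ ZERO

def expressibility(circuit, n_params, dim, n_pairs=5000, n_bins=75, seed=0):
    """Histogram estimate of D_KL(P_PQC(F) || P_Haar(F))."""
    rng = np.random.default_rng(seed)
    fids = np.empty(n_pairs)
    for k in range(n_pairs):
        a = circuit(rng.uniform(0, 2 * np.pi, n_params))
        b = circuit(rng.uniform(0, 2 * np.pi, n_params))
        fids[k] = abs(np.vdot(a, b)) ** 2
    fids = np.clip(fids, 0.0, 1.0)  # guard against round-off above 1
    edges = np.linspace(0, 1, n_bins + 1)
    counts, _ = np.histogram(fids, bins=edges)
    p = counts / n_pairs
    # Haar probability of each bin: integral of (N-1)(1-F)^(N-2) over the bin
    q = (1 - edges[:-1]) ** (dim - 1) - (1 - edges[1:]) ** (dim - 1)
    mask = p > 0  # empty bins contribute zero to the divergence
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

expr_a = expressibility(circuit_a, n_params=1, dim=2)
expr_b = expressibility(circuit_b, n_params=2, dim=2)
```

With these settings `expr_b` comes out smaller than `expr_a`, i.e. the circuit with the extra rotation is the more expressible one. The estimate depends on the number of samples and bins, so values are comparable only when those settings are shared.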
2) Trainability: A more expressive ansatz does not necessarily lead to better VQA performance. It is also essential to characterize the properties of the VQA's optimization landscape and employ efficient training routines to guarantee performance. Perfectly expressive ansatze are actually proven to have flatter optimization landscapes and thus are less trainable [53]. The first work to investigate the trainability of PQCs was by McClean et al. [7]. Their work proved that a wide variety of PQCs, particularly hardware-efficient ones, suffer from gradients that vanish exponentially in the number of qubits, a phenomenon known as barren plateaus. This observation has been further expanded by Cerezo et al. [8] to indicate that the occurrence of barren plateaus is cost-function-dependent for shallow ansatze. In recent years, more works have shown that other factors can also impact barren plateaus, such as noise [28] and entanglement [9], [10].
A general definition of a cost function is

$$C(\vec{\theta}) = \langle\psi|\,U^{\dagger}(\vec{\theta})\,O(\omega)\,U(\vec{\theta})\,|\psi\rangle, \tag{4}$$

where $U(\vec{\theta})$ is the ansatz unitary acting on state $|\psi\rangle$, and $O(\omega) = \sum_i \omega_i \hat{O}_i$ is an observable or Hamiltonian that acts nontrivially on a subset of the circuit qubits (local) or on all circuit qubits (global). The classical learning algorithm minimizes $C$ by updating each parameter $\theta_i$ using the partial derivative $\partial C / \partial \theta_i$, which represents the contribution to the gradient from a change in $\theta_i$. In Section IV, we utilize cost-function-dependent barren plateau analysis [8] to evaluate our PQCs' trainability. We use $\mathrm{Var}[\partial_i C]$, the variance of the partial derivative of the cost function $C$ with respect to $\theta_i$ over $n$ sampled circuits. The magnitude of the variance quantifies the partial derivative's concentration around zero [8]. Thus, smaller values indicate less trainability.
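A variance estimate of this kind can be sketched as follows. This is an illustrative toy rather than the analysis performed in Section IV: the ansatz (layers of RY rotations followed by a linear chain of CZ gates), the local observable $Z$ on qubit 0, the sample count, and all function names are assumptions. Partial derivatives are computed with the parameter-shift rule.

```python
import numpy as np

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, q, n):
    """Apply a single-qubit gate to qubit q of an n-qubit state vector."""
    psi = np.tensordot(gate, state.reshape([2] * n), axes=([1], [q]))
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cz(state, q, n):
    """Apply CZ on qubits (q, q+1): flip the sign where both bits are 1."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q], idx[q + 1] = 1, 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)

def cost(theta, n, layers):
    """C(theta) = <0...0| U^dag Z_0 U |0...0> for the layered ansatz."""
    state = np.zeros(2 ** n)
    state[0] = 1.0
    angles = theta.reshape(layers, n)
    for l in range(layers):
        for q in range(n):
            state = apply_1q(state, ry(angles[l, q]), q, n)
        for q in range(n - 1):
            state = apply_cz(state, q, n)
    probs = np.abs(state.reshape([2] * n)) ** 2
    p0 = probs[0].sum()          # probability that qubit 0 reads 0
    return 2.0 * p0 - 1.0        # <Z_0> = p(0) - p(1)

def partial_derivative(theta, i, n, layers):
    """Parameter-shift rule for the i-th RY angle."""
    shift = np.zeros_like(theta)
    shift[i] = np.pi / 2
    return 0.5 * (cost(theta + shift, n, layers) - cost(theta - shift, n, layers))

def grad_variance(n, layers, i=0, n_samples=200, seed=0):
    """Var[d_i C] over uniformly sampled parameter vectors."""
    rng = np.random.default_rng(seed)
    grads = [partial_derivative(rng.uniform(0, 2 * np.pi, n * layers), i, n, layers)
             for _ in range(n_samples)]
    return float(np.var(grads))

var_example = grad_variance(n=4, layers=3)
```

Repeating `grad_variance` for increasing `n` and plotting the result on a logarithmic axis is the usual way to look for the exponential decay associated with barren plateaus.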
3) Entanglement: Entanglement measurement quantifies the amount of entanglement contained in a quantum state. It is first essential to realize the advantages of generating highly-entangled states for VQAs. Prior works have shown that highly-entangled PQCs are potentially more capable of capturing non-trivial correlations in the quantum data and efficiently represent the solution space for tasks like the ground state preparation or data classification [6], [32], [44], [55], [56]. On the other hand, excessive entanglement can possibly lead to concentration of measure, making a PQC too random and less trainable [10]. In recent works, entanglement has been investigated as a primary source of barren plateaus [9], [10]. With such tradeoffs between entanglement and trainability, optimization problems vary in their utilization of entanglement for performance [11], [57]- [59]. This ultimately leads to the importance of developing a comprehensive understanding of the role of entanglement in VQAs.
There exist several methods for quantifying entanglement [32], [60]-[62]. In this paper, we use the bipartite entanglement entropy, which is the Von Neumann entropy of the reduced density matrix of either subsystem, to estimate the spread $S$ of circuit entanglement:

$$S = -\mathrm{Tr}\left[\rho_\alpha \log \rho_\alpha\right], \tag{5}$$

where $\rho_\alpha$ is the reduced density matrix of $(n-1)/2$ connected qubits containing as many cost function qubits as possible [9].
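For a pure state, the bipartite entanglement entropy can be computed without explicitly forming the reduced density matrix: the squared singular values of the reshaped state vector are exactly the eigenvalues of $\rho_\alpha$. The sketch below assumes a pure state vector, a partition into the first `n_left` qubits and the rest, and the natural logarithm; the function name is hypothetical.

```python
import numpy as np

def entanglement_entropy(state, n_left):
    """Von Neumann entropy of the first n_left qubits of a pure state."""
    n = state.size.bit_length() - 1            # number of qubits
    m = state.reshape(2 ** n_left, 2 ** (n - n_left))
    s = np.linalg.svd(m, compute_uv=False)     # Schmidt coefficients
    lam = s ** 2                               # eigenvalues of rho_alpha
    lam = lam[lam > 1e-12]                     # drop numerical zeros
    return float(-np.sum(lam * np.log(lam)))

# Two reference states on 2 qubits
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
product = np.array([1, 0, 0, 0], dtype=complex)

s_bell = entanglement_entropy(bell, n_left=1)
s_product = entanglement_entropy(product, n_left=1)
```

A maximally entangled pair gives ln 2, while a product state gives zero. For a 9-qubit state with a 4-5 partition, the same function would be called with `n_left=4`.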
In Section IV, we analyze the entanglement of our PQCs by observing both S and HT results and tie this to their trainability with respect to cost function size and number of layers.

E. Pulse-Level Control of Quantum Systems
The lowest level of control of a quantum computer is through pulses. Such a level of control can be enabled by a classical microprocessor with an embedded pulse digital-to-analog converter [4]. A pulse is defined as a time-series of complex-valued analog amplitudes, each called a sample, applied to qubits on any type of input channel at each system cycle time dt [4]. Typically, the timing of scheduled operations is inconsequential in the standard quantum circuit model as long as the order of non-commuting operators is preserved [4], [63]. However, timing considerations become critical once we move to the pulse model on quantum hardware such as transmons [4], [5].
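To make the notion of a pulse as a list of complex samples concrete, the sketch below generates a flat-top envelope with Gaussian edges in plain NumPy. It is a simplified stand-in and does not reproduce the exact waveform definition of any pulse library; the cycle time `DT_NS` (about 0.222 ns), the amplitude, and the shape parameters are assumed values chosen only for illustration.

```python
import numpy as np

DT_NS = 2.0 / 9.0  # assumed system cycle time dt in nanoseconds

def flat_top_gaussian(duration, amp, sigma, width, angle=0.0):
    """Complex samples of a flat-top pulse with Gaussian rise and fall.

    duration, sigma, and width are given in units of dt (samples).
    """
    t = np.arange(duration)
    rise_end = (duration - width) / 2
    fall_start = rise_end + width
    env = np.ones(duration)
    rising = t < rise_end
    falling = t >= fall_start
    env[rising] = np.exp(-0.5 * ((t[rising] - rise_end) / sigma) ** 2)
    env[falling] = np.exp(-0.5 * ((t[falling] - fall_start) / sigma) ** 2)
    return amp * np.exp(1j * angle) * env

# A 150 ns tone expressed as a number of samples at the assumed dt
n_samples = int(round(150 / DT_NS))
samples = flat_top_gaussian(duration=n_samples, amp=0.3, sigma=32, width=500)
```

Each entry of `samples` is one complex amplitude played during one cycle of length dt, so fixing a gate duration in nanoseconds fixes the length of this array.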
Quantum computers are routinely calibrated to account for drifts in their state by updating their experimental parameter settings [34], [64], [65]. Such calibrations are key to obtaining the translations from gates to pulses or pulse schedules. For example, current IBM quantum backends implement the X gate as an almost-Gaussian DRAG pulse [66] (with a carrier frequency equal to that of the ground-to-excited state transition), while Z and $R_Z(\theta)$ gates are implemented purely in software [64]. Figures 3(a) and (b) show a quantum circuit and its corresponding pulse schedule. We demonstrate pulse control of quantum systems using IBM's framework for pulse-level access, Qiskit Pulse [4], [5]. Table I shows a summary of the different pulse channels used in IBM machines and their descriptions.

A. Dissecting the Cross Resonance Gate
The Cross Resonance (CR) gate is an all-microwave entangling gate, obviating the need for tunable qubits or couplers [29]- [31], [67]. This feature makes for better scaling to larger numbers of qubits by minimizing the overhead of control electronics and control wires [68], [69]. Thus, the CR gate emerged as a promising two-qubit entangling gate in quantum architectures based on planar, fixed-frequency superconducting transmons [29], [69]. Transmon qubits are designed to have reduced sensitivity to charge noise while maintaining sufficient anharmonicity, allowing the lowest two levels to be addressed as a qubit [70].
For a pair of coupled fixed-frequency transmons, a CR interaction is realized by driving the control transmon at the frequency of the target transmon. This interaction produces an effective Hamiltonian of the form [4]

$$H_{CR} = \frac{\omega_{IX}\,IX}{2} + \frac{\omega_{IY}\,IY}{2} + \frac{\omega_{IZ}\,IZ}{2} + \frac{\omega_{ZI}\,ZI}{2} + \frac{\omega_{ZX}\,ZX}{2} + \frac{\omega_{ZY}\,ZY}{2} + \frac{\omega_{ZZ}\,ZZ}{2}, \tag{6}$$

where each term represents Pauli operators applied to both control and target, with the control being first and the target being second in the tensor product reading from left to right. For example, the term $\omega_{IZ}\,IZ/2$ corresponds to Pauli-I and Z operators applied to the driven control and target qubits, respectively, generating an uncontrolled (because of the Pauli-I on the control) Z-rotation on the target transmon of strength $\omega_{IZ}$.
If isolated, the ZX conditional rotation term in (6), with rotation angle $\pi/2$, would result in the unitary $U_{ZX}(\pi/2) = e^{-i\frac{\pi}{4}ZX}$, which is locally equivalent to the standard CNOT (i.e., equivalent up to single-qubit gates). The CNOT gate is sufficient for universal quantum computation when combined with arbitrary single-qubit operations [4], [69]. However, the other terms in the effective Hamiltonian (6) are coherent error terms and "unwanted" for generating the unitary equivalent to CNOT. Developing strategies to characterize and control these terms in order to create high-fidelity entangling gates is still ongoing research [4], [69]. Figures 3(a) and 3(b) show IBM Quantum's standard echoed technique used to suppress these terms and implement the CNOT gate. This pulse sequence comprises three main components:
• Echoed CR Pulses: two CR pulses with opposite phases on the control channel (u) and two single-qubit pulses on the drive channel, one before each CR pulse. This sequence (grouped with the echoed CR pulses) refocuses/echoes away unwanted terms (mainly IX and ZI) in the interaction Hamiltonian [4], [71].
• Compensation Pulses: also known as target rotaries, these are used to address and reduce errors identified in the echoed CR Hamiltonian arising from driven ZZ interactions and classical crosstalk (IY). Additionally, they suppress unwanted entanglements with target nearest-neighbors or spectators due to static coupling, without increasing the CR pulse length [69].
• Single Qubit Pulses: additional single-qubit pulses on the control and target are used to build a CNOT from the generated $U_{ZX}(\pi/2)$ unitary. This sequence can be "reversed" in that the physical CR control qubit may be a logical CNOT target qubit with the appropriate addition of single-qubit gates.
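The local equivalence between $U_{ZX}(\pi/2)$ and CNOT mentioned above can be checked numerically. The sketch below uses the control-first tensor ordering of this section and one possible choice of single-qubit corrections (an $S^{\dagger}$ on the control and an $R_X(-\pi/2)$ on the target); other equivalent decompositions exist, and this one is not claimed to be the sequence used on hardware.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ZX = np.kron(Z, X)  # control (Z) first, target (X) second

def u_zx(theta):
    """exp(-i * theta/2 * ZX), using (ZX)^2 = I."""
    return np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * ZX

def rx(theta):
    """Single-qubit rotation exp(-i * theta * X / 2)."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X

S_DAG = np.diag([1, -1j])  # inverse phase gate on the control

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# Single-qubit corrections applied after the ZX(pi/2) rotation
local_gates = np.kron(S_DAG, rx(-np.pi / 2))
cnot_from_zx = local_gates @ u_zx(np.pi / 2)

is_cnot = np.allclose(cnot_from_zx, CNOT)
```

The check works because, for control state 0, the target sees $R_X(\pi/2)$, which the correction undoes, while for control state 1 the target sees $R_X(-\pi/2)$ twice, which equals $iX$ and is turned into $X$ by the phase on the control.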
In the next section (Section III-B), we use our understanding of the CR Hamiltonian and pulse sequence to implement a pulse-efficient entanglement gate suitable to VQAs.

TABLE I (fragment): pulse channels used in IBM machines.
[...]: Transmit channel associated with an arbitrary interaction between two specific qubits j and k.
AcquireChannel a_i: Connected to the readout component of qubit i to digitize and acquire measurement data.

Table II shows Qiskit's gate decomposition of some of the available two-qubit gates on IBM Quantum devices. We notice that CNOT is the base of these decompositions since it is a basis gate for IBM backends. Our main design principle is that PQCs do not necessarily need such standard two-qubit gates; the goal is simply to use two-qubit entangling gates in general [6]. Therefore, we utilize pulse-level access to quantum systems to implement two custom entanglement gates, CR(π/4) and CR(150ns), whose pulse schedules are shown in Figures 3(c) and 3(d), respectively. CR(π/4)'s implementation is based on the standard CNOT shown in Fig. 3(b), where each pulse's amplitude and duration are carefully calibrated for each qubit. For example, the CNOT's CR tone directions intentionally avoid qubit-qubit collisions, i.e., accidentally driving terms with control-spectators [72]. Moreover, its pulses are calibrated to the largest amplitude without noticeable leakage in order to reduce the duration for which the qubit is subject to decoherence [64], [73]. Thus, as its name suggests, CR(π/4) uses the first CR tone in the CNOT's echoed cross-resonance sequence to achieve a ZX rotation of π/4 (along with uncancelled single-qubit rotations). We chose a fixed duration of 150ns for our second entangling gate, CR(150ns) (similar to the gates used in [6]), with the goal of minimizing the effect of decoherence without compromising the optimization accuracy. The fixed duration is the average duration of the CR(π/4) gate across different backend qubit pairs. For the rest of the pulse parameters in both CR gates, we chose to utilize the daily calibrations performed on IBM quantum devices.
In this paper, we demonstrate how straightforward customized pulse gates, along with PQC analysis for parameters such as trainability, can lead to better algorithm performance. In future work, we will explore optimizing the CR pulse parameters with different ansatze and study their correlation with trainability more closely.

B. Custom Entanglement Gate Implementations
We generally modify the standard CNOT implementation in the following way:
• We use only the first CR tone of the echoed CR sequence. First, [...] can significantly enhance performance [74]. Second, the echoed CR sequence (as mentioned in Section III-A) cancels uncontrolled single-qubit rotation terms such as IX and ZI in the CR Hamiltonian. We make the case that such terms are unwanted when the target unitary is CNOT, but not in our case.
• We removed the target rotary pulses (target qubit pulses). As mentioned in Section III-A, these rotary echoes are used to suppress the driven ZZ interaction and entanglements with target spectators [75]. We argue that entanglements with the target's spectators are not necessarily detrimental to VQAs, and this should be further explored.
In Section IV, we use Hamiltonian Tomography (HT) to extract unitary representations of our custom gates, assuming a block-diagonal cross resonance Hamiltonian.

C. Characterizing CR-based Gates
Characterizing the pulse gates is essential to understanding their components and performance. For this purpose, we used Hamiltonian Tomography (HT), an accurate Hamiltonian calibration technique developed by Sheldon et al. [29], to estimate the coefficients (strengths) $\omega$ of the CR Hamiltonian terms in (6). The accompanying figures show the gate and pulse sequence for HT. The experiment is performed by applying a CR tone (or the echoed CR tones for the standard implementation) with different durations. The target qubit is then measured by projecting onto the X, Y, and Z bases, with the control qubit in either the 0 or the 1 state. The measurements (from the resulting six sets of experiments) are of the expectation values of each term in the Hamiltonian. It is important to note that HT is not sensitive to the ZI term arising from a Stark shift (an off-resonant drive that dresses the qubit frequency) because the control qubit is in an eigenstate of the Z operator. Thus, an additional Ramsey experiment on the control qubit was performed to estimate the strength of this term. The estimated CR Hamiltonian terms can also be used to extract the unitary representation of custom gate implementations according to Schrödinger's equation:

$$U = e^{-i H_{CR} t}, \tag{7}$$

where $H_{CR}$ is our CR Hamiltonian, and $t$ represents the CR tone's duration. This unitary can then be used to further analyze the PQCs for algorithmic descriptors such as expressibility, trainability, and entanglement.
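The step from estimated Hamiltonian coefficients to a gate unitary can be sketched as below. The coefficient values in `omega_example` are hypothetical placeholders (in rad/µs) and are not measured data; the sketch assumes a Hamiltonian built as a sum of terms $\omega_P P / 2$ over two-qubit Pauli labels with the control qubit first, and units with $\hbar = 1$. The matrix exponential is evaluated through an eigendecomposition, which is valid because the Hamiltonian is Hermitian.

```python
import numpy as np

PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def cr_hamiltonian(omega):
    """H = sum_P omega_P * P / 2 for labels such as 'ZX' (control first)."""
    h = np.zeros((4, 4), dtype=complex)
    for label, w in omega.items():
        h += 0.5 * w * np.kron(PAULI[label[0]], PAULI[label[1]])
    return h

def evolve(h, t):
    """U = exp(-i * H * t) via the eigendecomposition of Hermitian H."""
    evals, evecs = np.linalg.eigh(h)
    return evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

# Hypothetical coefficients in rad/us (illustration only, not measured values)
omega_example = {
    "ZX": 2 * np.pi * 1.2,
    "IX": 2 * np.pi * 0.4,
    "ZI": 2 * np.pi * 2.0,
    "IZ": 2 * np.pi * 0.05,
}

t_us = 0.150  # a 150 ns CR tone expressed in microseconds
U_cr = evolve(cr_hamiltonian(omega_example), t_us)
```

A 4x4 matrix obtained this way can stand in for the entangler in a state-vector simulation of the PQC, which is how the characterization feeds into the expressibility, entanglement, and trainability analyses.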

A. Experimental Setup
We conducted our experiments on ibmq_montreal, a 27-qubit backend available through IBM Quantum Services. The backend has average T1 and T2 times of 84.24 µs and 85.32 µs, respectively, and an average CNOT error rate of 4.703e−2. Note that these values fluctuate and are monitored through daily calibrations available through Qiskit. We utilize Qiskit Runtime [76], a programming model that allows for faster execution of quantum workloads on the cloud, to run our algorithm benchmarks. Fig. 5(b) shows one layer of a base n-qubit PQC design. We will refer to PQCs constructed using this layer as base PQCs. Fig. 5(c) shows a single layer design utilizing our CR-based entanglers. We refer to PQCs utilizing these gates as Customized Pulse PQCs or CP. We refer to PQCs that use the CR(π/4) gate as CP ang, and PQCs utilizing CR(150ns) as CP dur. We used a linear entanglement arrangement in both circuits, which applies two-qubit gates to neighboring qubits only. The mapping of the circuits on ibmq_montreal's topology is shown in Fig. 5(a).
In this section, we analyze the three PQC designs' circuit duration, expressibility, trainability, and entanglement. Next, we evaluate their performance for a set of chemistry and MaxCut problems.

B. Circuit Duration
We analyzed the three PQC configurations for total gate count, circuit depth, and duration. Since the three configurations share the same structure, they have identical gate counts and circuit depth (not shown). This was expected as our method only changes the pulse implementation of the entanglers and does not change the circuits' structure.
To measure the duration of base, we compiled the circuit with the three levels of optimization available in Qiskit and picked the lowest duration. We left the measurement operation out of our speed calculations. As we mentioned in Section-III-B, optimizations leading to faster quantum circuits are crucial as we are still competing with limited qubit coherence times. It is also critical as it gives more freedom to perform measurement pulses.
We observe a speedup of up to 2.28× in the execution time of CP ang compared to base, with an average speedup of 2.2×. CP dur, on the other hand, achieves a maximum speedup of 2.9× over base, with an average of 2.8×. This is a direct result of using faster two-qubit entangling gates. As shown in Figures 3(c) and 3(d), the custom gates are at least 2× faster than the standard CNOT. This reduction in duration is essentially equivalent to reducing the number of layers by half, as the echo pulse and subsequent CR(π/4) are removed. As CR(150ns) uses a fixed duration compared to CR(π/4)'s calibrated duration, CP dur achieves an average speedup of 1.28× over CP ang.

C. CR Gates Characterization
Table III shows the characterization of the CR tones using HT. As expected, the CR tones have a higher strength of the ZX entangling term compared to other Hamiltonian terms (except for ZI). As our CR implementations do not use an echoed pulse implementation, we see a high frequency for the ZI term. However, this does not affect the algorithm performance, as such coherent terms can be dealt with by the optimizer. Notably, the CR tones also exhibit a high frequency of the IX term as a result of not using the echoed CR sequence.
As mentioned in Section III-C, this characterization was used to obtain the unitary representation of our pulse gates according to (7) by substituting t with the appropriate pulse duration (i.e., 150ns for CR(150ns) and the calibrated duration for CR(π/4)). The unitaries were then used to analyze the PQCs in the following sections for expressibility, entanglement, and trainability.

D. Expressibility
We used state-vector simulation to perform the necessary sampling for the expressibility calculation, as detailed in Section II-D. As we mentioned in Section IV-C, the unitary representations of the CR-based gates were used in the sampling of the CP PQCs. Fig. 6 shows the expressibility of the three PQC configurations with varying numbers of qubits and layers. As mentioned in the background, a lower value of the KL divergence measure means better expressibility for the circuit. For PQCs with more than one layer (L > 1), we see that the base configuration is more expressive than CP. The inset of Fig. 6 shows the average increase in expressibility of base over CP as a function of L. The advantage of base is largest for shallow circuits, with a maximum of 24% over CP PQCs at L = 3. The difference in expressibility gradually decreases as we add more layers and the expressibility values saturate.
This reduction in the CP PQCs' expressibility is not necessarily harmful to performance. As we mentioned in Section II-D, findings from Holmes et al. [53] indicate that the more expressive the PQC, the smaller the variance in cost gradients and, hence, the harder it is to train. Their results also suggest that ansatze need not be highly expressive; instead, it is more important that they are trainable and contain a solution to the problem. In that respect, the CP configuration proves to be more trainable (Section IV-F). Our algorithm performance results (Section IV-G) further confirm that this reduction in expressibility does not harm the algorithm performance and can, in fact, improve it.

E. Entanglement
Fig. 7 shows the trend of entanglement entropy for the three PQC configurations with increasing circuit depth. The results are obtained for 9-qubit PQCs with a 4-5 partition, as shown in the figure. We see from the trend lines that base always creates more entanglement than the CP PQCs. The base configuration has an entropy that is, on average, 2.58× higher than CP's across all circuit depths. Such a reduction in entanglement is expected due to the short durations of the CR tones used in CP compared to CNOT. We also see that the entropy difference drops as we increase the circuit depth, before the values reach saturation.
Our estimate of the PQCs' entanglement can be made more accurate by accounting for spectator entanglements as well. As mentioned in Section III-A, the CR interaction on transmons can also generate coherent terms due to coupling with the target's nearest-neighbors or spectators. As we also mentioned, the target rotaries in the echoed CR (ECR) sequence are proven to suppress this type of entanglement [69]. Thus, our CR(π/4) pulse can possibly have more spectator entanglements, which can lead to entirely different entanglement dynamics. Accounting for spectator interactions, however, requires additional experimentation. This can be done by using generic quantum tomography techniques or, more favorably, the Hamiltonian Error Amplifying Tomography (HEAT) technique proposed by Sundaresan et al. [69].

F. Trainability
To analyze the PQC configurations' trainability, we follow a cost-function-based analysis similar to that in [8] and [9]. We used a simple ground state preparation problem, which can be defined by the global cost function $C_G = 1 - p_{|0\rangle^{\otimes N}}$, where $N$ is the total number of qubits and $p_{|0\rangle^{\otimes N}}$ is the probability of measuring the $|00\ldots0\rangle_N$ state. For the local cost function, we only consider the probability of a subset of qubits, $C_L = 1 - p_{|0\rangle^{\otimes N_C}}$, where $N_C$ is the number of cost-function qubits. It is interesting to point out that for $N_C = 1$, $C_L$ has a cost landscape similar to that of a local cost function acting on each qubit separately [78]. Fig. 8(a) shows the results for different cost function and PQC settings. The bottom two lines (Deep, $C_G$) follow the conclusions from [8] that this cost function, and others like it, exhibit barren plateaus. The figure also shows that local cost functions like $C_L$ exhibit barren plateaus for large numbers of layers. We see that base shows better variance for small numbers of qubits, but both curves decrease exponentially due to barren plateaus. More interestingly, we see that CP_ang has better local cost function trainability with shallow layers. This observation is expanded in Fig. 8(b) for the three PQC configurations, which confirms it for different values of $N_C$ (up to a certain limit). The results also suggest that the performance gap (indicated by the shaded regions) shrinks with increasing $N_C$. This better overall local cost function trainability can be attributed to the CP PQC's reduced expressibility, entanglement, and duration, as each of these parameters has been shown to negatively affect training [9], [28], [53].
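The cost functions and gradient-variance analysis described above can be sketched with a small state-vector simulation. This is only an illustration under stated assumptions: the ansatz here uses R_Y rotations with a chain of CZ gates as a stand-in entangler (not the CR-based entanglers studied in this paper), gradients are taken with respect to the first parameter using the parameter-shift rule, and all function names are hypothetical.

```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, q, n):
    """Apply a 2x2 gate to qubit q (qubit 0 is the most significant bit)."""
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [q]))  # new axis lands at position 0
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cz(state, q1, q2, n):
    """Flip the sign of every amplitude where both qubits are |1>."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q1] = idx[q2] = 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)

def pqc_state(thetas, n, layers):
    """Layered ansatz: R_Y on every qubit, then a CZ chain (stand-in entangler)."""
    state = np.zeros(2**n)
    state[0] = 1.0
    thetas = np.asarray(thetas).reshape(layers, n)
    for layer in range(layers):
        for q in range(n):
            state = apply_1q(state, ry(thetas[layer, q]), q, n)
        for q in range(n - 1):
            state = apply_cz(state, q, q + 1, n)
    return state

def cost(thetas, n, layers, n_c):
    """1 - P(first n_c qubits all read 0); n_c = n gives C_G, n_c < n gives C_L."""
    probs = np.abs(pqc_state(thetas, n, layers))**2
    return 1.0 - probs.reshape(2**n_c, -1)[0].sum()

def grad_first_param(thetas, n, layers, n_c):
    """Parameter-shift gradient with respect to the first rotation angle."""
    shift = np.zeros(len(thetas))
    shift[0] = np.pi / 2
    return 0.5 * (cost(thetas + shift, n, layers, n_c)
                  - cost(thetas - shift, n, layers, n_c))

def grad_variance(n, layers, n_c, samples=200, seed=0):
    """Variance of the gradient over uniformly random parameter initializations."""
    rng = np.random.default_rng(seed)
    grads = [grad_first_param(rng.uniform(0, 2 * np.pi, layers * n), n, layers, n_c)
             for _ in range(samples)]
    return float(np.var(grads))

var_global = grad_variance(n=4, layers=2, n_c=4)  # C_G
var_local = grad_variance(n=4, layers=2, n_c=1)   # C_L with N_C = 1
```

Sweeping n and layers in grad_variance and plotting the result on a logarithmic axis gives curves of the kind discussed for Fig. 8, although the numerical values for this stand-in ansatz are not expected to match the hardware-calibrated configurations.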
To further explore the advantages of local cost function training, we compare the VQE optimization performance of two different 4-qubit Hamiltonians for the H2 molecule. The Hamiltonians ($H_{JW}$, $H_{BK}$) were obtained using two of the most commonly used techniques to map fermionic to spin operators: Jordan-Wigner (JW) [33] and Bravyi-Kitaev (BK) [79]. As discussed in [8], [80], the BK mapping often leads to more local Pauli terms and hence to more trainable cost functions. Fig. 9 shows the results from the experiment run on ibmq_montreal. Contrary to our expectations based on the local cost function analysis, CP_dur performs poorly with the BK mapping compared to the other PQC configurations and to its own JW performance. This is an important finding, as it reveals that other factors (yet to be determined) besides the locality of the cost function affect CP performance. On the other hand, we see a larger performance gap between CP_ang and base for the BK mapping than for the JW mapping (the shaded areas in the figure), indicating that the performance was affected by the locality of the Hamiltonian. Although this somewhat confirms the results from Fig. 8 showing that CP has better local cost function trainability than base, the two Hamiltonian mappings had similar performance for each PQC configuration. Additionally, both mappings fall short in terms of performance compared to an H2 mapping that uses 2 qubits (Section IV-G). Overall, we believe that utilizing efficient local cost function implementations can lead to better performance (as shown in [80]) using pulse-optimized gates, and we leave this exploration for future work.

G. Algorithm Performance
We compare the performance of the three PQC configurations with two sets of VQE applications from chemistry and optimization. In chemistry, we use VQE to find the ground state energy of the H2, LiH, and BeH2 molecules, which corresponds to finding the minimum eigenvalue of the Hermitian matrices characterizing these molecules. For optimization, we solve three MaxCut problems (shown in Fig. 11). We ran all our benchmarks on ibmq_montreal, accessed through IBM Cloud, and configured our experiments as follows. We used Simultaneous Perturbation Stochastic Approximation (SPSA) [77] as our optimization routine, with the maximum number of iterations set to 100. We use an $R_Y R_Z$ rotation (instead of the $R_Y$ shown in Fig. 5) for the chemistry benchmarks. The number of layers was set to 5 for both CP and base across all applications.
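For readers unfamiliar with SPSA, the following is a minimal sketch of the update rule with a 100-iteration budget. It is not the optimizer implementation used in our experiments: the gain constants (a, c, A, alpha, gamma) are commonly quoted defaults chosen here as assumptions, and the quadratic toy_cost is a hypothetical stand-in for the VQE energy evaluated on hardware.

```python
import numpy as np

def spsa_minimize(cost, x0, max_iter=100, a=0.2, c=0.1, A=10.0,
                  alpha=0.602, gamma=0.101, seed=0):
    """Minimize cost(x) using two cost evaluations per iteration."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        a_k = a / (k + 1 + A)**alpha   # decaying step size
        c_k = c / (k + 1)**gamma       # decaying perturbation size
        # Perturb all parameters simultaneously along a random +/-1 direction.
        delta = rng.choice([-1.0, 1.0], size=x.shape)
        diff = cost(x + c_k * delta) - cost(x - c_k * delta)
        g_hat = diff / (2 * c_k) / delta  # elementwise gradient estimate
        x = x - a_k * g_hat
    return x

def toy_cost(x):
    """Hypothetical stand-in for a VQE energy: quadratic with minimum at x = 1."""
    return float(np.sum((x - 1.0)**2))

x_start = np.zeros(3)
x_final = spsa_minimize(toy_cost, x_start, max_iter=100)
```

The appeal of SPSA for hardware runs is that each iteration needs only two cost evaluations regardless of the number of parameters, which keeps the number of circuit executions low as the PQC grows.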
1) Chemistry Benchmarks: Fig. 10 shows VQE results for three molecules: H2, LiH, and BeH2. The Hamiltonians were obtained through Qiskit's integration with the PySCF library [81]. We favor reducing the number of qubits, guided by a quick trainability analysis for H2, which revealed that the 2-qubit mapping of H2 has a variance in partial gradients $\mathrm{Var}[\partial C / \partial \theta_o]$ that is 3× higher than that of the 4-qubit mapping. Therefore, we chose the Jordan-Wigner and Parity [79] mappings to map our molecules' fermionic operators to spin operators. The Parity mapping was chosen for H2 and LiH, as it allowed for reducing the number of qubits by utilizing Z2 symmetries. Table IV summarizes the experiments' configurations for each molecule. Fig. 10(a) shows the results for the H2 molecule. Both the base and CP_ang PQCs fail to find the lowest energies, but their results are about equally close to the exact solution, while CP_dur performs slightly worse, which indicates a lower quality for this PQC in small configurations. For the LiH molecule shown in Fig. 10(b), the results of both the base and CP PQCs are fairly close to the exact solution, with CP_dur being slightly closer than the others. For the 6-qubit BeH2 problem shown in Fig. 10(c), the CP PQCs clearly outperform base, with the lowest energy obtained through CP_ang reaching chemical accuracy (defined to be within 0.0016 Hartree of the exact result). This result indicates that the CP configurations may have better potential with larger and more complex problem structures.
2) MaxCut: For the MaxCut benchmarks, the optimized set of parameters obtained by VQE was first used to prepare a quantum state through the PQC. This state was then sampled to construct an eigenstate, from which the highest probabilities correspond to MaxCut solutions (graph partitionings). The solutions can then be evaluated by calculating their cut values and comparing them to a classically calculated MaxCut reference.
Table V shows the number of correct MaxCut solutions out of the top-5 solutions for each PQC configuration. The table also shows results using the Rank of Correct Answer (ROCA) metric proposed by Tannu et al. [82], which, as its name suggests, accounts for the order of appearance of the correct answer(s).
We see that CP_ang generally performs better than base and CP_dur, specifically for the 3- and 9-node problems. This is evident for the 9-node case, where the CP_ang configuration was capable of finding the correct solution with a ROCA of 2, compared to 0 correct solutions for the two other configurations.
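The MaxCut evaluation steps described above (cut values, a classical reference, top-5 counting, and ROCA) can be sketched as follows. This is a hedged illustration, not the evaluation script used for Table V: the triangle graph and the measurement counts are made up, and the ROCA convention used here (1-based rank of the most frequent correct bitstring, 0 when no correct bitstring was sampled) is an assumption.

```python
from itertools import product

def cut_value(bits, edges):
    """Number of edges whose endpoints fall in different partitions."""
    return sum(1 for u, v in edges if bits[u] != bits[v])

def best_cuts(n_nodes, edges):
    """Brute-force classical reference: optimal cut value and all optimal bitstrings."""
    best, winners = -1, set()
    for assignment in product("01", repeat=n_nodes):
        bits = "".join(assignment)
        value = cut_value(bits, edges)
        if value > best:
            best, winners = value, {bits}
        elif value == best:
            winners.add(bits)
    return best, winners

def ranked_bitstrings(counts):
    """Sampled bitstrings sorted from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)

def correct_in_top_k(counts, correct, k=5):
    """How many of the k most frequent bitstrings are optimal cuts."""
    return sum(1 for bits in ranked_bitstrings(counts)[:k] if bits in correct)

def rank_of_correct_answer(counts, correct):
    """Assumed ROCA convention: 1-based rank of the first correct bitstring, else 0."""
    for rank, bits in enumerate(ranked_bitstrings(counts), start=1):
        if bits in correct:
            return rank
    return 0

# Hypothetical 3-node triangle graph and made-up measurement counts.
edges = [(0, 1), (1, 2), (0, 2)]
counts = {"000": 50, "011": 30, "101": 20}

best_value, optimal = best_cuts(3, edges)
top5 = correct_in_top_k(counts, optimal, k=5)
roca = rank_of_correct_answer(counts, optimal)
```

In this toy example the most frequent outcome is not an optimal cut, so the first correct answer appears at rank 2. Brute-force enumeration is only practical for small graphs such as the 3- to 9-node instances considered here.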
In conclusion, we see that the CP configurations have, on average, better algorithmic performance than base, specifically for larger problem instances (the 6-qubit BeH2 molecule and the 9-node MaxCut). The CP configurations' reduction of expressibility, entanglement, and duration proves beneficial to the algorithm's performance and trainability. We also observe that CP_ang performs better than CP_dur in general, which shows the sensitivity of the algorithm's performance to the tuning of the CR pulse. We argue that optimizing pulse parameters, alongside the utilization of efficient local cost functions, can lead to further improvements.

V. RELATED WORK
Hardware-efficient PQCs were first proposed by Kandala et al. [6]. Their work used fixed-duration entanglers to simulate the performance of VQE for small molecules and quantum magnets. In this work, we extend their usage of Hamiltonian tomography by utilizing its data to analyze our PQCs for expressibility, trainability, and entanglement. Recent studies have explored the analysis and optimization of hardware-oriented VQAs. Ravi et al. [22] proposed a VQA error-mitigation approach that tunes single-qubit gate scheduling and dynamical decoupling sequences in a variational approach. Other works have also explored the effects of noise on VQAs and hardware-efficient PQCs [19], [84], [85]. The work in [85] determines the optimal PQC depth at different noise levels and investigates the circuit's resiliency to noise with the inclusion of redundant parameterized gates. Zeng et al. [84] simulate the performance of specific hardware-efficient PQCs with different noise models and noise levels. Their results showed that VQE's performance degrades as the noise probability or the circuit depth increases. A more recent study by Saib et al. [19] discussed the effect of noise on chemistry applications and profiled various PQCs for expressibility. Their results suggest that expressibility is weakly correlated with VQE performance. We note that the original expressibility and entanglement paper [32] states that it has not yet discovered an accurate correlation between these measures and VQE applications. Our work aims to uncover ways to connect PQC descriptors, such as trainability, expressibility, and entanglement, to hardware-specific parameters in PQC design.
More recently, VQA optimization through Quantum Optimal Control (QOC) has gained more traction [20], [21], [25], [86]. For hardware-efficient gate-based PQCs, Liang et al. [23] proposed a pulse optimization framework that manipulates the PQC gate amplitudes as part of the VQA optimization routine. In contrast to their approach, we choose to preconfigure our CR pulse parameters and not attach them to the VQA optimization procedure, as it was shown in [6] through numerical simulations that accurate optimizations can be obtained for fixed-phase two-qubit gates. Additionally, as over-parameterization of pulses can lead to difficulties in optimization [25], our approach leads to a lower number of parameters and, ultimately, a faster VQA implementation as the circuit size grows. A more recent work by the same group [87] proposes a progressive pulse-ansatz construction and learning approach that utilizes non-gradient optimizers to generate more scalable and efficient ansatze.
Utilizing pulse access for faster two-qubit gate implementation has been explored in [64], [74], [88], [89]. Jurcevic et al. [74] experimented with a direct CNOT approach that uses compensation mechanisms different from the echoed CR implementation to achieve a higher quantum volume of 64 on IBM machines. Gokhale et al. [64] utilized OpenPulse and knowledge of the CR gate to implement a more efficient $R_{ZZ}$ rotation, which is a core operation for quantum chemistry and optimization algorithms. Their optimized implementation achieved both error rate and execution time reductions and has been adopted by Qiskit's transpiler, as shown in Table II. More recently, Stenger et al. [89] proposed a pulse-scaling method that scales the area of the CR and rotary pulses to create $R_{ZX}(\theta)$ rotations. Their method improves the gate fidelity with no additional calibrations. Their work has been further extended in [88] to arbitrary gates and to a pulse-efficient circuit transpilation framework, which decomposes two-qubit gates into the hardware-native $R_{ZX}$ rather than relying on CNOT-based transpilation.
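To make the relation between the hardware-native $R_{ZX}$ rotation and the CNOT concrete, the following sketch numerically checks the standard identity CNOT = e^{-iπ/4} [R_Z(-π/2) ⊗ R_X(-π/2)] R_ZX(π/2), with R_P(θ) = exp(-iθP/2) and the control as the first tensor factor. It is a matrix-level illustration only and says nothing about pulse calibration; the helper names rot and rzx are hypothetical.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ZX = np.kron(Z, X)  # control qubit is the first tensor factor

def rot(pauli, theta):
    """exp(-i * theta / 2 * P) for any operator P that squares to the identity."""
    dim = pauli.shape[0]
    return np.cos(theta / 2) * np.eye(dim) - 1j * np.sin(theta / 2) * pauli

def rzx(theta):
    """Two-qubit R_ZX(theta) rotation."""
    return rot(ZX, theta)

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# A CNOT needs a full R_ZX(pi/2) plus single-qubit corrections and a global phase.
local = np.kron(rot(Z, -np.pi / 2), rot(X, -np.pi / 2))
cnot_from_rzx = np.exp(-1j * np.pi / 4) * (local @ rzx(np.pi / 2))

# Partial rotations compose additively, so two R_ZX(pi/4) equal one R_ZX(pi/2).
half_twice = rzx(np.pi / 4) @ rzx(np.pi / 4)
```

Because rotation angles add, a CR(π/4)-type entangler corresponds to half the ZX rotation required for a CNOT, which is consistent with the shorter entangler durations discussed in this work.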

VI. CONCLUSIONS
In this work, we utilize pulse-level access to quantum machines to alter the standard design of two-qubit gates. Additionally, we identify a suitable combination of hardware and algorithmic parameters that can be efficiently embedded in the design and have been shown to impact performance. Our analysis results show that our customized pulse implementation maintains expressibility similar to that of a standard PQC and is more trainable for local cost functions, all while reducing the circuit duration by half. Therefore, this implementation is more suitable for VQAs. Our algorithm performance results show that in at least three cases, our customized pulse PQC configuration outperforms the base implementation. As previous literature closely ties PQC parameters such as entanglement, noise, and expressibility to barren plateaus, we believe that pulse optimization, which directly impacts these parameters, is a very promising approach to enhancing trainability. We leave this as our main future goal. Other next steps include designing a comprehensive entanglement model of the PQC by including spectator entanglement, and further testing with a more diverse set of PQC architectures.

VII. ACKNOWLEDGEMENT
M.I. would like to thank the NSF QISE-NET Fellowship for funding through the grant DMR 17-47426, and the IBM Quantum Hub at NC State for access to ibmq_montreal.