A Current-Mode Multiphase Digital Transmitter With a Single-Footprint Transformer-Based Asymmetric Doherty Output Network

This article introduces a current-mode multiphase digital transmitter with a single-footprint transformer-based asymmetric Doherty output network. The proposed multiphase architecture overcomes the bandwidth expansion associated with the polar power amplifier (PA), while still achieving relatively constant output power and drain efficiency (DE) profiles. Additionally, to achieve efficiency enhancement in deep power back-off (PBO), and to simultaneously achieve a compact form factor, an asymmetric series Doherty output matching network using a transformer-within-transformer structure is also proposed. A proof-of-concept eight-phase digital transmitter using the proposed single-footprint Doherty network is implemented in a general-purpose 65-nm CMOS process. The transmitter achieves more than 20-dBm output power (P_out) and more than 31% DE from 4.5 to 6.7 GHz. At 8-dB PBO, it achieves a DE of 23% and 24% at 6.5 and 7.0 GHz, which corresponds to a 1.76× and 1.93× improvement compared to a normalized class B PA, respectively. The transmitter also achieves a 21% DE and an average P_out of 14 dBm with an r.m.s. error vector magnitude (EVM_rms) of 4.1% for a 20-MSym/s 64-quadrature amplitude modulation waveform at 6.5 GHz.


I. INTRODUCTION
THE SUB-7-GHz spectrum is widely used for a variety of wireless communication networks, such as 4G long-term evolution (LTE), 5G new radio (NR), and Wi-Fi 6E. In recent years, digitally implemented transmitters have garnered a significant amount of interest for such applications [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], due to their ability to scale well with process nodes and their potential to achieve high efficiency while realizing a compact form factor [12]. However, most of these applications employ higher-order modulation schemes that result in a large peak-to-average power ratio (PAPR), which severely limits the efficiency improvement offered by a digital transmitter. Furthermore, the two popular digital transmitter implementations, polar and quadrature, suffer from additional limitations, as explained below. Therefore, in this work, a multiphase digital transmitter with a single-footprint transformer-based asymmetric Doherty output network is proposed to overcome these challenges.
A digital transmitter can be implemented either in current-mode fashion (generally using class D⁻¹ switches) or voltage-mode fashion (using class D switches). In terms of linearity, a voltage-mode implementation is typically advantageous since it inherently produces a linear output amplitude response with respect to input code words. However, the class D switches require both nMOS and pMOS devices, resulting in increased input capacitance, which limits the operating frequency [13]. Additionally, the output capacitance of these switches is not part of the matching network, so it must be charged and discharged every cycle, which can lead to efficiency degradation, especially as the operating frequency is increased [1]. Therefore, a current-mode implementation is chosen to demonstrate the proposed techniques of this work.
In the literature, the polar power amplifier's (PA) need for a complex coordinate rotation digital computer (CORDIC) and the issue of bandwidth expansion related to the constant envelope phase modulated (PM) input signal are well known [1], [2], [3], [4], [5], [6], [7], [8], [14], [15], [16], [17], [18], [19], [20], [21]. These issues limit the overall achievable transmitter bandwidth. To solve this problem, a digital quadrature architecture has been explored [10]; however, it results in a 6-dB worst-case output power loss [22]. The IQ cell-sharing technique ameliorates this problem [9], [22], [23], [24], [25], [26]; however, it still results in a worst-case 3-dB output power loss. Taking advantage of a current-mode PA's nonlinearity, a diamond-shaped input code-word profile has been shown to achieve an almost constant output power profile [11]; however, this technique suffers from a degraded efficiency profile due to the large in-phase/quadrature vector overlap, as shown in the next section. Utilizing a multiphase architecture can reduce this overlap, and thereby improve performance [27]. For a voltage-mode implementation, a multiphase approach has been shown to improve output power and efficiency profiles [28], [29]. In this work, a current-mode multiphase architecture is proposed to significantly increase the operating frequency, while also achieving an almost constant output power profile and an improved efficiency profile.
In addition, given the extensive use of high PAPR modulated signals, efficiency enhancement in deep power back-off (PBO) is critical to improve the average efficiency of the transmitter. Numerous techniques, such as class-G [14], [18], [19], [22], [23], out-phasing [30], and subharmonic switching [31], have been explored in this regard; however, they are challenging to implement due to the need for complex power management circuits or complex baseband processing. A two-way Doherty network offers a potential solution; however, it limits the efficiency enhancement only to 6-dB PBO [3], [17], [32]. An asymmetric two-way Doherty network can overcome this limit to achieve efficiency enhancement in deep PBO [33], [34], [35]. Typically, such a network, for a current-mode PA, consists of at least two transformers [17], [36], [37], which occupy a considerable amount of on-chip area. Therefore, this work proposes a transformer-within-transformer structure for an asymmetric series Doherty network to achieve efficiency enhancement in deep PBO, while also achieving a compact form factor.
To highlight the proposed techniques, an eight-phase digital transmitter with a compact asymmetric series Doherty network is implemented in a general-purpose 65-nm CMOS process. It attains more than 20-dBm peak output power (P_out) and more than 31% maximum drain efficiency (DE) from 4.5 to 6.7 GHz. At 8-dB PBO, it achieves a DE of 23% and 24% at 6.5 and 7.0 GHz, which corresponds to a 1.76× and 1.93× improvement compared to a normalized class B PA, respectively. The transmitter also achieves a 21% DE and an average P_out of 14 dBm with an r.m.s. error vector magnitude (EVM_rms) of 4.1% for a 20-MSym/s 64-quadrature amplitude modulation (QAM) waveform at 6.5 GHz.
The remainder of this article is divided as follows. Section II provides a discussion on the design approach for a multiphase digital PA architecture, a wideband multiphase generation technique, and an asymmetric series Doherty output network, along with a derivation of the design equations for realizing the output network. The implementation of all the circuit blocks and the single-footprint transformer-based matching network is included in Section III. Continuous wave (CW) and modulated measurements are shown in Section IV, and conclusions are drawn in Section V.

A. MULTIPHASE ARCHITECTURE
The operation of the proposed multiphase digital PA architecture is depicted in Fig. 1. The PA consists of a total of p cells, where each cell is composed of a differential switch driven by one of the two differential basis vectors φ_A/φ_B. In Fig. 1, m cells are driven with φ_A, n cells are driven with φ_B, and the remaining [p − (m + n)] cells are turned off. The values of m and n can be changed dynamically during the operation of the PA, and the use of the cell-sharing technique allows all of the p cells to be driven entirely by either φ_A or φ_B, or with a combination of φ_A and φ_B, as long as

m + n ≤ p    (1)

to achieve the desired output amplitude (|V_out|) and output phase (∠V_out). Consider the simplest implementation of such an architecture, i.e., quadrature architecture, where φ_A is 0° and φ_B is 90°. The cell-sharing technique, defined by (1), limits the input code-word profile to a diamond shape, as shown in Fig. 2(a). However, nonlinear summation of the basis vectors φ_A/φ_B can be exploited to obtain a relatively circular output profile [11]. When the cells are driven with a combination of φ_A and φ_B, the overall on-time of the switches increases, which is equivalent to increasing the duty cycle of a switching PA to more than 50%. This leads to a boost in the output voltage waveform, which results in a fundamental output power similar to that of a 50% duty-cycled switching PA [38], [39], [40]. Therefore, the output profile is relatively circular, as shown in Fig. 2(b).

However, this nonlinear summation has severe implications on the maximum achievable efficiency of the PA at a given ∠V_out. When the PA operates entirely along one of the basis vectors, i.e., m = p, n = 0 or m = 0, n = p, it achieves zero-voltage switching (ZVS), which results in maximum efficiency. However, the overlap between φ_A/φ_B causes the switches to turn on before the voltage waveform reaches zero, leading to efficiency degradation.
This degradation gets worse as the overlap between the basis vectors increases, with the worst-case degradation occurring around m = n = p/2. The proposed multiphase architecture ameliorates this issue by increasing the total number of basis vectors to reduce this overlap.
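The effect of the basis-vector spacing on the code-word space can be illustrated with an idealized linear vector sum. The sketch below is a simplification and not the nonlinear current-mode behavior described above: it assumes each active cell contributes a unit phasor, and the function name, the normalization by p, and the value p = 64 are illustrative assumptions.

```python
import cmath
import math


def linear_vector_sum(m, n, p, delta_phi_deg):
    """Idealized output phasor when m cells follow phi_A and n cells follow phi_B.

    Assumption: every active cell contributes a unit phasor and the
    contributions add linearly; the real current-mode PA sums nonlinearly.
    phi_A is taken as 0 degrees and phi_B as delta_phi_deg.
    """
    if m < 0 or n < 0 or m + n > p:
        raise ValueError("cell sharing requires m >= 0, n >= 0 and m + n <= p")
    phi_b = math.radians(delta_phi_deg)
    return (m + n * cmath.exp(1j * phi_b)) / p


if __name__ == "__main__":
    p = 64  # hypothetical number of cells
    for delta_phi in (90.0, 45.0, 22.5):
        v = linear_vector_sum(p // 2, p // 2, p, delta_phi)
        print(f"delta_phi = {delta_phi:5.1f} deg: |V| = {abs(v):.3f}, "
              f"angle = {math.degrees(cmath.phase(v)):.2f} deg")
```

In this linear model the mid-sector amplitude is cos(Δφ/2) of the full-scale value, i.e., about 3 dB below full scale for the quadrature case, which matches the worst-case loss quoted for IQ cell sharing in the introduction. A smaller spacing between basis vectors brings the mid-sector amplitude closer to full scale.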
To quantify the improvement offered by the multiphase architecture, the PA illustrated in Fig. 1 is simulated under the condition m = n = p/2, and the phase difference between φ_A and φ_B, i.e., Δφ, is swept from 0° to 120°, as shown in Fig. 3. Here, the switches are modeled using a simple on-off resistance, and the results are in close agreement with switches implemented as cascode cells in a general-purpose 65-nm CMOS process. Note that Δφ = 0° is equivalent to operating entirely along one of the basis vectors. For the quadrature architecture (Δφ = 90°), the maximum efficiency at the middle of the basis vectors (∠V_out = 45°) drops severely, to 0.55× of the efficiency obtained along the basis vectors (∠V_out = 0° or ∠V_out = 90°). In contrast, the efficiency remains at 0.91× for an eight-phase architecture, and the drop is almost negligible (about 0.99×) for the 16-phase architecture, highlighting the advantage of the multiphase digital PA.
Given that the PA achieves > 0.9× of the performance for Δφ = 45°, an eight-phase architecture is implemented in this work. Fig. 4 compares the efficiency of the eight-phase architecture with its quadrature counterpart for the case m + n = p (i.e., |V_out,max| contour). As expected, both architectures achieve an identical maximum efficiency of 76% when operated along the basis vectors; however, the efficiency severely degrades for the quadrature architecture when ∠V_out is around the middle of the two basis vectors, with a worst-case efficiency of 42%. On the other hand, the eight-phase architecture maintains a relatively high worst-case efficiency of 69%, achieving a 1.64× improvement. This improvement results in an increased average efficiency when the PA is used to transmit complex modulation schemes such as QAM.

The input and output code-word spaces for the eight-phase PA are shown in Fig. 2(c) and (d), respectively. Similar to the quadrature architecture, the eight-phase architecture also achieves a relatively circular output profile due to the nonlinear summation of the basis vectors. It is important to note that a 2-D lookup table would be needed during the modulation measurements to account for this nonlinearity. Another advantage of the eight-phase architecture over the quadrature architecture is the increased resolution in the code-word space. This is due to the total number of code-words being distributed in smaller sectors because of the increased number of basis vectors, as shown in Fig. 2. This improvement in resolution translates to an improved quantization noise floor [9], [41], [42].

B. MULTIPHASE GENERATION
In order to implement a transmitter based on the eight-phase architecture, a multiphase generator is used to produce eight equally spaced phase-shifted signals on-chip. The multiphase generation circuits need to operate over a wide range of frequencies to ensure that the input network of the transmitter does not become the bottleneck of the RF bandwidth, thereby allowing it to be determined by the more complex output matching network of the PA.
Architectures [...] transmitter would require an input frequency from 20 to 28 GHz, which is cumbersome. On the other hand, although the PPF and the 2P-8ILRO architectures do not suffer from this requirement, their multiphase outputs exhibit a significant amount of error. For example, a simulation of an ideal (before parasitic extraction) PPF designed at 6.5 GHz followed by a bias tee and a NOT gate to generate rail-to-rail voltage swings exhibits > 6° of error at 4.5 GHz, as shown in Fig. 6(a). Similarly, the simulation of an ideal 2P-8ILRO exhibits increasing error as the injected frequency deviates from the natural frequency of the ring oscillator. Additionally, the 2P-8ILRO architecture also suffers from a relatively narrow locking range, as shown in Fig. 6(b).

To overcome these challenges, an eight-phase injection-locked eight-stage ring oscillator (8P-8ILRO) is proposed. In the literature, a four-stage ring oscillator with injection signals provided to all four of the quadrature nodes has been shown to significantly reduce multiphase error, even if the injected signals themselves exhibit high multiphase error [43]. Additionally, it has also been shown that this technique improves the overall locking range of the ILRO, thereby easing the requirement of the injected frequency to be closely aligned with the natural frequency of the ring oscillator [44]. In this work, this technique has been extended to generate eight phases by combining the PPF and the ILRO architectures. The eight outputs of the three-ring PPF, shown in Fig. 5(b), are injected into all eight nodes of the 8ILRO, as opposed to only two nodes, as depicted in Fig. 5(c). Although the PPF outputs still exhibit the multiphase error of Fig. 6(a), the outputs of the proposed technique exhibit < ±0.4° error over the entire operating range, as shown in Fig. 6(c). In addition, the locking range of the 8P-8ILRO increases to almost 4 GHz, compared to <1 GHz offered by the 2P-8ILRO counterpart.
Therefore, the 8P-8ILRO architecture is chosen for multiphase generation.

C. ASYMMETRIC DOHERTY OUTPUT NETWORK
Although the eight-phase architecture improves efficiency for the |V_out,max| contour, the efficiency still degrades in output PBO. A transformer-based asymmetric Doherty architecture can be used to overcome this challenge. This architecture is composed of a main PA, a peaking PA, and an inverter network that work collectively to present optimal impedances to the PAs at the maximum output power and at a given back-off level to achieve efficiency enhancement. The main and the peaking PAs can be combined either in a parallel [18], [45] or a series [36], [37] fashion. However, the parallel transformer adds the currents from the two PAs to the output, causing the load resistance (R_L) to transform up, while the series architecture adds the voltages from the PAs at the output, causing R_L to transform down. Since CMOS PAs are voltage-limited, a low impedance is desired to output higher power; therefore, a series Doherty architecture is implemented in this work. Fig. 7(a) shows the schematic of a transformer-based asymmetric series Doherty architecture, which consists of two transformers driven by a differential main and a differential peaking PA. The voltage supply for the PAs is provided through the virtual shorts at the center tap of the primary coil of the transformers. The secondary coils of the two PAs are connected in series to achieve a single-ended output, and a 90° transmission line with a characteristic impedance of R_L/α is placed in series with the secondary coils. The peaking PA is driven 90° out of phase with respect to the main PA to account for the delay in the transmission line.
To achieve efficiency enhancement in PBO, the current drive strengths of the main and peaking PAs need to satisfy

I_Peak = (α − 1)·I_Main    (2)

as shown in Fig. 7(b), where α denotes the back-off level on a linear scale when the peaking amplifier turns off and efficiency enhancement is achieved [46]. The resulting impedances seen by the main and peaking PAs are shown in Fig. 7(c), where Z_Main = (α − 1)·Z_Peak at 0-dB PBO, as desired, since the peaking PA is (α − 1) times bigger than the main PA based on (2). As the peaking PA turns off, the impedance seen by the main PA increases by a factor of α, due to the nature of the 90° transmission line, leading to the desired efficiency enhancement at 20·log(α)-dB PBO, as shown in Fig. 7(d).
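The relation between α, the back-off level, and the relative PA sizes can be tabulated with a few lines of code. This is a minimal sketch based only on the 20·log(α) and (α − 1) relations stated above; the function names are illustrative.

```python
import math


def doherty_backoff_db(alpha):
    """Back-off level (dB) at which the peaking PA turns off: 20*log10(alpha)."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return 20.0 * math.log10(alpha)


def peaking_to_main_ratio(alpha):
    """Size of the peaking PA relative to the main PA: alpha - 1."""
    return alpha - 1.0


if __name__ == "__main__":
    for alpha in (2.0, 3.0, 4.0):
        print(f"alpha = {alpha:.0f}: PBO = {doherty_backoff_db(alpha):.2f} dB, "
              f"peaking/main size = {peaking_to_main_ratio(alpha):.0f}")
```

For α = 2 this reduces to the symmetric case with enhancement at about 6-dB PBO, while α = 3 gives about 9.5 dB with a peaking PA twice the size of the main PA.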
The transmission line, shown in Fig. 7(a), is not practical to implement on chip for sub-7-GHz designs due to its large size and therefore needs to be implemented using lumped component approximations. Although high-pass (π and T) and low-pass (π and T) equivalent LC networks all provide viable solutions, the high-pass T network will result in a compact design, as shown in Fig. 8(a) and (b). Here, the values of the inductance and the capacitance are given by Z_0/ω = R_L/(ω·α) and 1/(Z_0·ω) = α/(ω·R_L), respectively [37]. k_1 and k_2 represent the magnetic coupling coefficients of the main and peaking transformers, respectively, and the inductors L_res1 and L_res2 have been added to resonate the parasitic capacitors of the two PAs. Given that a practical transformer with a coupling coefficient k provides a parallel magnetizing inductance given by k²·L, and a series leakage inductance given by (1 − k²)·L, they can be used to produce the necessary inductance for L_res and the transmission line, as shown in Fig. 8(a). Therefore, the entire asymmetric series Doherty output network can be implemented simply using two transformers (L_p1-L_s1 and L_p2-L_s2) and a capacitor (C_OMN), as shown in Fig. 8(b).
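The lumped-element values above follow directly from the stated formulas. The sketch below evaluates them for R_L = 50 Ω, α = 3, and 6.5 GHz (the operating point used later in the implementation); the function names are illustrative, and the magnetizing/leakage split simply applies the k²·L and (1 − k²)·L expressions.

```python
import math


def lumped_line_elements(r_load, alpha, freq_hz):
    """Return (Z0, L, C) for the lumped 90-degree line with Z0 = R_L / alpha."""
    omega = 2.0 * math.pi * freq_hz
    z0 = r_load / alpha
    inductance = z0 / omega            # L = Z0 / omega = R_L / (omega * alpha)
    capacitance = 1.0 / (z0 * omega)   # C = 1 / (Z0 * omega) = alpha / (omega * R_L)
    return z0, inductance, capacitance


def transformer_inductances(l_winding, k):
    """Split a winding inductance into (magnetizing, leakage) = (k^2 L, (1 - k^2) L)."""
    return k * k * l_winding, (1.0 - k * k) * l_winding


if __name__ == "__main__":
    z0, ind, cap = lumped_line_elements(50.0, 3.0, 6.5e9)
    print(f"Z0 = {z0:.2f} ohm, L = {ind * 1e9:.3f} nH, C = {cap * 1e12:.3f} pF")
```

With these inputs the sketch yields roughly 0.41 nH and 1.47 pF; the capacitance is consistent with the C_OMN value reported in Section III.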

D. DERIVATION OF DESIGN EQUATIONS FOR SERIES DOHERTY OUTPUT MATCHING NETWORK
This section derives the design equations for a transformer-based asymmetric series Doherty network. Consider the circuit shown in Fig. 7(a). When the peaking PA is turned off at the back-off efficiency enhancement level α, it presents an open circuit to the transformer network, implying Z_3 = ∞ Ω. The 90° transmission line translates that open to a short, resulting in Z_4 = 0 Ω. Thus, the impedance seen by the main PA is simply the load resistance transformed by the coupling coefficient k_1 and the transformer turns ratio n_1, given by [...]. Therefore, to achieve the desired efficiency enhancement, the output network needs to ensure [...], as discussed in the previous section. Upon applying the power conservation principle and the impedance inversion concept across the transmission line, it can be readily shown that (4) and (5) can be satisfied, as long as [...], where, by definition, [...]. Combining (4)-(7) results in [...]. Note that L_p1 = L_res1 and L_p2 = L_res2, as shown in Fig. 8(a). Since these inductors are used to resonate the output capacitances C_1 and C_2 of the main and the peaking PAs, respectively, (8) and (9) can be rewritten as [...]. It is important to note that Z_Main,0dB PBO and Z_Peak,0dB PBO determine the output power of the PA, and the size of the PA (total transistor width) needs to be designed such that it is inversely proportional to these impedances for achieving maximum efficiency. Also, note that the size of the PA directly affects its output capacitance; therefore, for a given topology, one can conclude that [...], where β is a constant, and its value can be determined through simulations. In a general-purpose 65-nm CMOS process, for a cascode cell using the nominal thin-oxide transistors, β = 15 Ω·pF. Therefore, [...]. Next, since the series inductance of the 90° transmission line, shown in Fig. 8(a), is implemented using the leakage inductance of the transformers, [...]. Finally, the design parameters of the proposed output network can be solved by using (13)-(16), resulting in [...]. For a load resistance of 50 Ω, the above equations can be plotted with respect to the operating frequency for a family of efficiency enhancement back-off levels, as shown in Fig. 9. These plots can be used to determine the values for L_s1, L_s2, k_1, and k_2. It is important to note that these plots assume β = 15 Ω·pF, and the value of β will change (and the plots will need to be regenerated) if a different CMOS process is used, or if a different unit cell topology is used (e.g., a cascode cell with thick-gate oxide transistors).
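As a rough numerical illustration of the resonance condition and the role of β, the sketch below computes the capacitance that resonates with a given primary inductance and then converts it to an impedance. Two assumptions are made here that are not confirmed by the text: the primary inductance alone resonates the PA output capacitance (ω²·L·C = 1), and β relates impedance and capacitance through Z·C = β, which is inferred only from its Ω·pF unit. The inductance values are the ones reported in Section III for 6.5 GHz.

```python
import math

BETA_OHM_PF = 15.0  # value quoted in the text for the 65-nm cascode cell


def resonant_capacitance_pf(l_res_nh, freq_hz):
    """Capacitance (pF) that resonates with l_res_nh (nH) at freq_hz."""
    omega = 2.0 * math.pi * freq_hz
    return 1.0 / (omega ** 2 * l_res_nh * 1e-9) * 1e12


def impedance_from_beta(c_pf, beta_ohm_pf=BETA_OHM_PF):
    """Assumed relation Z * C = beta (a hypothetical reading of the constant)."""
    return beta_ohm_pf / c_pf


if __name__ == "__main__":
    freq = 6.5e9
    c_main = resonant_capacitance_pf(0.63, freq)  # L_p1 from Section III
    c_peak = resonant_capacitance_pf(0.31, freq)  # L_p2 from Section III
    z_main = impedance_from_beta(c_main)
    z_peak = impedance_from_beta(c_peak)
    print(f"C1 = {c_main:.2f} pF, C2 = {c_peak:.2f} pF, "
          f"Z_main/Z_peak = {z_main / z_peak:.2f}")
```

Under these assumptions the impedance ratio comes out close to 2, in line with Z_Main = (α − 1)·Z_Peak for α = 3; the absolute impedance values should be treated as indicative only.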

III. IMPLEMENTATION
The overall block diagram of the implemented transmitter is shown in Fig. 10. A CW differential RF signal is provided to a three-ring PPF whose outputs are used to injection lock an 8ILRO to achieve eight evenly distributed phases, which behave as the basis vectors for the digital PA. Based on the desired output phase, the two adjacent basis vectors, φ_A/φ_B, and their differential counterparts are selected through an eight-phase basis vector mapper and fed to the main PA. Since the peaking PA is driven 90° out of phase with respect to the main PA, φ_C/φ_D and their differential counterparts are fed to the peaking PA. Therefore, all eight basis vectors are always assigned to either the main or the peaking PA at any given time. The digital PA cells perform the nonlinear combination of these basis vectors to produce the desired output amplitude and phase. The outputs of the main and the peaking PAs are combined through a single-footprint transformer-based Doherty network to achieve efficiency enhancement in deep PBO and a single-ended output while maintaining a compact form factor. It is important to note that process, voltage, and temperature (PVT) variations can introduce errors in the phase relationship of the basis vectors. This could degrade the performance of the output network since the behavior of the Doherty architecture relies on the phase relationship of the amplifiers, as shown in Fig. 7(a). Techniques such as delay alignment [24], over-drive voltage tuning [5], and duty-cycle correction [47] can be used to compensate for these effects, though they were not implemented in this work.

A. WIDEBAND EIGHT-PHASE GENERATION
The overall block diagram of the implemented wideband eight-phase generation technique is shown in Fig. 11. The three-ring PPF generates eight-phase vectors that are fed to bias tee blocks followed by NOT gate buffers. The bias tee ensures that the signal swing is centered at V_DD/2 to maximize the gain provided by the buffers, which are used to overcome the voltage attenuation caused by the PPF [48]. The PPF outputs exhibit significant phase deviations (explained below); therefore, the buffered signals are injected into an 8ILRO to reduce these deviations. Only nMOS devices are used for injection into the oscillator to prevent dc bias interaction at the injection nodes [49], [50], and cross-coupled NOT gates are added between the differential nodes of the ring oscillator to prevent it from latching. Finally, another set of bias tee blocks followed by NOT gate buffers is added at the outputs of the oscillator (not shown in the figure) to decouple the dc value of the signal swing from the oscillator's power supply.
All of the resistors and capacitors of the three-ring PPF are chosen to be about 800 Ω and 30 fF, respectively, resulting in an overall high input impedance (single-ended) of about 400 Ω || 60 fF. This impedance can be easily matched close to 50 Ω for a wide range of operating frequencies by simply placing 56-Ω resistors in parallel at the differential RF inputs. The resulting post-parasitic extracted voltage swing attenuation from the input (single-ended) to the outputs (single-ended) of the PPF is shown in Fig. 12(a). The attenuation increases with the frequency of operation due to parasitic load capacitance. Although the PPF is robust to process and temperature variations [48], the mismatch between the resistive and the capacitive elements can degrade the performance. In this work, the layout-related asymmetry is the dominant source of mismatch, causing the outputs of the PPF to exhibit a significant amount of deviation from the ideal phase, as high as −17°, as depicted in Fig. 12(b). However, the ILRO corrects these errors and substantially reduces the deviation to < ±1.5°, as shown in Fig. 12(c), highlighting the advantage of the implemented 8P-8ILRO technique. Additionally, the 2-D lookup table, mentioned in Section II-A, further reduces the effect of this error on transmitter performance.
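The input matching described above can be checked with a simple single-ended model. The sketch below treats the PPF input as 400 Ω in parallel with 60 fF, shunted by a 56-Ω resistor, and computes the reflection coefficient against 50 Ω. This is a simplified assumption that ignores pads, routing, and the external balun, and the function names are illustrative.

```python
import math


def parallel(*impedances):
    """Parallel combination of real or complex impedances."""
    return 1.0 / sum(1.0 / z for z in impedances)


def input_reflection_db(freq_hz, r_ppf=400.0, c_ppf=60e-15, r_shunt=56.0, z_ref=50.0):
    """Reflection coefficient magnitude (dB) of the simplified single-ended input."""
    omega = 2.0 * math.pi * freq_hz
    z_cap = 1.0 / (1j * omega * c_ppf)
    z_in = parallel(r_ppf, z_cap, r_shunt)
    gamma = (z_in - z_ref) / (z_in + z_ref)
    return 20.0 * math.log10(abs(gamma))


if __name__ == "__main__":
    print(f"400 || 56 ohm = {parallel(400.0, 56.0):.1f} ohm")
    for f_ghz in (4.5, 5.5, 6.5, 7.0):
        print(f"{f_ghz:.1f} GHz: |S11| = {input_reflection_db(f_ghz * 1e9):.1f} dB")
```

The resistive part lands at about 49 Ω, and the small shunt capacitance keeps the modeled reflection low across the band, which is why a plain shunt resistor is sufficient here at the cost of some input power.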

B. EIGHT-PHASE BASIS VECTOR MAPPER
A standard multiplexer tree with 3-bit control can be implemented as the mapper; however, it results in a total of 56 2-to-1 multiplexers (seven multiplexers per path for eight paths) and a considerable number of cross-overs between the signals since all eight input signals need to be routed to all eight paths. Such a design increases parasitic coupling that results in increased multiphase error. However, given that the outputs of the mapper have a fixed phase relationship, its design can be simplified with the use of only 24 multiplexers that are divided into three columns with a 3-bit control, as shown in Fig. 13. Further, each input signal is only routed to two switches, which ameliorates the parasitic coupling issue. The outputs of the mapper are followed by an additional set of NOT gate buffers in a fan-out structure, as shown in Fig. 10, to drive the PA unit cells.
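The fixed phase relationship at the mapper outputs can be illustrated with a small behavioral model. In the sketch below, the 3-bit control code is assumed to equal the 45° sector index, φ_A is assumed to be the lower edge of the sector, and the peaking PA is assumed to be offset by +90°; these conventions and the function names are hypothetical and are not taken from the implemented circuit.

```python
def sector_from_phase(phase_deg):
    """Return the 45-degree sector index (0..7) containing the desired output phase."""
    return int((phase_deg % 360.0) // 45.0)


def map_basis_vectors(sector):
    """Return (main, peaking) phase lists in degrees for a 3-bit sector code.

    Assumptions: phi_A = 45 * sector, phi_B = phi_A + 45, and the peaking PA
    receives phi_C/phi_D offset by +90 degrees. Each list also contains the
    differential (180-degree) counterparts.
    """
    if not 0 <= sector <= 7:
        raise ValueError("sector must be a 3-bit code (0..7)")
    phi_a = (45 * sector) % 360
    phi_b = (phi_a + 45) % 360
    main = [phi_a, phi_b, (phi_a + 180) % 360, (phi_b + 180) % 360]
    peaking = [(phi + 90) % 360 for phi in main]
    return main, peaking


if __name__ == "__main__":
    main, peaking = map_basis_vectors(sector_from_phase(100.0))
    print("main PA phases:   ", main)
    print("peaking PA phases:", peaking)
```

For every sector code, the main and peaking lists together cover all eight phases exactly once, which reflects the statement that all eight basis vectors are always assigned to one of the two PAs.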

C. PA UNIT CELLS
The peaking PA is designed to be twice the size of the main PA (i.e., α = 3) for the following three advantages.
1) It results in an asymmetric Doherty network that ideally achieves efficiency enhancement up to 9.5-dB PBO, i.e., 20·log(α) dB with α = 3.
2) It eases the design of a transformer-within-transformer structure since the asymmetric architecture inherently requires two asymmetric transformers, making it easier for one to be inserted inside another.
3) It reduces a significant amount of layout effort due to the reuse of implemented cells. For example, the peaking PA can be implemented by simply "copying and pasting" the main PA twice; similarly, the buffers driving the main PA can be reused.
It is important to note that since the peaking PA is double in size compared to the main PA, the output of the eight-phase basis vector mapper needs to drive uneven loads. Therefore, a dummy buffer is added to compensate for this mismatch, as shown in Fig. 10.
The block-level schematic of the unit cell is shown in Fig. 14. Every cell receives both of the adjacent basis vectors (φ_A/φ_B) and their corresponding enable signals. A symmetric NAND gate is utilized to ensure identical load capacitance for all the RF paths [24]. The output stage is implemented using a cascode topology to allow for a higher voltage swing handling capability, and thereby increase output power. The sizes of all the devices are also mentioned in Fig. 14.
The PAs are implemented using 6-bit thermometer-coded unit cells that are arranged in a 2-D grid consisting of row and column signals. The enable signal is calculated through a simple digital logic given by [(row_i · column_j) + row_(i+1)], where 0 ≤ i, j ≤ 7, row_0 = 1, row_8 = 0, and column_7 = 0. To reduce the number of pads on-chip, these row and column signals are derived from a 6-bit binary code that is split into two sets of 3-bit binary codes, and each of them is then converted to a 7-bit thermometer code through an on-chip decoder [9]. Since each cell needs two enable signals, one per basis vector, the main PA and the peaking PA each have a total of 2×6-bit control. The two enable signals are designed to traverse the 2-D grid in reverse order to ensure that only one enable signal is activated per unit cell at any given time [11]. TSPC-based flip-flops are used in every unit cell to correct the timing mismatch associated with the control signals.

Fig. 9 is used as the starting point for the output network design operating at 6.5 GHz that achieves efficiency enhancement at 9.5-dB PBO. The resulting design parameters are L_p1 = 0.63 nH, L_s1 = 1.07 nH, L_p2 = 0.31 nH, L_s2 = 0.74 nH, k_1 = 0.77, k_2 = 0.67, and C_OMN = 1.47 pF.
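The row/column enable logic can be modeled behaviorally as follows. The sketch assumes that the upper three bits of the 6-bit code drive the row thermometer code and the lower three bits drive the column thermometer code, and that thermometer bit k is set when k is below the binary value; this bit ordering and the function names are assumptions for illustration.

```python
def thermometer(value, width):
    """Return `width` thermometer bits: bit k is 1 when k < value."""
    return [1 if k < value else 0 for k in range(width)]


def enabled_cells(code):
    """Return the 8x8 grid of enable bits for a 6-bit code.

    Each cell uses enable = (row_i AND column_j) OR row_(i+1),
    with row_0 = 1, row_8 = 0, and column_7 = 0.
    """
    if not 0 <= code <= 63:
        raise ValueError("code must be a 6-bit value (0..63)")
    # Assumption: upper 3 bits select rows, lower 3 bits select columns.
    row_code, col_code = code >> 3, code & 0b111
    row = [1] + thermometer(row_code, 7) + [0]   # row_0 .. row_8
    column = thermometer(col_code, 7) + [0]      # column_0 .. column_7
    return [[(row[i] & column[j]) | row[i + 1] for j in range(8)]
            for i in range(8)]


if __name__ == "__main__":
    grid = enabled_cells(19)  # 19 = 2 full rows + 3 cells in the next row
    for line in grid:
        print("".join(str(bit) for bit in line))
    print("cells on:", sum(map(sum, grid)))
```

Under these assumptions, rows below the row code are fully enabled through the row_(i+1) term, the row at the row code is partially enabled by the column code, and the number of active cells equals the 6-bit code.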

D. SINGLE-FOOTPRINT TRANSFORMER-BASED DOHERTY NETWORK
To achieve a compact single-footprint design, a transformer-within-transformer structure is proposed and implemented in this work. However, placing one transformer inside another leads to parasitic magnetic coupling between L_p1-L_p2, L_p1-L_s2, L_p2-L_s1, and L_s1-L_s2, defined as k_p1, k_p2, k_p3, and k_p4, respectively. This undesired coupling can significantly reduce the performance of the output network. Therefore, to reduce the parasitic coupling, the inner transformer is twisted into a figure-8 structure, as shown in Fig. 15(a). The resulting magnetic fluxes in the two "octagons" of the figure-8 structure are almost equal in magnitude but opposite in phase; therefore, the net current induced in the non-figure-8 loops by the figure-8 loops is close to zero, reducing the parasitic coupling [51].
Iterative EM simulations are performed to reach the final solution shown in Fig. 15(b). The passive efficiency, which accounts for the power loss in the implemented Doherty network, is close to 70% at maximum output power, and almost 59% when the peaking PA is switched off, as shown in Fig. 15(c). Fig. 15(d) shows that the impedance seen by the main PA is almost twice that seen by the peaking PA at maximum output power, as desired, and Z_Main increases by more than 2× when the peaking PA is turned off, which results in efficiency enhancement in PBO. Note that ideally, Z_Main needs to increase by 3×, but the impedance boost is reduced in practical implementations due to the loss in the transformers and the noninfinite off resistance of the peaking PA, preventing the efficiency enhancement from reaching the design value of 9.5-dB PBO.

IV. MEASUREMENTS
The die photograph of the implemented transmitter is shown in Fig. 16. The 27 control bits (12 bits each for the main and the peaking PA + 3 bits for the basis vector mapper) are driven by an external pattern generator. The CW RF input is generated externally through Keysight's N5245A PNA-X vector network analyzer followed by an external balun to produce a differential signal at the transmitter's operating frequency, which is then provided to the chip. The output cascode stage of the transmitter uses a 1.25-V supply, the NOT gate driving the CS stage of the cascode structure uses a reduced 0.85-V supply to improve the efficiency of the output stage, and the injection-locked ring oscillator supply (V_RO) is set to 1.01 V to achieve the optimal phase noise performance at 6.5 GHz, as shown below; the rest of the supplies use 1.2 V. For CW measurements, the output of the transmitter is connected to an external wideband power splitter, where one of the outputs is connected to the PNA-X for measuring the output phase, while the other is connected to a Rohde & Schwarz NRP-Z57 power sensor to measure the output power. For modulated measurements, the output of the transmitter is connected to Keysight's N9030A PXA spectrum analyzer for vector signal analysis.

A. CW MEASUREMENTS
The measured differential input reflection coefficient (S_dd,11) is shown in Fig. 17, and it remains below −20 dB throughout the operating range. Next, the phase noise performance and the locking range of the 8P-8ILRO are measured for an input power of 6 dBm (3 dBm at each of the differential RF inputs) provided to the chip. For a V_RO of 1.01 V, the implemented ILRO achieves optimal phase noise performance at 6.5 GHz, as shown in Fig. 18. Even though the phase noise degrades as the input frequency is varied, the oscillator achieves a locking range of more than 1 GHz at a constant V_RO. The phase noise measurement for the entire achievable locking range is also shown in the figure and is compared to the phase noise of the input PNA-X signal (green) provided to the chip. The locking range can be improved by increasing the input power; however, this reduces the transmitter gain. Thus, the input power is set to 6 dBm to achieve a gain of about 14 dB for the entire operating frequency range. It is important to note that the phase noise of the ILRO improves as the amplitude of the injection signal is increased; however, in this work, the performance is limited by the measurement setup.
Next, the frequency dependence of the transmitter is measured for DE, P out, and gain, as shown in Fig. 19(a) and (b). Here, the supply of the ring oscillator is varied manually with respect to the operating frequency to achieve optimal phase noise performance and to compensate for PVT variations of the ring oscillator. Automatic supply tuning can be achieved through a feedback loop from the output of the ring oscillator, though it was not implemented in this work. The transmitter achieves a DE of more than 30% for the frequency range of 4.5 to 7.0 GHz, with a maximum of 38% at 5.75 GHz, 34% at 6.5 GHz, and 31% at 7.0 GHz. It also achieves more than 20% DE in PBO (peaking PA turned off) from 6.1 to 7.0 GHz, with 24% at 6.5 GHz, and a maximum of 26% at 7.0 GHz, as shown in Fig. 19(a). The narrow-band nature of the 90° transmission line in the output network limits the DE in PBO at lower operating frequencies. Finally, the transmitter attains more than 20-dBm P out and close to 14-dB gain from 4.5 to 6.7 GHz, as depicted in Fig. 19(b). This response indicates the strength of the proposed 8P-8ILRO technique for generating multiphase signals over a wide range of operating frequencies.
Given that the transmitter performs optimally in PBO above 6.1 GHz, as seen in Fig. 19(a), it is measured for DE and system efficiency (SE) for the entire PBO range at 6.25, 6.5, 6.75, and 7.0 GHz, and it is compared with a normalized class B PA's performance, as shown in Fig. 20(a)-(d), respectively. Here, the SE includes the power dissipated in the output cascode stage, all of the digital circuits (eight-phase basis vector mapper + LO distribution buffers + digital circuitry in every unit cell), the ILRO, and the input RF power. The power dissipation breakdown for these blocks is shown in Table 1. The implemented transmitter achieves a maximum DE of 34%, 33%, 32%, and 31%, and a DE of 22%, 23%, 24%, and 24% at 8-dB PBO, at the aforementioned frequencies, respectively. This corresponds to an improvement of 1.61×, 1.76×, 1.88×, and 1.93× compared to a normalized class B PA, highlighting the improved performance offered by the proposed single-footprint transformer-within-transformer-based asymmetric series Doherty network.
Next, the DE at maximum P out (m + n = p) contour is measured with respect to the normalized output phase for all 3-bit combinations of the basis vector mapper code bits at 6.5 GHz, as shown in Fig. 21. This indicates that the worst-case efficiency at maximum P out is relatively high; for example, code <101> results in a maximum DE of 33% and a worst-case DE of 25%, which corresponds to a 0.24× reduction. On the other hand, an idealized quadrature architecture, described in Section II-A, shows a simulated worst-case DE reduction of 0.45×. This illustrates the advantage of the proposed eight-phase architecture.
Fig. 22 shows the measured output voltage amplitude and output phase at 6.5 GHz over the entire back-off range, and for all the basis vector mapper control bit combinations, resulting in a total of 48 896 data points. The curves on each colored sector represent the AM-PM nonlinearity caused by the variation in output capacitance of the unit cells as they are switched on and off. Given the current-mode implementation of each unit cell, this transmitter also exhibits a significant amplitude modulation-to-amplitude modulation (AM-AM) nonlinearity, which is captured in this measurement as well. As explained in the next section, this measured data is used to create a 2-D lookup table for performing modulated measurements.
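The class B comparison and the worst-case DE reduction quoted above can be roughly reproduced with the sketch below. It assumes that "normalized class B" means an ideal class B PA with the same peak DE whose efficiency scales with the output voltage amplitude, that is, by 10^(-PBO/20); this interpretation and the function names are assumptions, not taken from the text. Because the inputs are the rounded percentages reported above, the computed ratios differ from the published ones by up to about 0.02.

```python
def class_b_de_at_pbo(de_peak_pct: float, pbo_db: float) -> float:
    """DE of an ideal class B PA at a given back-off, normalized to de_peak_pct.

    Assumption: class B efficiency is proportional to output voltage amplitude.
    """
    return de_peak_pct * 10.0 ** (-pbo_db / 20.0)

def improvement_over_class_b(de_pbo_pct: float, de_peak_pct: float, pbo_db: float) -> float:
    """Ratio of the measured back-off DE to the normalized class B DE."""
    return de_pbo_pct / class_b_de_at_pbo(de_peak_pct, pbo_db)

def relative_reduction(de_max_pct: float, de_worst_pct: float) -> float:
    """Fractional drop from the maximum DE to the worst-case DE."""
    return (de_max_pct - de_worst_pct) / de_max_pct

# Reported (peak DE %, DE % at 8-dB PBO) per frequency in GHz.
reported = {6.25: (34.0, 22.0), 6.5: (33.0, 23.0), 6.75: (32.0, 24.0), 7.0: (31.0, 24.0)}
ratios = {f: improvement_over_class_b(pbo, peak, 8.0) for f, (peak, pbo) in reported.items()}
for f, r in ratios.items():
    print(f"{f} GHz: {r:.2f}x over normalized class B")

# Code <101>: maximum DE of 33% and worst-case DE of 25%.
print(f"worst-case reduction: {relative_reduction(33.0, 25.0):.2f}x")
```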

B. MODULATION MEASUREMENTS
For modulated measurements, the desired oversampled baseband modulated waveform is first generated in MATLAB, and then it is mapped to the corresponding control bits of the transmitter using the 2-D lookup table. The resulting digital pattern is uploaded to the pattern generator's memory, which is triggered using the same clock source that is provided to the chip for clocking the TSPC flip-flops.
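The mapping step can be illustrated with a minimal sketch. The text does not specify how the lookup table is searched, so the nearest-neighbor search below is an assumption, and the table contents are synthetic: a hypothetical (sector, amplitude level) code pair stands in for the real 3-bit mapper code and the main and peaking PA codes, and ideal outputs stand in for the measured data of Fig. 22.

```python
import numpy as np

def nearest_code(samples, lut_codes, lut_outputs):
    """Map each desired complex sample to the code whose table output is closest.

    samples:     desired complex baseband samples
    lut_codes:   (N, K) integer array, one control code per table entry
    lut_outputs: (N,) complex array, output produced by each code
    """
    samples = np.atleast_1d(np.asarray(samples, dtype=complex))
    # Distance from every sample to every table entry, shape (num_samples, N).
    dist = np.abs(samples[:, None] - lut_outputs[None, :])
    return lut_codes[np.argmin(dist, axis=1)]

# Synthetic table (NOT measured data): 8 sectors spaced by 45 degrees,
# 4 amplitude levels per sector, with ideal linear amplitude and phase.
lut_codes = np.array([(s, a) for s in range(8) for a in range(1, 5)])
lut_outputs = np.array([(a / 4.0) * np.exp(1j * np.pi / 4.0 * s) for s, a in lut_codes])

desired = [0.5j, 0.98 * np.exp(0.05j)]
codes_out = nearest_code(desired, lut_codes, lut_outputs)
print(codes_out.tolist())
```

With measured data in place of the ideal outputs, the same search would absorb the AM-AM and AM-PM nonlinearity, since each sample is matched against what the transmitter actually produced for each code.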
The measured 32-QAM and 64-QAM constellations at maximum P out with increasing symbol rates at 6.5 GHz are shown in Fig. 23(a) and (b), respectively. For 32-QAM, the transmitter achieves a DE of 21% with an average P out of 14 dBm, EVM rms of 3.5%, and ACLR of 30.6 dBc for a 12.5-MSym/s waveform. The EVM rms increases to 5.4% and the ACLR reduces to 19.5 dBc as the symbol rate is increased to 40 MSym/s. Similarly, for 64-QAM, the transmitter achieves a DE of 21% with an average P out of 14 dBm, EVM rms of 4.0%, and ACLR of 30.9 dBc for a 12.5-MSym/s waveform. The EVM rms increases to 4.5% and the ACLR reduces to 24.5 dBc as the symbol rate is increased to 25 MSym/s. The degradation in EVM rms at higher symbol rates is due to the timing mismatch related to the asynchronous 3-bit control of the eight-phase basis vector mapper, which limits the data rate of the transmitter.
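For reference, an r.m.s. EVM figure of the kind reported above can be computed from received and ideal symbols as sketched below. The normalization to the r.m.s. power of the reference constellation is one common convention and is an assumption here; the analyzer used for the measurements may normalize differently (for example, to the peak constellation power). The symbols in the example are synthetic.

```python
import numpy as np

def evm_rms_percent(received, reference):
    """r.m.s. EVM in percent, normalized to the r.m.s. reference power."""
    received = np.asarray(received, dtype=complex)
    reference = np.asarray(reference, dtype=complex)
    error_power = np.mean(np.abs(received - reference) ** 2)
    reference_power = np.mean(np.abs(reference) ** 2)
    return 100.0 * np.sqrt(error_power / reference_power)

# Synthetic example: a QPSK-like reference with a constant 0.1 offset error.
reference = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
received = reference + 0.1
evm = evm_rms_percent(received, reference)
print(f"EVM rms = {evm:.2f}%")
```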
The transmitter can also produce a 128-QAM constellation for a 12.5-MSym/s waveform at 6-dB PBO (average P out = 7.5 dBm) with a DE of 13% and EVM rms of 2.1%, as shown in Fig. 24(a). It achieves an ACLR of 29.9 dBc, as shown in Fig. 24(b).
The modulation performance of the transmitter (DE and EVM rms) is also characterized for dependence on P out, for 16-QAM, 32-QAM, and 64-QAM constellations at 20 MSym/s, and for 128-QAM constellation at 12.5 MSym/s, as depicted in Fig. 25(a)-(d), respectively. As expected, the DE increases with P out, but the EVM rms also degrades. For example, as the average P out is increased from 7 to 14 dBm for the 64-QAM waveform, the DE increases from 12% to 21%; however, the resulting EVM rms also increases from 2.9% to 4.1%.
Finally, the far-out spectrum for a 20-MSym/s 64-QAM waveform at 6.5 GHz is shown in Fig. 26. Since the baseband waveform is 8× oversampled, the zero-order hold (ZOH) sampling images occur at 160-MHz offset. These can be lowered by increasing the oversampling factor or using a higher-order hold [24].
Table 2 compares the implemented transmitter with other state-of-the-art CMOS transmitters and PAs operating above 5 GHz that also achieve efficiency enhancement in PBO. The efficiency numbers reported in the table for this work are taken for the scenario when the transmitter is operating completely along the basis vector. As shown in Table 2, the implemented transmitter achieves significantly lower die area compared to other works owing to the proposed transformer-within-transformer structure, while also demonstrating competitive performance in PBO.
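The 160-MHz image offset and the benefit of a higher oversampling factor can be checked with a short sketch. It uses the ideal ZOH magnitude response, |sin(πf/fs)/(πf/fs)|, and assumes a signal band edge at half the symbol rate (10 MHz), which ignores pulse-shaping roll-off; the function names are illustrative.

```python
import math

def zoh_image_offset_mhz(symbol_rate_msym: float, oversampling: int) -> float:
    """Offset of the first ZOH sampling image, equal to the sample rate."""
    return symbol_rate_msym * oversampling

def zoh_envelope_db(offset_mhz: float, sample_rate_mhz: float) -> float:
    """Ideal ZOH (sinc) magnitude response in dB at a given frequency offset."""
    x = offset_mhz / sample_rate_mhz
    if x == 0.0:
        return 0.0
    magnitude = abs(math.sin(math.pi * x) / (math.pi * x))
    # The response has nulls at multiples of the sample rate.
    return 20.0 * math.log10(magnitude) if magnitude > 0.0 else float("-inf")

symbol_rate = 20.0          # MSym/s, as in the measurement
band_edge = symbol_rate / 2  # MHz, assumed signal band edge

for osr in (8, 16):
    fs = zoh_image_offset_mhz(symbol_rate, osr)
    # Inner edge of the first image, the part closest to the wanted signal.
    level = zoh_envelope_db(fs - band_edge, fs)
    print(f"OSR {osr}: image at {fs:.0f} MHz, inner-edge envelope {level:.1f} dB")
```

Under these assumptions, doubling the oversampling factor both moves the first image from 160 to 320 MHz and lowers its inner-edge level by roughly 6 dB.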

V. CONCLUSION
This work proposes two techniques: 1) a current-mode multiphase digital PA architecture to overcome the bandwidth expansion associated with the polar architecture, while still achieving relatively constant output power and DE profiles and 2) a transformer-within-transformer structure for realizing a compact asymmetric Doherty output matching network, while also improving the PA's efficiency in PBO. Additionally, an 8P-8ILRO technique is demonstrated for wideband multiphase generation, which removes the need for an RF input at 4× the operating frequency. These techniques are implemented in a general-purpose 65-nm CMOS process to demonstrate a digital transmitter with a small die area and competitive performance around 6.5 GHz.