A mm-Wave Switched-Capacitor RFDAC

This article proposes an interleaving switched-capacitor RF digital-to-analog converter (RFDAC) that uses an edge combiner within the output stage to implicitly triple its effective clock carrier frequency and enable mm-wave (mmW) operation. Tripling in the output stage increases energy efficiency, which is further improved by employing an edge-combining-based frequency-tripling delay-locked loop (DLL) in the clock generation network. The clock tripling is performed in each slice of the switched-capacitor PA (SCPA), which allows yet another 3× frequency reduction for the global clock distribution. Finally, a new layout structure accounts for transmission-line (TL) effects that arise from the large physical size of the passive capacitor array. Implemented in 22-nm FD-SOI, the prototype achieves $P_{out}$ > 21 dBm, drain efficiency > 36%, and system efficiency > 22% while operating in the Ka-band at 28 GHz. Modulation at 2.4 Gb/s results in 3.3% EVM and 30.8-dBc adjacent channel leakage ratio (ACLR).


I. INTRODUCTION
SINCE the introduction of "digital RF" [1], digital transmitters (TXs) have gained a lot of interest due to their amenability to CMOS technology scaling. Many digitally intensive TX designs have followed over the years with ever-improving performance [2]-[7]. Among them, digital power amplifiers (DPAs) and RF digital-to-analog converters (RFDACs) have consistently become more competitive when compared to traditional analog PAs, particularly in the sub-6-GHz range. With modern 5G signals, higher output power, higher energy efficiency (at peak and back-off output power), and increased linearity are the key considerations. In addition, the ability to operate at mm-wave (mmW) is gaining importance.
DPAs and RFDACs combine the functionality of a DAC+mixer and a power combiner, as shown in Fig. 1. They are amenable to technology scaling because they use the transistor as a switch. In general, they can be classified into two architectures based on their signal summation (i.e., current and voltage modes). For the current-mode architectures, several innovative designs have been presented. In [4], the current-mode-based class-E switching PA utilized single-MOS devices as power switches for the RFDAC's unit cell, which is easy to implement but requires digital pre-distortion (DPD) to achieve linear operation. In addition, a power-combining network was required to obtain high output power, and the power supply was increased, resulting in potential reliability issues. In contrast, voltage-mode structures [e.g., the switched-capacitor PA (SCPA)] [2], [5], [8], [9] show better reliability (due to fixed voltage swings on all devices) and linearity without using DPD. In fact, the SCPA can exhibit improved linearity compared to analog PAs [10], without compressive behavior, when bondwires are properly accounted for. Hence, the SCPA typically does not require additional output power back-off for linear operation [11]. In addition, RFDACs typically offer improved energy efficiency when compared to analog PAs. At sub-6-GHz frequencies, RFDAC structures have recently been improved with many innovations that focus on increasing the back-off efficiency using switched transformers [12]-[14] or sub-harmonic LO techniques [15], [16]. A key challenge with RFDACs is that their operating frequency has mostly been limited by the transistor's ability to switch at higher frequencies.
In contrast, analog PAs can operate at any frequency up to the transistor's $f_{max}$. Because of this, CMOS analog PAs are dominant at mmW frequencies, with many architectures proposed to obtain improved performance in the key metrics of linearity, output power, and efficiency.
Recently, a current-mode inverse-outphasing TX [17] and a Doherty PA [18] were introduced to work in the Ka-band and obtained high efficiency (>30%) with high output power (>20 dBm). Outphasing and Doherty architectures have proven reliable in improving back-off efficiency. However, they require a complex passive output network, which may limit the usable bandwidth. To operate as digital TXs, the aforementioned architectures also require a high-speed DAC that decreases the system efficiency, $\eta_{SE}$ [18]. To overcome this, power DACs [19], [20] have shown promise. In [19], multi-stacked power MOSFETs were proposed to work as DACs at mmW, while Gilbert cells were used as up-converters; however, due to the wide-bandwidth specification, the peak $\eta_{SE}$ was reduced to 10.3%. In addition, at mmW, distributed effects contribute significantly to load-impedance variation, resulting in output power reduction. The synthesized impedance variation technique shows a means to compensate for the distributed effects in the combination network and increases $\eta_{SE}$ up to 22%. A DPA with non-constant envelope-modulation class-E at 60 GHz in [21] obtained 17.7% efficiency, while a polar RFDAC with a transformer-coupled two-stage PA [22] achieved 15.3% efficiency.
Across all of the efforts to bring RFDACs to the mmW band, the biggest issue is the reduction of efficiency in the output stage, which degrades both the total system efficiency and wideband operation. First, deploying polar RFDACs is not advisable due to the polar bandwidth expansion [1]; hence, I/Q architectures are the best choice for mmW wide-bandwidth requirements. The efficiency degradation at mmW comes from three primary sources. As shown in Fig. 1, the conventional I/Q RFDAC [3] contains a logic gate that mixes the LO signal with the baseband data ➀. Due to the high frequency of operation, generating the LO to switch the output stage is challenging and consumes significant power, causing the degradation of $\eta_{SE}$ ➁. This signal then controls a parallel bank of switches/current sources that are summed/combined at the output before being fed to the antenna ➂. In this case, the node where the switches/current sources are combined must account for the distributed effects at mmW. Beyond this, the switching elements are constructed from MOS devices whose switching speed is limited by the parasitic device capacitance. As a result, hard switching is not generally possible at mmW [23]. In addition to efficiency, achieving good linearity has proven difficult for mmW RFDACs because, to date, all have operated in the current domain. Current division causes saturation of the output voltage characteristic in this domain; hence, it has been necessary to use low-to-moderate bandwidth signals with complicated DPD to meet linearity requirements.
To solve the challenges limiting RFDACs at mmW, an edge-combining (EC) SCPA (EC-SCPA) is presented for the first time. The EC-SCPA uses an embedded frequency multiplier to bring the voltage-mode RFDAC to Ka-band operation, maintaining linearity and efficiency similar to those of lower frequency SCPAs. In the proposed EC-SCPA, the system efficiency is enhanced by embedding sub-harmonic injection-locking techniques to generate the LO/clock signals, and a novel partitioned layout overcomes distributed layout effects. The proposed EC-SCPA achieves peak drain efficiency $\eta$ = 36% and system efficiency $\eta_{SE}$ = 22% with good linearity while transmitting single-carrier 16- and 64-QAM signals.
This article is organized as follows. Section II explains the limitations and main challenges in bringing RFDACs to mmW based on SCPA analysis and investigation and then shows the proposed circuit innovation architecture for the mmW RFDAC. Section III describes the circuit implementation. Measurement results and discussion are presented in Section IV, which showcases the performance of the prototype design.

A. Conventional SCPA at mmW
Voltage-mode RFDACs (e.g., SCPAs [2]) have proven nearly optimal for wireless TXs in CMOS because they occupy a small chip area, are energy efficient, and benefit from technology scaling. Every transistor in an SCPA is switched ON/OFF at the carrier frequency rate of $f_c$ (i.e., the devices are not linear current sources); hence, the power consumption scales with $C V_{DD}^2 f_c$. The SCPA is a segmented class-D amplifier (see Fig. 2) where switches and capacitors are segmented into parallel paths that are digitally enabled (disabled) to increase (decrease) the RF output voltage, thereby providing an overall linear amplitude control at the RF output. The cascoded switch requires two voltage supplies, $V_{DD}$ and $0.5 \times V_{DD}$, but it is noted that a cascode is not required, particularly for low-power operation. The total capacitance seen from the inductor is the sum of all unit capacitors $C_{unit}$. In ideal operation, the inductor $L_M$ is resonant with the sum of all capacitors, and the resistor is equal to the optimal termination resistance, $R_{opt}$. $L_M$ and $R_{opt}$ can be replaced with an output-matching network, transforming from a fixed output impedance (e.g., 50 Ω) to the presented equivalent.
RFDACs have proven competitive for sub-6-GHz cellular and connectivity applications [4], [24]-[27]. They combine the embedded DAC and mixer functionalities with high RF output power capability into a single, energy-efficient, and globally linear block that utilizes only switching transistors. In the development and deployment of 5G technology, the mmW band plays a vital role. Hence, significant work has been done in mmW transceivers. Bringing direct-digital transmitters, including RFDACs [21], [22], to the mmW band has proven challenging due to the limited switching speed of CMOS transistors. Although challenging to implement, the direct-digital RFDAC approach has obvious advantages, such as enabling fully connected digital beamforming [9].
To understand the challenges in pushing RFDACs into mmW territory, we first assume a traditional RFDAC as represented by the SCPA in Fig. 2. It receives the N-bit digital baseband input signal $DATA_{N:1}$, which is then mixed with the carrier clock of frequency $f_c$ by a logic gate residing in each individual RFDAC cell. The outputs of the RFDAC cells are combined and filtered by the LC tank to generate the modulated output $RF_{out}$. The RFDAC is a combination of switched unit cells; hence, its switching speed is limited by the transition frequency, $f_T$, of the MOS devices that are used as switches, which in turn depends on the process technology. Hence, the highest operational frequency of the RFDAC, while maintaining the aforementioned advantages, is an initial point of investigation.
Now, a brief review of the SCPA follows. The SCPA is effectively an SC RFDAC. The unit cells of the SCPA are either switched at the RF carrier frequency, $f_c$, between $V_{DD}$ and $V_{GND}$ or held at $V_{GND}$, based on the input control signal. This control signal is derived from the input digital baseband, a binary N-bit stream operating at the rate of $f_{BB}$, which is typically lower than $f_c$. For a certain input code, a proportional capacitance ($C_{ON}$) undergoes switching, while the disabled capacitance ($C_{OFF}$) is shorted to ground. The total capacitance is $C_{total} = C_{ON} + C_{OFF}$. The amplitude of the top-plate voltage $V_D$ produced by the capacitor array can be calculated by scaling the supply voltage by the ratio of the enabled capacitance to the total capacitance: $V_D = (C_{ON}/C_{total}) \times V_{DD}$. $V_D$ is converted to $RF_{out}$ using a matching network that is resonant at the switching rate $f_c$. In theory, the ideal total efficiency ($\eta$) is 100% when considering lossless switches and passives.
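The amplitude relation $V_D = (C_{ON}/C_{total}) \times V_{DD}$ can be illustrated with a minimal sketch. It assumes an ideal array of equal unit capacitors, so that the capacitance ratio equals the ratio of enabled cells to total cells; the cell count and supply value below are illustrative assumptions, not prototype data.

```python
def top_plate_voltage(n_on, n_total, vdd):
    """Ideal V_D = (C_ON / C_total) * V_DD for an array of equal unit capacitors."""
    if not 0 <= n_on <= n_total:
        raise ValueError("n_on must be between 0 and n_total")
    return (n_on / n_total) * vdd


# Illustrative values only (assumed, not taken from the prototype).
N_CELLS = 63
VDD = 1.8

for code in (0, 21, 42, 63):
    print(code, top_plate_voltage(code, N_CELLS, VDD))
```

Under these assumptions the amplitude is strictly linear in the number of enabled cells, which is the property that gives the SCPA its linear amplitude control.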
In practice, based on the conventional derivation from [2], the total system efficiency $\eta_{SE}$, which accounts only for the switches and drivers, can be defined as
$$\eta_{SE} = \frac{P_{out}}{P_{dc}} = \frac{P_{out}}{P_{out} + P_{loss} + P_{driver}} \quad (1)$$
where $P_{out}$ is the RF output power, $P_{driver}$ is the power consumed by the driver, and $P_{loss}$ represents the combination of the switching loss $P_{sw}$ and the dynamic loss $P_{SC}$, which are calculated in (2) and (3), where $n$ ($N - n$) is the number of unit cells that are turned on (off) at the RF carrier rate $f_c$. The two losses in (2) and (3) depend on $f_c$. At sub-6 GHz, the unit capacitor $C_u = C_{total}/2^N$ is significantly larger than the switch parasitic capacitance and the parasitic components of the impedance-matching network. Hence, the dynamic power loss from the switch array has a lower effect on $\eta$. At mmW, the physical size of the passive impedance-matching network is reduced. Furthermore, the increased frequency is accompanied by a reduction of the total capacitance required in the capacitor array such that its $C_u$ is comparable to the parasitic capacitance of the switches. Hence, the parasitic capacitors (e.g., $C_{gs}$ and $C_{ds}$) have a significant impact on reducing $\eta$. To quantify the impact, the parasitic capacitors are added into (2) as $C_{par}$, which is rewritten as (4). Therefore, $P_{total}$ is reduced by the factor $C_{par} V_{DD}^2 f$. In addition, at mmW, the switching frequency $f_c$ will degrade the output power $P_{out}$ because the switching pulses will have relatively slower transitions due to the $f_T$ limitations of the switching devices.
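To make the role of each term in (1) concrete, the following sketch evaluates the system efficiency from the output power and the two loss contributions. The function names are hypothetical. The capacitor-array loss uses the form $P_{SC} = n(N-n)/N^2 \cdot C_{total} V_{DD}^2 f_c$ commonly reported for the SCPA in [2]; that it matches the expression intended in this section is an assumption.

```python
def array_dynamic_loss(n, n_cells, c_total, vdd, f_c):
    """Assumed form from [2]: P_SC = n (N - n) / N^2 * C_total * V_DD^2 * f_c."""
    return n * (n_cells - n) / n_cells**2 * c_total * vdd**2 * f_c


def system_efficiency(p_out, p_loss, p_driver):
    """Equation (1): eta_SE = P_out / (P_out + P_loss + P_driver)."""
    return p_out / (p_out + p_loss + p_driver)


# Illustrative numbers only: 100 mW delivered, 50 mW lost in the switches
# and array, 50 mW spent in the drivers.
print(system_efficiency(0.100, 0.050, 0.050))
```

In this form the array loss vanishes at zero code and at full code and peaks at mid-code, so it mainly penalizes back-off operation, whereas the driver term lowers $\eta_{SE}$ at every code.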
To offer a better intuition of how increasing $f_c$ impacts the design, a 6-bit model is created in which the normalized output power is held constant across $f_c$. The "experiment" uses real MOS device models from the commercial 22-nm FD-SOI PDK. Two cases are tested, at 5 and 28 GHz, for the same target of $P_{out}$ = 19 dBm. The resulting plots of $\eta$ versus $P_{out}$ are shown in Fig. 3(a). At $f_c$ = 5 GHz, $\eta$ obtained from the model is almost 60%, but it reduces to 20% at $f_c$ = 28 GHz. The system efficiency $\eta_{SE}$, which includes all the power required to operate the SCPA, will degrade even further. This level of $\eta_{SE}$ makes the standard SCPA uncompetitive at mmW even when compared to linear amplifiers (e.g., class-B). Next, the driver power is also considered. It can be calculated as in (5), where $C_{driver}$ is the total capacitance along the driver chain. Because the driver is associated with the switching inputs, its power drain is added to the denominator in (1), thereby reducing $\eta_{SE}$. It is thus clear that operating the standard SCPA at mmW would consume substantial power in the driver. It is further noted that the switching might actually be nearly impossible at mmW due to the limited $f_T$ of the devices, meaning that the expected "square-wave" operation would not be possible [23].
As with any TX that includes an RF PA, the design of RFDACs involves tradeoffs between the efficiency (e.g., $\eta$ and $\eta_{SE}$), output power ($P_{out}$), and dimensions of the output stage power device (e.g., $W/L$). To a first order, $P_{out}$ is proportional to $W/L$. However, a larger $W$ does not necessarily lead to increased efficiency, due to the previously discussed parasitic effects. Using our model, we sweep $W$ for carrier frequencies ranging from 2 to 34 GHz. The resulting efficiency is plotted in Fig. 3(b). As predicted, the efficiency decreases as the frequency increases. Also of note is that the efficiency generally peaks for smaller device sizes. This makes sense, as the parasitic power increases with $W$. However, although the efficiency may have an optimal value for a smaller $W$, a smaller $W$ also limits $P_{out}$. This tradeoff is exacerbated at higher frequencies, where maintaining the specified power requires increasing the device size and operating farther from the optimal size for peak efficiency.
Another interesting comparison is the energy usage per cycle, $E_{dc} = \int_0^{1/f_c} P_{dc}(t)\,dt$, as a function of $f_c$. Here, the model is scaled to keep the same $P_{out}$ = 19 dBm. The results are plotted in Fig. 3(c) and clearly show that significantly more energy is spent at higher frequencies to deliver the same $P_{out}$.
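The energy-per-cycle metric can be evaluated numerically from any sampled or modeled supply-power waveform. The sketch below is a generic trapezoidal integration over one carrier period; the waveforms used in the example are synthetic placeholders, not simulation results from the 22-nm model.

```python
import math


def energy_per_cycle(p_dc, f_c, steps=1000):
    """Trapezoidal estimate of E_dc = integral of P_dc(t) dt over one period 1/f_c."""
    dt = (1.0 / f_c) / steps
    total = 0.0
    for k in range(steps):
        total += 0.5 * (p_dc(k * dt) + p_dc((k + 1) * dt)) * dt
    return total


# Synthetic example: a supply power that pulses once per carrier cycle
# around a 0.2-W average (assumed values).
F_C = 28e9
pulsed = lambda t: 0.2 * (1.0 + math.cos(2.0 * math.pi * F_C * t))
print(energy_per_cycle(pulsed, F_C))
```

For a waveform that is periodic in $1/f_c$, the result equals the average supply power divided by $f_c$, which is why a flat energy-per-cycle curve corresponds to power growing linearly with frequency.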
Hence, two problems associated with mmW operation of the SCPA can clearly be identified (provided, of course, that the technology is fast enough to ensure the reliable switching of devices) as follows.
Problem II-1: The efficiency is reduced by the increase in dynamic power consumption caused by the parasitic resistance and capacitance. This is exacerbated when the unit capacitance is comparable to the switch parasitic capacitance.
Problem II-2: More power is consumed to drive the switches at mmW, resulting in significant $\eta_{SE}$ degradation.

B. Proposed Edge Combiner Based on XOR Function Switches
First, considering (2) and (3), $P_{SC}$ is difficult to reduce due to its intrinsic dependence on $C_{total}$, which is tied to the resonant network construction. However, reducing the dynamic losses of the switches and the driver appears more feasible. Prima facie, reducing the switching frequency of the MOS switching devices appears to reduce the power consumption in proportion to the reduction in switching rate. However, irrespective of the final recombination technique used (e.g., edge combiner, parallelism/interleaving, and pipelining), the net rate of effective edges in a switched-mode PA architecture (such as our SCPA) cannot change, as it must equal the carrier frequency, $f_c$. Consequently, $P_{sw}$ would stay the same, assuming that $C_{sw}$ is invariant. For this reason, it is more intuitive to consider the consumed energy per cycle rather than the power: Fig. 5(c) reveals that $E_{SC}$ (thus, $C_{sw}$) stays relatively constant at lower frequencies but then begins to increase sharply in an exponential-like manner. This region of sudden expansion is termed the "knee." Operating below the knee (∼10 GHz) would ensure optimally low energy consumption. The final switch rate of ∼30 GHz is then regained using the proposed edge-combining (EC) frequency multiplication technique, which acts as an interleaver.
The proposed interleaving SCPA architecture is shown in Fig. 4(a). The input LO frequency $f_{LO}$ is reduced from $f_c$ to $f_c/M$, where $M$ is the implicit clock multiplication factor obtained via the proposed edge combining of 1-bit up-converted data. Each SCPA unit cell now switches at $f_c/M$, which is $M$ times lower than in the conventional architecture and is therefore likely to be below the energy knee.
A multiply-by-M SCPA stage is shown in Fig. 4(b). The circuit leverages the same mechanism that allows a digital XOR gate to obtain frequency multiplication. In the two-input XOR gate, if the inputs are at the same frequency but offset in phase, the output will be at 2× the input frequency. Additional branches can be added, which enables higher multiplication factors. The output-matching network can be tuned to be resonant at $f_c$. In the multiplying circuit, there are now M× as many branches as in the conventional SCPA. To a first order, there are now $M$ switching elements, each operating at $f_c/M$; hence, the power consumption should be the same, but the switches now operate below the energy knee of Fig. 3(c). Thus, they can be substantially reduced in size. $P_{sw}$ is now re-defined as in (6), where $C_{sw}$ represents the parasitic capacitance of the modified (smaller) switch for the switching frequency reduced from $f_c$ to $f_c/M$.
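The XOR-based multiplication mechanism can be checked with a small behavioral sketch. It models ideal 50% duty-cycle logic square waves with time normalized to the input period and counts the rising edges of their XOR over one input period. This is an illustration of the logic principle only, not a model of the transistor-level stage in Fig. 4(b); the function names are hypothetical.

```python
def square_wave(x, phase_deg):
    """Ideal 50% duty-cycle logic wave; x is time normalized to the input period."""
    return 1 if ((x - phase_deg / 360.0) % 1.0) < 0.5 else 0


def xor_combine(x, phases_deg):
    """XOR (parity) of equal-frequency square waves with the given phase offsets."""
    out = 0
    for phase in phases_deg:
        out ^= square_wave(x, phase)
    return out


def rising_edges_per_input_period(phases_deg, samples=3600):
    """Count 0->1 transitions of the combined output across one input period."""
    # Sample at interval mid-points so that no sample falls exactly on an edge.
    values = [xor_combine((k + 0.5) / samples, phases_deg) for k in range(samples)]
    # values[k - 1] wraps around at k = 0, which treats the waveform as periodic.
    return sum(1 for k in range(samples) if values[k - 1] == 0 and values[k] == 1)


print(rising_edges_per_input_period([0]))            # single phase
print(rising_edges_per_input_period([0, 90]))        # two phases in quadrature
print(rising_edges_per_input_period([0, 120, 240]))  # three phases, 120 degrees apart
```

With phases of 0° and 90°, the combined edges are evenly spaced at a quarter of the input period, and with 0°, 120°, and 240° at a sixth of it, so the output completes two and three cycles per input cycle, respectively.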
A model of the interleaving SCPA is constructed for comparison with the conventional SCPA. It is simulated for three different frequency multiplication factors, $M$ = 1, 2, and 3, to produce 28 GHz. The simulated efficiency as a function of the normalized output power is plotted in Fig. 5(a). The peak efficiency for the case of $M$ = 2 increases by 8% compared to the conventional ($M$ = 1) case, while for $M$ = 3, it increases by ∼18%. The energy per carrier cycle for the three cases is plotted in Fig. 5(b) and shows great improvement, especially at mmW frequencies.
As noted above, higher values of M are possible with more parallel branches and could lead to further reduction in the power consumption of the output stage at higher carrier frequencies, i.e., beyond the "knee" of Fig. 5(b) where the interleaving overhead cost happens to be lower than the potential energy savings. To operate with M > 3, more phases are required, so the edge combiner (later shown in Fig. 10) would need to be re-designed. This would lead to an additional overhead in energy consumption for the generation and distribution of the phases, which reverses some of the power reduction in the output stage.
The benefits of the SCPA interleaving also extend to the drivers, as they individually switch at the same lower frequency, $f_c/M$, which allows their size to be reduced. Equation (6) can be extended in a similar manner to the driver network, as in (7), where $C_{driver}$ is the total capacitance in the single driver chain. Due to the operation at a lower frequency, the driver chain's transistor width can now be optimally sized for both energy usage and edge rate. This allows for more efficient operation, as explained above in conjunction with operating below the energy knee of Fig. 5(b).

C. Proposed Edge-Combining SCPA Architecture
To realize the aforementioned advantage of frequency multiplication via interleaving within the output stage of the mmW SCPA, the resulting architecture is drawn in Fig. 6(a). A three-way edge combiner (EC1) is used to implicitly multiply the input frequency by 3×. Hence, the input LO frequency is $f_c/3$. The digital baseband data are mixed logically with the carrier sub-harmonic to generate the modulated signal, which then passes through the buffer stage in order to drive the edge-combining switch-cap unit cell EC1.
It is noted that in the case of an output frequency of $f_c$ ≈ 27 GHz, the distribution of the input clock waveform $f_{LO}$ ≈ 9 GHz can still be cumbersome. Hence, a further power reduction is obtained by applying the edge combining to the clock generation and distribution. This is shown in Fig. 6(b). The input $f_{LO}$ frequency multiplication by 3× uses an edge combiner EC1 followed by a delay-locked loop (DLL). The details will be provided in the following. In this scenario (and still maintaining $M$ = 3), the clock power consumption is given by (8), where $C_{clk}$ is the total capacitance in the single clock distribution chain. Just as in the previous cases, operating at ∼3 GHz in each chain allows the device sizes to be reduced, for an overall net reduction in the entire $C_{clk}$.
To accomplish complex modulation, one EC-SCPA is dedicated to the in-phase (I) data path and another to the quadrature (Q) path. The power combiner at the end merges their two outputs. The operational block diagram is presented in Fig. 7. The global injection-locked oscillator generates multi-phase signals at frequency $f_1$: $P_{1,2,3}$ for the I path and $S_{1,2,3}$ for the Q path, as shown in Fig. 7(a). The phase separation between the corresponding S and P signals is $\phi_1$ (90°/9 = 10°, in this case). The three phases, $P_1$ ($S_1$), $P_2$ ($S_2$), and $P_3$ ($S_3$), are each shifted 120° relative to one another. Each SCPA slice has a local DLL, into which P and S are directly injected to generate the tri-phase signals $PD_{1,2,3}$ and $SD_{1,2,3}$, as shown in Fig. 7(b). Due to the frequency multiplication, the phase is advanced by the multiplication factor of 3×. All output swings are between the full rail-to-rail levels of 0 and $V_{DD}$. Signals PD and SD serve as the inputs of the logical mixer to up-convert the baseband data and to drive the EC-SCPA unit cell [see Fig. 7(c)].
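The phase bookkeeping through the multiplier chain follows from the fact that an ideal frequency multiplier scales phase by the same factor as frequency. A minimal sketch of that arithmetic, with a hypothetical helper name, is:

```python
def phase_after_multiplication(phase_deg, factor):
    """Phase offset after an ideal frequency multiplication, wrapped to [0, 360)."""
    return (phase_deg * factor) % 360.0


PHI_1 = 10.0  # I/Q offset at f1 quoted in the text (90 degrees / 9)

print(phase_after_multiplication(PHI_1, 3))  # offset after the first tripling
print(phase_after_multiplication(PHI_1, 9))  # offset after both triplings
```

A 10° offset at $f_1$ therefore becomes 30° after the first tripling and the required 90° quadrature offset after the overall multiplication by 9.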
Because the proposed XOR frequency-tripler circuit requires dual-level inputs for pMOS and nMOS, high-speed level shifters are designed to shift the three-phase signal swing from 0 to $V_{DD}$ up to $V_{DD}$ to $2 \times V_{DD}$ [see Fig. 7(c)] to drive the pMOS switches.
To obtain aligned phases for both pMOS and nMOS switches, buffer chains are added to the nMOS driver to correct any delay mismatch. After mixing with the data, the dual-level three-phase signals drive the EC switches and triple the $f_2$ signal to the $f_3 = 3 \times f_2$ carrier frequency [see Fig. 7(d)]. The details of the system implementation are presented next.

III. CIRCUIT IMPLEMENTATION
The top-level full-chip block diagram of the proposed Ka-band switched-capacitor RFDAC is shown in Fig. 8. The digital baseband I/Q data are generated off-chip by a digital pattern generator in an FPGA and input to the IC chip via high-speed LVDS buffers. The RFDAC is designed with 2 × 6-bit resolution to support modulation of up to 64-QAM. The choice of 6-bit resolution was driven by the SNR requirements of 64-QAM modulation while minimizing the capacitor array size. To maximize the RFDAC linearity, it is realized as a fully unary-weighted converter, which helps to minimize both AM-AM and AM-PM distortions. The data are input to the chip in binary format; hence, a 6-to-63 binary-to-thermometer encoder is used to control the individual RFDAC slices. Each slice is enabled/disabled via a logical AND with the local DLL-generated clock; hence, the AND gate serves as a mixer of the data with the sub-harmonic carrier clock, thus performing the first-stage up-conversion of the baseband data. Subsequently, a vector of these ON-OFF modulated clock waveforms is buffered to be strong enough to drive the output stage.
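The data path described above, binary input, thermometer encoding, and per-slice gating of the clock, can be sketched behaviorally as follows. This is a logic-level illustration under the assumption of an ideal 6-to-63 encoder and ideal AND gating; the function names are hypothetical and no timing or analog behavior is modeled.

```python
def binary_to_thermometer(code, bits=6):
    """Return 2**bits - 1 unary enable bits with the first `code` entries set."""
    n_cells = (1 << bits) - 1
    if not 0 <= code <= n_cells:
        raise ValueError("code out of range for the given resolution")
    return [1 if i < code else 0 for i in range(n_cells)]


def gate_clock(enable_bits, clk_level):
    """Logical AND of every slice enable with the instantaneous clock level (0 or 1)."""
    return [enable & clk_level for enable in enable_bits]


enables = binary_to_thermometer(37)
print(len(enables), sum(enables))    # number of slices, number of enabled slices
print(sum(gate_clock(enables, 1)))   # slices driven high while the clock is high
print(sum(gate_clock(enables, 0)))   # slices driven high while the clock is low
```

Because every enabled slice carries an identical unit weight, each code step adds exactly one switching cell, which is the basis of the linearity benefit attributed to the unary-weighted array.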
The chip receives a differential reference clock at $f_1 = f_{LO}$ = 2.5-3.5 GHz, which is ultimately multiplied by 9 to reach the mmW carrier frequency range. The reference is injected into the global oscillator, which is a six-stage differential ring oscillator (RO); hence, it has 12 unique output phases. These 12 phases are required so that the correct phases are available for quadrature operation after the frequency multiplication. The proper output phases are selected and fed to the local in-slice DLLs. The edge combining at the input of the DLL network increases the injected clock signal frequency to $f_2 = 3 \times f_{LO}$. The output of each local in-slice DLL is fed to the corresponding RFDAC bit-slice unit cell, which contains the logic mixer, the driver, and the EC output stage tripler that brings the frequency from $f_2$ to $f_3 = 9 \times f_{LO}$ at the RFDAC output pad (see Fig. 8). The capacitor bank is embedded in the impedance-matching network, and the on-chip power-combining transformer (PCT) merges the individual I/Q signal branches. The PCT is chosen to be a parallel-parallel topology, which reduces imbalance in the structure and aids in increasing the isolation between the input channels [28], [29]. The details of the individual blocks are discussed next.
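The frequency plan implied by the two cascaded triplers can be tabulated with a short sketch. It only multiplies the quoted reference range by the stated factors; the helper name and dictionary keys are illustrative choices.

```python
def frequency_plan(f_lo_hz, dll_factor=3, output_factor=3):
    """Frequencies at the reference (f1), DLL output (f2), and RFDAC output (f3)."""
    f2 = dll_factor * f_lo_hz
    f3 = output_factor * f2
    return {"f1": f_lo_hz, "f2": f2, "f3": f3}


for f_lo in (2.5e9, 3.0e9, 3.5e9):
    plan = frequency_plan(f_lo)
    print({name: value / 1e9 for name, value in plan.items()})  # values in GHz
```

The quoted 2.5-3.5-GHz reference range thus maps to 7.5-10.5 GHz at the DLL outputs and 22.5-31.5 GHz at the output pad.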

A. Output-Matching Network
The output-matching network consists of the PCT and the capacitor array, and it provides the impedance transformation from the fixed $Z_{load}$ = 50-Ω output load to the optimal termination impedance $Z_{opt}$ of the output stage switches. First, the EC-SCPA is optimized for the case of maximum output power, $P_{out,max}$, when all unit cells are ON. The optimal resistance is defined as in (9) [2]. Using the value found in (9) and assuming that the network is approximately series-resonant, the loaded quality factor of the network, $Q_{load}$, can be calculated as in (10), where $C_{total}$ represents the total array capacitance of the EC-SCPA. In the presented design, $Z_{opt}$ = 4 Ω; hence, $Q_{load}$ is approximately 3.4 and $C_{total}$ is approximately 437 fF. From this, the equivalent series inductance $L$ that resonates out the array capacitance can be found from (11). This inductance can be embedded in the transformer, as shown in the schematic of Fig. 9. The matching network is a bandpass filter that converts $Z_{opt}$ to $Z_{load}$, as illustrated in the accompanying Smith chart. Note that $Q_{load}$ is limited by the quality factor of the on-chip PCT, $Q_{xfmr}$, which varies from 11 to 18. To find the optimal capacitance for maximum efficiency, the switch size can also be embedded in the optimization procedure, as highlighted in [30].
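The quoted design values can be cross-checked with the textbook relations for a series-resonant RLC network, $Q_{load} = 1/(2\pi f_c C_{total} R_{opt})$ and $L = 1/((2\pi f_c)^2 C_{total})$. That these are the exact forms used in this section is an assumption, as is the evaluation frequency of 27 GHz, which is taken from the $f_c$ ≈ 27 GHz figure quoted in Section II.

```python
import math


def loaded_q(f_c, c_total, r_opt):
    """Series-resonant loaded Q: 1 / (2 * pi * f_c * C_total * R_opt)."""
    return 1.0 / (2.0 * math.pi * f_c * c_total * r_opt)


def resonant_inductance(f_c, c_total):
    """Series inductance that resonates with C_total at f_c."""
    return 1.0 / ((2.0 * math.pi * f_c) ** 2 * c_total)


F_C = 27e9        # assumed evaluation frequency (Hz)
C_TOTAL = 437e-15 # array capacitance quoted in the text (F)
R_OPT = 4.0       # optimal termination quoted in the text (ohm)

print(loaded_q(F_C, C_TOTAL, R_OPT))
print(resonant_inductance(F_C, C_TOTAL) * 1e12, "pH")
```

Under these assumptions the loaded Q evaluates to roughly 3.4, in line with the quoted value, and the resonating inductance is on the order of 80 pH, which is plausible for absorption into the transformer.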

B. Edge-Combining Frequency-Tripling XOR Gate
The schematic of the proposed frequency-tripling XOR-based edge combiner is shown in Fig. 10. Compared to the conventional SCPA switching circuit, where only two phases are required for the operation (e.g., ON/OFF), the tripling circuit requires three phases, as shown in Fig. 7(c). To reduce the switching losses, the closing of the pMOS switches should be in phase with the opening of the nMOS switches and vice versa. The layout parasitic capacitance should be minimized to reduce its effect on the phase mismatch; otherwise, it will increase dynamic power losses due to a so-called "crow-bar" current. In the EC tripler, the switching frequency of each individual transistor is $f_{out}/3$. Although it is easier to meet the specifications at the reduced switching speed, the proper choice of devices is still important. In this design, super-low threshold voltage transistors in a deep n-well ("SLVT-deep-NW") are used for device isolation and for the ability to control the back-gate bias. Control of the back-gate bias allows the switching speed to be increased by further reducing the threshold voltage of the SLVT devices.
Because there are three potential closure paths for both the pull-up and pull-down branches, precise phase accuracy is required to ensure minimal crow-bar current. The function of the tripling XOR gate is given in (12), which can be rewritten as (13). For layout compactness, (13) can be used to find an Euler path. This allows the layout area to be reduced by minimizing the number of active diffusion breaks [31]. The abstract Euler path diagram for the pMOS and nMOS networks is shown in Fig. 10 (bottom). All devices within the nMOS and pMOS networks are matched following the Euler path. Hence, the edge-combining circuit is laid out in the most compact form, which is beneficial for mmW operation. In addition, the location of $MN_{A/B/C}$ and $MP_{A/B/C}$ eases the layout of the input phase signals to better balance the resistive parasitics, which can further minimize the phase mismatch between A/B/C for both the pMOS and nMOS networks.
Fig. 11. Schematic of the bit-slice logic mixer, level shifter, and driver.

C. Bit-Slice Logical Mixer, Level Shifter, and Driver
It is important to drive the EC-SCPA output stage with square-wave signals that are appropriately phase-aligned to maximize the efficiency. The driver slice operates at $f_{out}/3$ and is powered from 0.9 V. Each bit slice has three branches supporting A = 0°, B = 120°, and C = 240°, as shown in the schematic of Fig. 11. The tri-phase clocks are applied to their respective NOR gates, which act as a logical mixer to mix the LO clock with the single unary bit input. This dictates whether the cell switches or is grounded. In the EC-SCPA output stage, the pMOS transistors operate between $V_{DD}$ and $2V_{DD}$; therefore, level shifters are added to bring this path to the appropriate voltage domain. To reduce the power consumption, the devices in the NOR gate and level shifter are sized using logical effort scaling to optimize the power versus delay tradeoff. Delay must be carefully balanced in both the nMOS and pMOS paths. The buffer chains are carefully designed with post-layout extraction to ensure that all delay lines are correct, which minimizes the mismatch effect. This single bit slice is replicated into an array with 64 elements and drives, on a bit-by-bit basis, the 64 EC frequency-tripling switches.

D. Coupled DLL Network
The carrier clock must be evenly distributed to all 64 slices in the RFDAC core. At lower operational frequencies, this is typically accomplished with a standard H-tree. However, at mmW, this becomes extremely challenging as the carrier clock must arrive synchronously at each slice; also, the power consumption for the global clock distribution would increase. Our proposed solution is to generate carrier clocks locally in each slice via DLLs that are mutually coupled. The DLL is based on an RO as it provides the required phases and is fairly compact [32] to fit in the allotted place.
A standard differential three-delay-stage RO-based DLL is designed with injection locking based on the EC technique. As demonstrated in Section II, to further reduce the power dissipation without distributing the clock, the input reference of the DLL is chosen at $f_{out}/9$ with a 50% duty cycle and a swing of 0 to $V_{DD}$. The input is then applied to the EC stage before interpolating and generating $CLK_{0°}$, $CLK_{120°}$, and $CLK_{240°}$ at $f_{out}/3$. All of these circuits operate at $V_{DD}$ = 0.9 V.
The mmW RFDAC requires 64 LO clock generators, each providing the tri-phase clocks CLK 0°, CLK 120°, and CLK 240°, to drive the 64 RFDAC core unit cells. P_out will be degraded if the switching of any unit cell is not synchronous or has a poor edge slope. To overcome this challenge, a matrix of coupled DLLs is chosen, as shown in Fig. 12. Here, the coupling elements used for synchronization are resistors. The resistor value is chosen in the range of 20-80 Ω, which is calculated based on the Kuramoto model [33]. Because a matrix coupling network is used, the input injection signal must be strong enough to lock all 64 DLL units. In this way, all DLLs are synchronized via injection locking, as described by the Kuramoto model. Simulations validated that all tri-phase signals (i.e., CLK 0°, CLK 120°, and CLK 240°) are generated with sufficient accuracy to meet the output power, efficiency, and linearity requirements. Any residual phase errors primarily reduce the output power and efficiency, as they result in pulsewidth distortion and crowbar currents. Secondarily, they can modulate the switch impedance, which can impact the linearity. The clock generation was validated across PVT to guarantee optimal output power and efficiency, which is also sufficient to guarantee linearity. In simulations across corners, a mismatch of <1 ps was observed across all rising and falling edges of the DLL outputs, and the output duty cycle was maintained between 48.9% and 50.1%.
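To give a feel for how a coupled matrix pulls its elements into phase, the following is a minimal sketch of the textbook Kuramoto model on an 8 × 8 nearest-neighbor grid. It is not the design procedure of [33]: the grid topology, the dimensionless coupling strength, the detuning spread, and the initial phase spread are all assumptions chosen for illustration, and no mapping to the 20-80 Ω resistor value is attempted.

```python
import cmath
import math
import random


def neighbors(i, rows, cols):
    """Indices of the up/down/left/right neighbors of node i on a grid."""
    r, c = divmod(i, cols)
    out = []
    if r > 0:
        out.append(i - cols)
    if r < rows - 1:
        out.append(i + cols)
    if c > 0:
        out.append(i - 1)
    if c < cols - 1:
        out.append(i + 1)
    return out


def order_parameter(theta):
    """Kuramoto order parameter r in [0, 1]; r = 1 means perfect phase sync."""
    return abs(sum(cmath.exp(1j * t) for t in theta)) / len(theta)


def simulate(rows=8, cols=8, coupling=1.0, detune=0.01,
             dt=0.05, steps=2000, seed=1):
    """Forward-Euler integration of
    d(theta_i)/dt = omega_i + K * sum_j sin(theta_j - theta_i)
    over nearest neighbors j. Returns (initial r, final r)."""
    rng = random.Random(seed)
    n = rows * cols
    theta = [rng.uniform(-1.2, 1.2) for _ in range(n)]        # assumed start-up spread
    omega = [rng.uniform(-detune, detune) for _ in range(n)]  # assumed mismatch
    nbrs = [neighbors(i, rows, cols) for i in range(n)]
    r_start = order_parameter(theta)
    for _ in range(steps):
        rate = [omega[i] + coupling * sum(math.sin(theta[j] - theta[i])
                                          for j in nbrs[i])
                for i in range(n)]
        theta = [t + dt * d for t, d in zip(theta, rate)]
    return r_start, order_parameter(theta)


if __name__ == "__main__":
    r_start, r_end = simulate()
    print(f"order parameter: start {r_start:.3f}, end {r_end:.3f}")
```

In this toy model, the order parameter rises toward 1 as the coupling overcomes the small frequency mismatch, which is the qualitative behavior the coupled DLL matrix relies on.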

E. Separated Partition Layout
Floor planning plays an important role in any RFDAC design and becomes critical at mmW. Careful layout is necessary to ensure that the cells of the EC-SCPA operate in-phase and do not suffer degradation due to transmission-line (TL) effects. At lower frequencies (e.g., sub-6 GHz), the wavelength is long relative to the size of the arrays in an RFDAC, but at mmW, this is not necessarily true. The distributed effects of the TLs can cause amplitude variation across the physical dimension of the array. For designs at this frequency, the lengths of the electrical connections at the same potential are explicitly kept to L < λ/8. In addition, the interconnect from the array to the PCT must be accounted for, as impedance variation due to the TL effects transforms Z_opt, which can otherwise reduce P_out and η_SE.
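The L < λ/8 rule can be turned into a rough length budget. The sketch below assumes a quasi-TEM line with an effective relative permittivity of 4.0, a typical order of magnitude for an oxide back-end stack; that value is an assumption and is not taken from this design.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def max_segment_length_m(f_hz, eps_eff, fraction=8):
    """Longest interconnect allowed by the L < lambda/fraction rule.

    eps_eff is the effective relative permittivity seen by the line
    (assumed value; it depends on the metal stack and geometry).
    """
    wavelength = C0 / (f_hz * math.sqrt(eps_eff))
    return wavelength / fraction


if __name__ == "__main__":
    limit = max_segment_length_m(28e9, eps_eff=4.0)
    print(f"lambda/8 at 28 GHz: {limit * 1e3:.2f} mm")
```

Under this assumption, the budget at 28 GHz is well below 1 mm, which is comparable to the dimensions of a large capacitor array and motivates the partitioning.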
In sub-6-GHz SCPAs, the capacitor units were placed in the same direction in relatively large (dimensionally) arrays. This ensured the best linearity [2], [34]. However, applying this layout to mmW designs is not feasible due to the TL effects on the top-plate connection of the capacitors, which would cause voltage variation across the plate. Each small section of TL is modeled with an equivalent RLC network, as shown in Fig. 13(b). The TL effects can shift the resonant frequency and degrade the performance of the EC-SCPA. To solve this problem, the array is segmented into compact partitions, and the top plate is centrally connected. This minimizes the physical distance between capacitors, thus minimizing the parasitic inductances [35]. The 64-unary-bit array is divided into four partitions, each containing 16 cells. To connect the four partitions, a double TX line with a ground shield is designed to isolate the mmW TX line from the clock distribution network. Fig. 13(b) shows a 3-D representation of the layout, with the output interconnect in pink and the tapped interconnects in blue.
In addition to the interconnect, the quality (Q) factor of the unit capacitor is of concern. As with inductors, the Q factor of capacitors can be degraded at mmW and thus impact the efficiency if care is not taken. The PDK p-cell capacitors are too dense, and their layout results in a low component Q factor; hence, a custom AP-MoM unit capacitor is designed to obtain a high Q factor, as shown in the inset of Fig. 13. The custom MoM capacitor uses metal C1 only as a shield and builds the capacitance from the higher metal layers C3, C4, and C5 to obtain a unit capacitance of 7.5 fF. Its simulated Q_C ≈ 58 is high enough not to degrade the matching network losses. Furthermore, to minimize the TL effects, the top-plate connection is designed in the highest metal layer with a taper.
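The quoted Q factor can be related to an equivalent series resistance (ESR). The sketch below assumes a simple series R-C model of the unit capacitor and assumes that Q_C is evaluated at 28 GHz; the text does not state the evaluation frequency.

```python
import math


def capacitor_reactance_ohm(f_hz, c_farad):
    """Magnitude of the capacitive reactance, 1 / (2*pi*f*C)."""
    return 1.0 / (2.0 * math.pi * f_hz * c_farad)


def capacitor_esr_ohm(f_hz, c_farad, q_factor):
    """ESR of a series R-C model with Q = X_C / R (assumed model)."""
    return capacitor_reactance_ohm(f_hz, c_farad) / q_factor


if __name__ == "__main__":
    f = 28e9          # assumed evaluation frequency
    c_unit = 7.5e-15  # unit capacitance from the text
    q_c = 58          # simulated Q from the text
    print(f"|X_C| = {capacitor_reactance_ohm(f, c_unit):.0f} ohm")
    print(f"ESR   = {capacitor_esr_ohm(f, c_unit, q_c):.1f} ohm")
```

Under these assumptions, each 7.5-fF unit presents roughly 760 Ω of reactance and about 13 Ω of series loss; with many units switched in parallel, the aggregate series resistance scales down accordingly.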

IV. MEASUREMENT RESULTS
A prototype of the fully integrated Ka-band SC-RFDAC was fabricated in a 22-nm FD-SOI process with 11 metal layers (including thick and ultra-thick layers). The entire chip occupies 1.5 × 2 mm², including three pads for GSG RF probing, as shown in Fig. 14(a). To reduce noise coupling between the analog/RF and digital domains (high-speed data I/O and decoders), they are isolated via an undoped isolation ring (e.g., the "BFMOAT" guard ring in our process PDK). Such an isolation ring has the native doping of the bulk substrate, which is typically p−. The RFDAC core draws high current but requires a very stable power supply; hence, eight supply pads are used, well distributed around the two RFDAC cores.
As shown in Fig. 13, each of the two RFDAC cores is split into four partitions. Each partition contains an array of 16 unary bit slices. All circuits besides the EC output stage operate at V_DD = 0.9 V, while the power supplies for the RFDAC output stage and drivers are set at V_DD2 = 1.8 V. This accounts for the pMOS devices, which operate with level shifting between V_DD and V_DD2.

A. Measurement Setup
The prototype chip is directly mounted on a PCB as "chip-on-board," in which all pads, except for the mmW output, are wirebonded. The entire assembly is placed on a Cascade M150 probe station, as shown in Fig. 14. The generated mmW output at the GSG pads is wafer-probed via an MPI TITAN-T40A probe with a maximum insertion loss of −0.4 dB from 27 to 33 GHz. The probe output is then connected to a high-performance cable with a −1.4-dB loss before it is fed to the spectrum analyzer and the high-speed real-time oscilloscope for observation. The carrier clock input is provided by a Keysight N5173B analog signal generator, which can support up to 20 GHz. The single-ended input is set between 2.5 and 3.5 GHz, with a default of 3.11 GHz. It is converted to a differential signal via a Marki Microwave balun and passed through programmable phase shifters (not shown) and a bias tee that adjusts its common-mode level. This helps to ensure proper injection locking of the on-chip oscillator. The baseband digital data are generated by the FPGA Ultrascale+ VCU, which is synchronized by a 10-MHz clock reference from the SMA100A that also feeds the input LO for synchronization.

B. Static Measurements
The measured output power P_out at 27.9 GHz as a function of amplitude codeword (ACW), with the corresponding drain efficiency η as a function of P_out, is shown in Fig. 15(a) and (b). The peak P_out is 21.2 dBm at the maximum ACW. At the peak P_out, the drain efficiency is η = 36.7%, while at a 6-dB back-off, η ≈ 19%. The system efficiency, which incorporates the power losses of the various blocks in the clock carrier distribution, is of paramount importance. The efficiency is broken down by the cumulative contributions and plotted as a function of ACW and P_out in Fig. 15(c) and (d), respectively, to identify the primary sources of degradation relative to the drain efficiency, which only includes the output stage. The power consumed by the driver causes the primary degradation in η_SE: the peak η_driver (clock buffering and clock generation are not yet included) is ≈ 27.6%. When the remaining system blocks (e.g., clock buffering, digital logic, and clock generation) are included, the total peak efficiency η_SE falls to 22%. It should be noted that this RFDAC is not just a PA but also provides the DAC and mixer functionality, thus comprising a full transmitter front end; hence, its η_SE is very competitive, even when compared to standalone PAs. The static level of the RF output voltage versus ACW is plotted in Fig. 15(e). Its linearity can be quantified in terms of the integral and differential nonlinearity (INL and DNL), as shown in Fig. 15(f) and (g). While delivering the peak output power of 21.2 dBm (∼7 V_pp) at 27.9 GHz, the INL and DNL achieve excellent performance of +2/−1 LSB and +0.5/−0.5 LSB, respectively, chiefly due to the unit-weighted arrangement.
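The cumulative efficiency figures can be converted into an approximate dc power budget at peak output. The sketch below assumes that η_driver denotes the efficiency with the output stage and driver both counted, that all three efficiencies refer to the same 21.2-dBm operating point, and that the output is a sinusoid into a 50-Ω load; these readings are assumptions, and the variable names are hypothetical.

```python
import math


def dbm_to_mw(p_dbm):
    """Convert power from dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def dc_power_mw(p_out_mw, efficiency):
    """DC power implied by an output power and a (cumulative) efficiency."""
    return p_out_mw / efficiency


P_OUT_MW = dbm_to_mw(21.2)

p_stage = dc_power_mw(P_OUT_MW, 0.367)              # output stage only
p_stage_plus_driver = dc_power_mw(P_OUT_MW, 0.276)  # output stage + driver
p_system = dc_power_mw(P_OUT_MW, 0.22)              # all blocks

p_driver = p_stage_plus_driver - p_stage            # driver share
p_rest = p_system - p_stage_plus_driver             # clocking and digital logic

# Peak-to-peak voltage of a sinusoid delivering P_OUT into an assumed 50 ohm
v_pp = 2 * math.sqrt(2 * (P_OUT_MW / 1000) * 50.0)

if __name__ == "__main__":
    print(f"P_out            : {P_OUT_MW:6.1f} mW")
    print(f"output stage dc  : {p_stage:6.1f} mW")
    print(f"driver dc        : {p_driver:6.1f} mW")
    print(f"remaining blocks : {p_rest:6.1f} mW")
    print(f"V_pp into 50 ohm : {v_pp:6.2f} V")
```

Under these assumptions, the output stage draws about 360 mW, the driver and the remaining blocks each add roughly 120 mW, and the computed swing of about 7.3 V is consistent with the ∼7 V_pp quoted above.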

C. Dynamic Measurements
To validate the claimed performance potential for modern wireless communication systems (e.g., 5G), several non-constant-envelope modulated signals are applied to evaluate the dynamic linearity.
The proposed RFDAC is measured using a 64-QAM signal across different modulation bandwidths (BW) at a center frequency of 27.9 GHz. The measured results are shown in Fig. 16. The EVM for BW = 100/200/400 MHz is, respectively, 3.15%/3.38%/3.32% rms. The respective average output power is 15.3/14.9/14.7 dBm, while the average system efficiency is 19.6%/18.8%/18.1%. All measurements are taken without any DPD. The performance remains stable up to BW = 400 MHz, which is limited by the output rate of the FPGA that acts as the pattern generator. The digital pattern is re-sampled on-chip to the input clock rate (i.e., f_c/9), which results in spectral images at this offset frequency from the carrier f_c. Though not implemented in this prototype, spectral images can be further mitigated using additional digital interpolation or analog linear interpolation [36].
The measured results for the 16-QAM (1.6 Gb/s) and 64-QAM (2.4 Gb/s) signals are shown from 28 to 30 GHz in Fig. 17. The output is tuned to 27.9 GHz; hence, the performance is best at the center frequency. There is a slight rotation of the constellations, which results from the measurement instrument not fully nulling out the static phase offset. This is captured in the measured EVM. The measured EVM and efficiency remain good even when offset from the optimal tuning point. Although we were not able to increase the bandwidth beyond 400 MHz due to the limitations of our FPGA, we anticipate that memory effects caused by the supply network and output-matching network would eventually limit the data rate. These are similar to the mechanisms that affect all PAs. We note that, as with other PAs, these effects can typically be mitigated with DPD.
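The quoted data rates and EVM values can be cross-checked with a few lines of arithmetic. The sketch below assumes that the symbol rate equals the stated modulation bandwidth (400 MSym/s for BW = 400 MHz) and that no coding or pulse-shaping overhead is counted; both are assumptions, and the function names are hypothetical.

```python
import math


def bits_per_symbol(qam_order):
    """Bits carried by one symbol of a square M-QAM constellation."""
    return (qam_order - 1).bit_length()


def data_rate_bps(qam_order, symbol_rate_hz):
    """Raw bit rate = bits per symbol x symbol rate (no overhead assumed)."""
    return bits_per_symbol(qam_order) * symbol_rate_hz


def evm_percent_to_db(evm_percent):
    """Convert rms EVM from percent to decibels."""
    return 20 * math.log10(evm_percent / 100)


if __name__ == "__main__":
    symbol_rate = 400e6  # assumed equal to the 400-MHz modulation bandwidth
    for order in (16, 64):
        rate = data_rate_bps(order, symbol_rate)
        print(f"{order}-QAM: {rate / 1e9:.1f} Gb/s")
    for evm in (3.15, 3.38, 3.32):
        print(f"EVM {evm:.2f}% rms = {evm_percent_to_db(evm):.1f} dB")
```

With these assumptions, 16-QAM and 64-QAM at 400 MSym/s give 1.6 and 2.4 Gb/s, matching the rates quoted above, and the measured EVM values correspond to roughly −30 dB.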

V. DISCUSSION AND CONCLUSION
This article presents the first-ever Ka-band switched-capacitor power-amplifier (SC-PA)-based RFDAC using an edge-combining technique. The proposed frequency tripler embedded in the output stage allows the switching transistors to operate properly at mmW frequencies while maintaining good efficiency and linearity at >20-dBm output power. A comparison to prior art is shown in Table I. The work compares favorably to [19], which was an RF power-DAC operating at a similar frequency and resolution. It also compares favorably to the mmW outphasing [17], [37], [38], mmW Doherty [39], [40], and linear mmW PAs [41], [42], noting that the proposed implementation delivers high output power while including the DAC+mixer functionality, whereas the other designs include fewer features. The presented approach is also promising for other mmW hard-switching architectures, as frequency multiplication can be embedded into the output stages of Doherty and envelope elimination and restoration transmitters.