Terahertz Integrated Circuits and Systems for High-Speed Wireless Communications: Challenges and Design Perspectives

This paper presents challenges and design perspectives for terahertz (THz) integrated circuits and systems. THz means different things to different people. From the International Telecommunication Union (ITU) perspective, THz radiation primarily means the frequency range from 300 to 3000 GHz. Recently, however, a more expansive definition of THz has emerged that covers frequencies from 100 GHz to 10 THz, which includes both the sub-THz range (100 to 300 GHz) and the ITU-defined THz frequencies. This definition is now commonly used by communication theorists, and since this paper is intended for readers with a wide variety of expertise in system and circuit design, we have adopted the latter definition. The paper brings into the open the unmitigated shortcomings of conventional transceiver architectures for multi-gigabit-per-second wireless applications, unfolds challenges in designing THz transceivers, and provides pathways to address these impediments. Furthermore, it goes through design challenges and candidate solutions for key circuit blocks of a transceiver, including front-end amplifiers, the local oscillator (LO) circuit and LO distribution network, and antennas intended for frequencies above 100 GHz.


I. INTRODUCTION
THE EXPANDED definition of the terahertz (THz) band, from 100 GHz to 10 THz, has become part of the professional and public consciousness due to the emergence of exciting applications including active and passive sensing/imaging as well as forthcoming generations of high data-rate wireless communications [1]. In the area of wireless communications, which is the scope of this article, mobile networks with nomadic distributed base-stations using unmanned aerial vehicles (UAVs) are expected to become progressively more prevalent in future society as a complementary part of ever-evolving wireless networks, connecting billions of people across the globe and an even higher number of immobile/mobile cyber devices scattered in the environment. At the same time, the data rate supported by mobile devices keeps increasing with the deployment of next-generation networks, e.g., 5G, and upgrades of existing infrastructures. Consequently, enormous amounts of data traffic will be generated on a daily basis, and data exchange between base stations and the backbone network through conventional backhaul links will quickly become a bottleneck. On the other hand, the operation of mobile terminals continues to be hindered by ever-increasing interference problems in a congested environment. Fig. 1, from [2], shows the data-rate growth over the years for three communication protocols, namely, cellular, WLAN, and short wireline links, projecting linear growth fueled by user demand.
This plot implies that the continuing growth of the world's population, together with worldwide access to the Internet and the general public tendency to use bandwidth-intensive applications, is a major driving force for enhancement (and revamping) of wireless infrastructure so as to meet these demands. Indeed, the COVID-19 pandemic during the 2019-2022 period and the explosion of online video-communication services have only accelerated this demand. The end-users' need for wider bandwidth calls for a paradigm shift in the way wireless infrastructure is being designed and deployed to enable spectrally efficient wireless communication [2], [3]. Along with the demand for more bandwidth comes the desire for more computing and storage resources provided by large-scale data farms. The wired networks in such data centers face severe over-subscription and hot-spot problems [4], [5]. On-demand flexible wireless links can greatly alleviate these issues.

FIGURE 2. A generic distributed network comprising small cells and local base-stations that can also act as relay nodes.

One way of meeting the need for reliable high data-rate connectivity involves the deployment of distributed base stations with a massive number of antennas (>100) providing high-speed wireless access to multiple co-channel users (Fig. 2). Considered as an alternative to the currently used centralized wireless networks, each distributed base station in this scenario acts as a relay node within a large relay network, as shown in Fig. 2. To establish a point-to-point or point-to-multi-point line-of-sight wireless link, the relay nodes should exchange large data volumes rapidly, justifying an essential need for 50+ Gbps transceivers.
The sparse utilization of mm-Wave frequencies from 30 to 300 GHz, especially the high side of this range from 100 to 300 GHz, has motivated research and development teams across the globe to investigate future wireless communication networks for data rates beyond what is achievable by 5G [6]-[37]. At first glance, it may appear that operating in the THz band should alleviate an important design concern associated with conventional wireless links, namely, the demand for very sophisticated modulation schemes (e.g., 1024 or higher-order quadrature amplitude modulation (QAM)) in order to boost the communication speed at the commonly used RF carrier frequencies (i.e., ≤ 10 GHz). As an example, binary phase-shift keying (BPSK) modulation of high-speed data with 5-GHz bandwidth on a 120-GHz carrier frequency can potentially yield a 5 Gbps data rate. This means that operation in the THz frequency range can provide a wide RF spectrum with a fairly small fractional bandwidth (e.g., ∼10%), which is quite attainable by integrated transceivers fabricated in standard silicon technologies.
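The BPSK example above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes an idealized link in which the symbol rate equals the occupied RF bandwidth (one symbol per second per hertz), which is an assumption made here rather than a statement from the paper.

```python
import math


def fractional_bandwidth(bandwidth_hz, carrier_hz):
    """Ratio of absolute RF bandwidth to carrier frequency."""
    return bandwidth_hz / carrier_hz


def ideal_data_rate_bps(bandwidth_hz, modulation_order):
    """Idealized data rate, assuming symbol rate == RF bandwidth (assumption)."""
    bits_per_symbol = math.log2(modulation_order)
    return bandwidth_hz * bits_per_symbol


# BPSK (order 2) with 5 GHz of bandwidth on a 120 GHz carrier.
bw_hz, carrier_hz = 5e9, 120e9
print(f"fractional bandwidth: {100 * fractional_bandwidth(bw_hz, carrier_hz):.1f} %")
print(f"BPSK data rate: {ideal_data_rate_bps(bw_hz, 2) / 1e9:.1f} Gbps")

# The same 5 Gbps with 1024-QAM needs only 0.5 GHz of bandwidth under the same
# assumption, but demands far more SNR, linearity, and converter resolution.
print(f"1024-QAM bandwidth for 5 Gbps: {5e9 / math.log2(1024) / 1e9:.2f} GHz")
```

Under these assumptions the 5 GHz channel occupies only about 4% of the carrier frequency, comfortably inside the roughly 10% figure quoted in the text.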
Although operation at even higher frequencies (e.g., above 300 GHz) can offer wider bandwidth and thus higher channel capacity, the transistor's limited maximum oscillation frequency ($f_{\max}$) sets a performance upper bound. Heuristically, frequencies up to ≈ $f_{\max}/2$ would be a range where the active devices exhibit sufficient gain and generate acceptable RF power, as will be discussed in Section II-B. Referring to the performance of the most advanced commercially available silicon (Bi)CMOS transistors, this range lies somewhere within the 100-300 GHz frequency range [38]-[40]. This assertion indicates that the use of modulation schemes offering high spectral efficiency in conjunction with a wideband radio/modem should open a pathway toward deployment of high data-rate THz transceivers. Even though high-speed wireless radios using conventional homodyne or heterodyne architectures have been disclosed lately [6]-[19], [24]-[37], [41], the inputs/outputs of these radios are still in the form of modulated baseband or intermediate frequency (IF) signals. To recover raw information bits, high-speed and high-resolution mixed-signal blocks are needed. The sampling rates of these data converters need to be at least two or four times the baud-rate of the modulated baseband and IF signals, respectively, so as to ward off aliasing. As will be explained in Section II-D, reducing the complexity of data converters and back-end DSP is of utmost importance, as it facilitates low-power and cost-effective high-speed wireless links for the mass consumer market. One can argue that designing an integrated ultra-high data-rate (e.g., above 50 Gbps) wireless transceiver would be practically impossible due to the excessive amount of power (as high as 10 W) consumed by the data converter and baseband units, unless totally new architecture-level solutions are explored. This power consumption problem will only be exacerbated if a multi-antenna architecture rather than a single-element transceiver at 100+ GHz is to be designed.
One effective way of boosting the data rate, link reliability, and co-channel user service, and of combating path loss, is to employ multi-antenna architectures [42]-[46]. Increasing the number of antennas results in channel hardening and reduction of small-scale fading (less multi-path and Doppler spread), which in turn simplifies baseband signal processing algorithms. Various configurations of multi-antenna architectures provide: (1) multiplexing gain to enhance link capacity through concurrent transmission of parallel data/user streams, (2) diversity gain to improve reliability of wireless links, especially in non-line-of-sight (NLOS) scenarios, through transmission of copies of the same data stream, and (3) antenna gain to combat path loss, integrated wideband noise, and co-channel interference through beamforming in LOS or directed NLOS scenarios. A modern multi-antenna system should provide multi-functionality beyond the beamsteering or signal-to-noise-ratio (SNR) improvement of a conventional phased-array system. We will discuss this notion in more detail in Section III-A. This paper makes an attempt to study THz transceivers from both system-level and circuit-level perspectives.

II. KEY PERFORMANCE PARAMETERS
When it comes to the design of THz transceivers, three major performance parameters should be taken into consideration. Operation at THz frequencies offers huge untapped frequency bandwidth. However, increasing bandwidth per user incurs several design challenges, which will be illustrated in Section II-A. Link reliability, coverage, and throughput are important parameters that mandate careful attention at THz frequencies. Section II-B will briefly go through the concept of multi-antenna communication as an effective way of increasing throughput, reliability, and coverage, along with its design challenges. Moreover, wireless communication at high frequencies should combat propagation loss, which in free space grows with the square of frequency. Section II-C will discuss the communication range and its associated design challenges. The communication speed and link performance can further improve with the aid of higher-complexity digital modulation schemes. Section II-D will briefly discuss design challenges associated with high-order modulation for THz transceivers.

A. BANDWIDTH INCREASE AND DESIGN CHALLENGES
Shannon theory predicts that wider bandwidth linearly increases the channel capacity in the bandwidth-limited regime, and has negligible impact on capacity in the power-limited regime. While bandwidth increase would be a straightforward way to boost data rate, other factors constrain its benefit when one accounts for the challenges of wideband transceiver design. A wideband design requires the transmitter/receiver RF chains to maintain high performance over a wide bandwidth. For the transmitter side, this includes high gain, high power and efficiency, high linearity, and low error vector magnitude (EVM). Likewise, the receiver front-end should demonstrate high sensitivity, low noise, and high gain/linearity over a wide bandwidth.
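The two regimes named above can be illustrated numerically with the Shannon formula $C = BW \log_2(1 + P/(N_0 BW))$. The following is a minimal sketch: the received powers, the noise density of about $4 \times 10^{-21}$ W/Hz (roughly kT at room temperature), and the function names are illustrative assumptions, not values taken from the paper.

```python
import math


def shannon_capacity_bps(bandwidth_hz, signal_power_w, noise_psd_w_per_hz):
    """Shannon capacity C = BW * log2(1 + P / (N0 * BW))."""
    snr = signal_power_w / (noise_psd_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


def power_limited_ceiling_bps(signal_power_w, noise_psd_w_per_hz):
    """Limit of C as BW grows without bound: P / (N0 * ln 2)."""
    return signal_power_w / (noise_psd_w_per_hz * math.log(2.0))


N0 = 4e-21  # W/Hz, assumed thermal noise density (about kT at 290 K)

# Bandwidth-limited (high SNR): doubling BW nearly doubles capacity.
strong = 1e-6  # W, assumed received power
gain_high_snr = (shannon_capacity_bps(2e9, strong, N0)
                 / shannon_capacity_bps(1e9, strong, N0))

# Power-limited (low SNR): doubling BW barely changes capacity.
weak = 1e-12  # W, assumed received power
gain_low_snr = (shannon_capacity_bps(20e9, weak, N0)
                / shannon_capacity_bps(10e9, weak, N0))

print(f"capacity gain from doubling BW at high SNR: {gain_high_snr:.2f}x")
print(f"capacity gain from doubling BW at low SNR:  {gain_low_snr:.3f}x")
```

In the low-SNR case the capacity is already close to the ceiling returned by `power_limited_ceiling_bps`, so extra bandwidth buys almost nothing.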
It is noteworthy that RF design often revolves around narrowband operation about a tuned center frequency. Exceeding 20% fractional bandwidth calls for circuits that employ high-order bandpass matching networks. Shown in Fig. 4(a)-(b) are examples of a narrowband tank and a wideband double-tuned circuit, respectively, with their magnitude and phase responses. Clearly, the phase response of the double-tuned circuit is highly nonlinear compared to that of the narrowband tank circuit, particularly when the two resonance frequencies are placed far from one another for wider bandwidth. The phase response can no longer be assumed linear or constant, thereby introducing phase distortion. Adding to these challenges is the fact that the constituent active devices within a wideband circuit exhibit frequency-dependent characteristics and nonlinearity, which leads to large distortion and in-band noise integration (and thus SNR degradation) over a wide bandwidth. Therefore, circuit design with a fractional bandwidth larger than 30% may not be a proper design strategy for data-rate increase.
One obvious way of keeping the fractional bandwidth within this range, while boosting the absolute bandwidth, is to increase the carrier frequency. Besides achieving wide bandwidth (while maintaining a relatively small fractional bandwidth), increasing the carrier frequency leads to smaller passive components. Most notably, the antenna size and spacing will decrease, making it possible to design multi-antenna architectures with large array size that improve diversity and spatial multiplexing. Despite these advantages, one cannot keep increasing the carrier frequency, for a number of reasons related to the principles of propagation and the limitations of silicon technology. Transceivers implemented in III-V semiconductor technologies with $f_{\max}$ > 1 THz have achieved ultra-high data rates at the low-THz range of frequencies [12]. Nonetheless, these technologies are not deemed suitable for integrated systems incorporating large antenna arrays due to low yield and integration density as well as high fabrication cost. Silicon technologies, on the other hand, present a much higher level of integration and may be considered the platform of choice for mass-marketing of ultra-high-speed transceivers. CMOS/BiCMOS transmitters/receivers operating in the THz band have been demonstrated by prior work [12], [16], [18]-[21], [23]. However, increasing the carrier frequency is limited by the device $f_{\max}$. It should also be pointed out that the MOS transistor's cutoff frequency does not keep increasing with device scaling and seems to peak at the 45 nm feature size, as clearly indicated in the plot of Fig. 3. Importantly, Fig. 3 also reveals that SiGe BiCMOS, with its cutoff frequency gracefully increasing with device density, seems to be a better technology for integrated THz transceivers compared to a silicon CMOS process.
Based on what has been stated, operation around and below $f_{\max}/2$ is considered to be a sweet spot for THz wireless systems, so that high performance (e.g., high gain, low noise, and high output power and power-added efficiency) can still be achieved. Given an $f_{\max}$ of around 350-500 GHz for commercially available silicon technologies, the frequency range around 100-200 GHz could be the spectral range of interest for tens-of-Gbps wireless speeds.
Moreover, designing THz transceivers with 20-30% fractional bandwidth entails yet another challenge, which is best understood by looking at the frequency response of a neutralized 32-nm SOI differential-pair device for different $C_n$ values, together with its Mason's U [16], in Fig. 5. In light of the degrading magnitude response, a higher-order matching circuit is needed to establish a flat frequency response over the bandwidth of interest. A more detailed study laying down steps to design a wideband THz amplifier will be provided in Section IV-A.

B. INCREASE IN THROUGHPUT/RELIABILITY AND DESIGN CHALLENGES
Increasing the capacity and reliability of wireless communication systems through the use of multiple antennas, first discovered in [47], has been an active area of research for the past 30 years. The well-known MIMO capacity of an N-element transceiver is [48]:

$$C = BW \log_2 \det\left[\mathbf{I}_N + \frac{E_s}{N\,N_0\,BW}\,\mathbf{H}\mathbf{H}^{H}\right] \tag{1}$$

where $C$ denotes the capacity, $\mathbf{I}_N$ is the identity matrix of size $N$, $\det[\cdot]$ indicates the matrix determinant, and $\mathbf{H}$ is the channel transfer function, representing transfer functions $h_{ij}$ from the $j$th transmit antenna to the $i$th receive antenna. It is noteworthy that multi-antenna architectures can also be used to obtain array and diversity gain in addition to capacity gain. Diversity combining exploits the fact that independent signal paths have a low probability of experiencing deep fades simultaneously. Thus, the idea behind diversity is to send the same data over independent fading paths. These independent paths are combined in such a way that the fading of the resultant signal is reduced, leading to higher reliability of communication. For an N-element multi-antenna transceiver, a maximum diversity gain of $N^2$ can be achieved and the average probability of error decreases as $1/\mathrm{SNR}^{N^2}$ [49], thereby improving EVM. Fading is not an issue for the line-of-sight communication established by THz transceivers; thus, the diversity-gain attribute may not be as essential as at RF frequencies.
In low-SNR THz channels, increasing the capacity is limited by both the transmitter output power and the receiver integrated noise. Adopting beamforming multi-antenna architectures to transmit sharp beams with highly directional antenna gains enhances the capacity by improving the SNR, thus making it possible to employ high-order modulation schemes to achieve higher spectral efficiency. While the SNR improves directly with array size, the beam also becomes increasingly more directive, leading to a narrow-beam antenna pattern.
On the other hand, in high-SNR THz channels with high diversity or rank order, exploiting multiplexing gain via propagation of independent signal streams through multiple distinct paths in different spatial and polarization domains can further enhance the channel capacity or multi-user service. In general, knowing the channel state information on the transmit side (CSIT), one can find the optimum power allocation across antennas. To approach the capacity limit, knowledge of CSIT is required. CSI acquisition can, however, be very costly. Furthermore, increasing the number of antennas for mm-Wave channels requires expensive spectral resources during CSI determination. At mm-Wave frequencies, CSI error due to pilot contamination is highly suppressed, and CSI can be acquired based on a ray-tracing model by estimation of the AOD (angle of departure), AOA (angle of arrival), and path gains. Additionally, the mm-Wave channel exhibits spatial/angular sparsity, and the number of resolvable paths for both indoor and outdoor communication is very low (i.e., less than four), resulting in a low-rank channel response matrix. Therefore, due to the low number of detectable paths, parameterized techniques such as interference cancellation (interference of clusters on each other), in addition to the MUSIC algorithm [50], can be utilized to distinguish between multi-paths (parallel data streams). Thus, DOA (direction of arrival) and LS (least squares) methods can be used to estimate path directions and path gains, respectively.
At THz frequencies, a single-element wireless link operates in the power-limited or low-SNR regime. As mentioned above, to realize the advantages of MIMO spatial multiplexing on capacity increase, we can employ beamforming to form beams and thus increase the SINR. This requires a large transceiver array, hence the notion of massive MIMO. Indeed, as N grows to a large value (e.g., a 128-element array), the MIMO channel capacity in the absence of CSIT approaches $C = N \times BW \log_2\left[1 + E_s/(N \times N_0 BW)\right]$ and hence grows linearly with N. One can leverage the benefits of both spatial multiplexing and beamforming concurrently, as will be discussed in Section III-A. For example, it is possible to use multiple beams, where each beam employs beamforming to increase SNR in power-limited situations, while also providing unique data streams on each of the beams using the same carrier frequency [51]. As mentioned above, beamforming leads to highly directive radiation, which makes transmitter-receiver re-alignment a challenging task.
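The capacity expressions in this subsection can be explored numerically. The sketch below assumes the standard equal-power log-det form $C = BW \log_2 \det[\mathbf{I} + \frac{E_s}{N N_0 BW}\mathbf{H}\mathbf{H}^{H}]$; the helper names and the example channels are hypothetical and chosen only for illustration. It uses plain Python (no external libraries) and contrasts a full-rank channel with orthonormal paths against a rank-one channel of the kind a sparse line-of-sight link tends toward.

```python
import math


def hermitian_gram(h):
    """Return H * H^H for H given as a list of rows (receive x transmit)."""
    rows, cols = len(h), len(h[0])
    return [[sum(h[i][k] * h[j][k].conjugate() for k in range(cols))
             for j in range(rows)] for i in range(rows)]


def log2_det(matrix):
    """log2(det) of a Hermitian positive-definite matrix by elimination."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = a[col][col]
        total += math.log2(abs(pivot))  # pivots are positive for such matrices
        for row in range(col + 1, n):
            factor = a[row][col] / pivot
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return total


def mimo_capacity_bps(h, es_w, n0_w_per_hz, bw_hz):
    """Equal-power MIMO capacity without CSIT (assumed log-det form)."""
    n_tx = len(h[0])
    rho = es_w / (n_tx * n0_w_per_hz * bw_hz)
    gram = hermitian_gram(h)
    size = len(gram)
    m = [[(1.0 if i == j else 0.0) + rho * gram[i][j] for j in range(size)]
         for i in range(size)]
    return bw_hz * log2_det(m)


N = 4
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
rank_one = [[1.0] * N for _ in range(N)]

# Normalized units: BW = 1, N0 = 1, so Es / (N0 * BW) = 8.
c_full = mimo_capacity_bps(identity, 8.0, 1.0, 1.0)
c_rank1 = mimo_capacity_bps(rank_one, 8.0, 1.0, 1.0)
print(f"orthonormal channel: {c_full:.3f} bit/s/Hz")
print(f"rank-one channel:    {c_rank1:.3f} bit/s/Hz")
```

For the orthonormal channel the result collapses to $N \log_2[1 + E_s/(N N_0 BW)]$ per hertz, which is the large-array expression quoted above. The rank-one channel instead behaves like a single stream with array gain, which is why beamforming rather than multiplexing is the natural tool when few paths are resolvable.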

C. COMMUNICATION RANGE AND DESIGN CHALLENGES
Communication range, R, is markedly affected by carrier-frequency scaling in multiple ways. First, the path loss, PL, increases in proportion to the square of both the frequency, f, and the range, R. The received signal power $P_R$ predicted by the Friis transmission equation (2) is degraded by the path loss and polarization mismatch:

$$P_R = P_T\, G_T\, G_R \left(\frac{c}{4\pi f R}\right)^{2} \mathrm{PLF} \tag{2}$$

where $P_R$, $P_T$, $G_R$, $G_T$ denote the received/transmitted powers and receive/transmit antenna gains, respectively, $c$ is the speed of light, and PLF is the antenna polarization loss factor. Although increasing the transmit power and the transmit/receive antenna gains helps increase the range, each comes with its own constraints and limitations.
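A quick link-budget calculation makes the frequency and range dependence concrete. The sketch below assumes the standard free-space Friis form $P_R = P_T G_T G_R (c/4\pi f R)^2 \cdot \mathrm{PLF}$ expressed in decibels; the function name and the example transmit power, antenna gains, and distance are illustrative assumptions, and atmospheric absorption is deliberately ignored.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def friis_received_power_dbm(pt_dbm, gt_dbi, gr_dbi, freq_hz, range_m,
                             polarization_loss_db=0.0):
    """Received power from the free-space Friis equation, in dBm.

    polarization_loss_db is the polarization loss factor expressed as a
    positive loss in dB (0 dB means perfectly matched polarization).
    """
    path_loss_db = 20.0 * math.log10(4.0 * math.pi * freq_hz * range_m / C_LIGHT)
    return pt_dbm + gt_dbi + gr_dbi - path_loss_db - polarization_loss_db


# Assumed example link: 10 dBm transmit power, 20 dBi antennas, 10 m range.
p_70 = friis_received_power_dbm(10.0, 20.0, 20.0, 70e9, 10.0)
p_140 = friis_received_power_dbm(10.0, 20.0, 20.0, 140e9, 10.0)
p_140_far = friis_received_power_dbm(10.0, 20.0, 20.0, 140e9, 20.0)

print(f"70 GHz, 10 m:  {p_70:.1f} dBm")
print(f"140 GHz, 10 m: {p_140:.1f} dBm")
print(f"140 GHz, 20 m: {p_140_far:.1f} dBm")
```

Doubling either the carrier frequency or the distance costs about 6 dB of received power, provided the antenna gains are held fixed. Note that for a fixed physical aperture the antenna gain itself rises with frequency, which is one reason arrays are attractive at these frequencies.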
To understand the underlying challenges behind increasing transmit power, suppose the last power amplification stage employs a differential neutralized circuit surrounded by input (output) matching networks with impedance transformation ratios of $n_{in}$ ($n_{out}$) and power losses $L_{in}$ ($L_{out}$), as shown in Fig. 6. The saturated output power, $P_{sat}$, of this stage is readily calculated as given in (3), where $G_{max}$ denotes the maximum available power gain (MAG). For a neutralized differential pair holding unconditional stability with K-factor $K_f > 1$, the MAG in terms of device $f_T$ and $f_{max}$ is given in (4). Equation (4) is still valid for a slightly over-neutralized device, when $C_n$ is slightly bigger than $C_{gd}$. $P_{1dB}$ in (3) for a quasi-differential pair, like the one in Fig. 6, assuming a short-channel MOS model, is given in (5) [52], where $\mu_0$ is the low-field mobility, $\theta$ is the fitting parameter accounting for mobility degradation, and $\nu_{sat}$ is the saturated velocity. Equations (3), (4), (5) present trade-offs between $P_{sat}$, operation frequency, and device dimension. Specifically, they imply that at high frequencies, the capability of the power amplifier to generate sufficiently high $P_{sat}$ is limited by lack of gain, $G_{max}$, and by its linearity. On the other hand, linearizing the device by increasing the device channel length adversely affects the MAG, thus compromising $P_{sat}$. One way of increasing the transmit power, and thus the range, is to increase the overall antenna gain through a transceiver array. The radiated power increase at the broadside of an N-element array could be as high as N times that of the single-element antenna [53]. This, however, comes at the cost of highly directive radiation.

D. HIGH-ORDER DIGITAL MODULATION AND DESIGN CHALLENGES
Increasing the modulation complexity improves spectral efficiency, thereby resulting in a higher data rate for a given bandwidth. If it is so effective, why not keep increasing the modulation complexity (e.g., 2048QAM, 4096QAM, etc.) at 100+ GHz center frequencies? To better understand the underlying challenges, we should revisit the structure of modern transceivers handling high-order modulation schemes. Modulation and demodulation in state-of-the-art transceivers are done in the digital domain, which means the entire mixed-signal, analog baseband, and RF chain should be able to process modulated signals with high peak-to-average power ratio (PAPR) and dynamic range. Furthermore, realization of higher-order modulation requires (a) local oscillators with lower phase noise, (b) data converters with higher resolution, and (c) an RF chain with high dynamic range (accounting for both high sensitivity and linearity), while operating at THz frequencies.
Generation and processing of high-speed high-order modulation exert particularly stringent requirements on the mixed-signal blocks (i.e., the analog-to-digital converter (ADC) on the receive side and the digital-to-analog converter (DAC) on the transmit side). In practice, the sampling rate of a Nyquist-rate data converter is chosen to be 5 to 6 times the baud-rate to improve SNR and bit-error rate. Furthermore, the required data-converter resolution increases with the modulation complexity, which becomes increasingly more challenging to attain at higher data rates. For instance, a 16QAM receiver targeting a bit-error rate (BER) of $10^{-4}$ requires a minimum resolution of around 6 bits to capture degradation due to thermal noise and component mismatch. Assuming that this receiver is designed to operate at 50 Gbps, the sampling rate of the data converter, designed to operate at 5 times the baud rate, is 62.5 GS/s. To better appreciate the challenges of designing such a data converter for a low-power transceiver, we have summarized state-of-the-art high-speed DAC and ADC performances in Tables 1 and 2. It is noteworthy that the power dissipation reported for all these data converter prototypes excludes the I/O and clock buffers, digital calibration, and clock generation circuits. Moreover, these prototypes do not include on-chip memory. Nevertheless, the effective number of bits (ENOB) and power dissipation are considerably compromised at higher sampling speeds. All these requirements should be met at the targeted 50+ Gbps data rate.
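The 62.5 GS/s figure follows directly from the bit rate, the modulation order, and the oversampling factor. The sketch below reproduces that arithmetic; the function names are hypothetical, and the final raw-throughput estimate additionally assumes separate I and Q converters, which is an assumption made here for illustration.

```python
import math


def baud_rate(bit_rate_bps, qam_order):
    """Symbol rate for a square QAM constellation of the given order."""
    return bit_rate_bps / math.log2(qam_order)


def converter_sampling_rate(bit_rate_bps, qam_order, oversampling=5):
    """Sampling rate when the converter runs at a multiple of the baud rate."""
    return oversampling * baud_rate(bit_rate_bps, qam_order)


bit_rate = 50e9  # 50 Gbps target from the text
fs = converter_sampling_rate(bit_rate, 16, oversampling=5)
print(fs / 1e9)  # 62.5 (GS/s), matching the figure quoted in the text

# Raw converter output, assuming two 6-bit converters (I and Q paths).
raw_bits_per_second = 2 * 6 * fs
print(f"raw ADC output: {raw_bits_per_second / 1e9:.0f} Gb/s")

# Higher-order modulation lowers the baud rate but needs more resolution.
for order in (4, 16, 64, 256):
    rate = converter_sampling_rate(bit_rate, order) / 1e9
    print(f"{order:>3}-QAM: {rate:.2f} GS/s at 5x oversampling")
```

Under these assumptions the converters emit 15 times more raw bits than the 50 Gbps payload, which hints at why the back-end DSP and interface power become a concern alongside the converters themselves.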
Besides the trade-off between signal-to-noise+distortion ratio (SNDR) and speed, the data converter's power consumption represents another major issue for a THz transceiver, as power dissipation is a super-linear function of frequency. As an example, according to a recently published work [55] summarized in Table 2, the power consumption of a CMOS [...] More importantly, a study in [61] predicts that the power consumption of clock generators for ADCs will increase quadratically with their speed and resolution, thus becoming significant at the speed of interest. For example, the lower bound of the power consumption of a phase-locked loop (PLL) clock generator fed by a 50-MHz external crystal oscillator with an excellent phase noise of -150 dBc/Hz at 1 MHz offset for a 6-bit 62.5 GS/s ADC is around 1.8 W. Adding this value to the power dissipation of the core ADC, the mixed-signal block alone can consume multiple watts of power, rendering conventional direct-conversion or low-IF transceivers impractical for THz wireless communication. Depicted in Fig. 7 are comprehensive surveys of published (Bi)CMOS ADCs [62] in two forms. In the plot demonstrating the relative noise floor achieved by ADCs published between 1995 and 2018, a lower limit of -160 dBc can be identified. The plot of the Walden figure-of-merit versus Nyquist frequency for the same group of ADCs shows a lower limit for the power per conversion step, which becomes worse with sampling rate. This explicitly means that higher-sampling-rate ADCs are incapable of achieving an arbitrarily high SNDR and ENOB. It is also noteworthy that technology scaling does not seem to mitigate this trade-off between bandwidth and SNDR.

III. ARCHITECTURE LEVEL TECHNIQUES
Section II elaborated that increasing bandwidth, modulation order, and transmit power faces fundamental barriers. This section will go through two architectures, namely (a) multi-antenna systems and (b) transceivers implementing modulation/demodulation directly in the analog/RF domains, that overcome the issues and challenges discussed in Section II.

A. MULTI-ANTENNA ARRAY ARCHITECTURES
The smaller size of passive components at THz frequencies makes it possible to think of having integrated multi-antenna transceiver arrays. Improvement of capacity and reliability of wireless communication systems through the use of multiple antennas has been an active area of research for over 25 years. Discovered by Paulraj and Kailath [47], MIMO wireless systems are now part of current standards and have been widely deployed for public use [63].
Much of the research effort on multi-antenna transceiver design has been centered around beamforming through phased-array implementations [64]-[70]. Implementation of the first phased-array system in silicon incorporated LO phase shifting [64]-[67], as shown in Fig. 8(a). The primary advantage of this architecture is that the phase shifters are placed away from the RF path. Therefore, they process single-tone LO signals rather than an RF signal, and their insertion loss has no effect on the RF power. Alternatively, a phased-array transceiver with RF phase shifting, shown in Fig. 8(b), relaxes the LO distribution network, which can be a considerable design challenge for large array sizes [68]-[71]. Phased arrays, however, offer only a limited subset of the features of a multi-antenna architecture, such as improvement of capacity and diversity.
An all-digital beamforming system [72], [73], shown in Fig. 11(a), enables multi-beam communication with the highest adaptability and data-rate. The digital beamforming (DBF) approach offers three major advantages [73]: (1) high magnitude and phase resolution can be achieved by digital precoding. (2) A DBF array can be used to superpose multiple beams for several data streams, thereby resulting in higher capacity. (3) For multicarrier signals, such as orthogonal-frequency-division-multiplexing (OFDM) signals, the fully DBF architecture can realize independent beamforming precoding at each subcarrier or resource block to obtain extraordinary performance at a wide signal bandwidth. On the other hand, a digital beamforming array demands dedicated RF chains for each antenna element and high dynamic-range for RF front-end. Moreover, the need for additional signal processing to facilitate multi-beam transmission as well as interference management in a multi-user environment mandates the requirement of digital baseband precoding and combining.
A possible approach to address the complexity and excessive power consumption issues in a conventional MIMO system is the code modulated path sharing multi-antenna (CPMA) architecture [74], [75], in which code multiplexing is used to combine the signals emerging from multiple antennas into a single RF/IF/baseband/ADC path. The primary advantages of CPMA include (1) a significant reduction in area and power consumption, and (2) amelioration of the crosstalk and power losses of the large LO routing/distribution network used in massive-MIMO transceivers. Fig. 9 shows an exemplary block diagram of the CPMA transceiver applicable to a massive MIMO base-station [42]. The CPMA architecture subdivides the entire antenna array of size N into groups of M antennas, and uses code multiplexing to combine the M signals onto a single RF path. The individual signals are easily extracted using code demodulation in the digital domain. Likewise, on the transmit side and prior to transmission, the signal for a given antenna is extracted using an RF code demodulator. The CMOS development and integration of this idea in [75] used mutually orthogonal Walsh-Hadamard code sequences due to their ease of implementation and their ability to achieve maximum MIMO capacity [74]. This M-fold reduction of RF/IF paths and ADC/DACs comes, however, at the price of an M-fold increase in the bandwidth and sampling rate of the ADC/DACs if orthogonal codes are used. It turns out that non-orthogonal codes can provide an acceptable trade-off between capacity and bandwidth [74]. Fig. 10 shows the plot of capacity versus code correlation coefficient, ρ (ρ = 0 for orthogonal codes), of a 4×4 MIMO system for four SNR values. It is evident from the plots of Fig. 10 that at high SNRs the capacity is affected by ρ more substantially, whereas at low SNRs the capacity is noise limited and is less affected by ρ.
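The path-sharing idea can be illustrated with a toy baseband model. The sketch below is not the circuit of [75]; it is a simplified, noise-free illustration with hypothetical function names, in which each of M antenna samples is multiplied by its own Walsh-Hadamard code, the chips are summed onto one shared path, and the individual samples are recovered by correlating against each code.

```python
def walsh_hadamard(order):
    """Sylvester construction of a +/-1 Hadamard matrix (order a power of 2)."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def code_multiplex(antenna_samples, codes):
    """Spread each antenna sample by its code and sum onto one shared path."""
    chips = len(codes[0])
    return [sum(sample * code[c] for sample, code in zip(antenna_samples, codes))
            for c in range(chips)]


def code_demultiplex(shared_path, codes):
    """Recover each antenna sample by correlating with its code."""
    chips = len(shared_path)
    return [sum(shared_path[c] * code[c] for c in range(chips)) / chips
            for code in codes]


M = 4
codes = walsh_hadamard(M)

# One complex baseband sample per antenna (arbitrary illustrative values).
samples = [0.5 + 1.0j, -1.25 + 0.0j, 2.0 - 0.5j, 3.0 + 0.25j]

shared = code_multiplex(samples, codes)      # M chips for one symbol period
recovered = code_demultiplex(shared, codes)  # back to M antenna samples

print(len(shared), "chips carry", len(samples), "antenna samples")
print(recovered)
```

The shared path carries M chips per symbol period, which is the M-fold bandwidth and sampling-rate penalty mentioned in the text. Replacing the orthogonal rows with correlated codes would leave residual cross-terms after correlation, which is the capacity-versus-ρ trade-off of Fig. 10.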
Alternatively, a hybrid architecture, depicted in Fig. 11(b), with both analog beamforming and digital MIMO coding has been pursued recently, as it reduces the complexity of the digital baseband with a smaller number of up/down-conversion chains in systems with a massive number of antennas, thereby emerging as a viable candidate for both outdoor and indoor mm-Wave/THz communication. The multi-beam digital baseband processing with analog beamforming facilitates both multiplexing and beamforming gain. Hybrid architectures can be designed to receive (or transmit) all data streams from all antennas in Fig. 11(b) (when $N = N_{RF}$), or to receive (or transmit) only a subset of data streams, $N_{RF}$ with $N_{RF} < N$, per antenna, leading to a sub-array system. In Fig. 11(b), the complex weighting coefficients are generally defined as $W_{i,k} = A_{ik} e^{j\varphi_{ik}}$ for $i \in \{1, \ldots, N\}$ and $k \in \{1, \ldots, N_{RF}\}$. $\varphi_{ik}$ and $A_{ik}$ are realized by RF phase shifters and variable gain attenuators and/or amplifiers (VGAs), respectively. The RF phase shifters are used for main-lobe steering, whereas the RF VGAs enable spatial filtering of interference by placing the null locations of each beamforming path toward the directions of the interference incident angles. The number of required RF chains $N_{RF}$ in a hybrid architecture is strictly lower-bounded by the number of parallel data streams K, while the beamforming gain is determined by the $N_{RF}$ complex weighting coefficients feeding each antenna in Fig. 11(b). In retrospect, a full array realizes the function of an all-digital architecture. The number of signal processing paths (from the digital baseband to the antenna front-end) is equal to $N_{RF} \times N$ for the sub-array and to $N^2$ for the full array. On the other hand, the beamforming gain of the sub-array is $N_{RF}/N$ of that of the full array. Therefore, a trade-off exists between signal processing complexity and beamforming gain in hybrid architectures.
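The role of the phase terms in the complex weights can be seen in a small array-factor calculation. The sketch below covers a single beamforming path only and assumes a uniform linear array with half-wavelength spacing; that geometry, the function names, and the example angles are assumptions made for illustration and are not taken from Fig. 11.

```python
import cmath
import math


def steering_weights(num_elements, steer_deg, spacing_wavelengths=0.5,
                     amplitudes=None):
    """Weights A_i * exp(j*phi_i) that point the main lobe at steer_deg."""
    amps = amplitudes if amplitudes is not None else [1.0] * num_elements
    sin_theta = math.sin(math.radians(steer_deg))
    return [amps[i] * cmath.exp(-2j * math.pi * spacing_wavelengths * i * sin_theta)
            for i in range(num_elements)]


def array_response(weights, angle_deg, spacing_wavelengths=0.5):
    """Magnitude of the combined signal for a plane wave from angle_deg."""
    sin_theta = math.sin(math.radians(angle_deg))
    total = sum(w * cmath.exp(2j * math.pi * spacing_wavelengths * i * sin_theta)
                for i, w in enumerate(weights))
    return abs(total)


N = 8
weights = steering_weights(N, steer_deg=20.0)

on_beam = array_response(weights, 20.0)   # coherent sum of all N elements
off_beam = array_response(weights, 50.0)  # partially cancelling sum

print(f"response toward 20 deg: {on_beam:.3f} (N = {N})")
print(f"response toward 50 deg: {off_beam:.3f}")
print(f"suppression: {20 * math.log10(on_beam / off_beam):.1f} dB")
```

With unit amplitudes the response in the steered direction equals N, which is the coherent combining gain of the path. Non-uniform amplitudes, as provided by the VGAs in the text, trade some of that gain for control over sidelobes and null positions.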
A recent circuit implementation of a hybrid architecture was presented in [76]. It utilizes the Cartesian combining concept to enable 2-stream reception. One important consideration in this design is that its implementation requires 8 splitters, 20 combiners, and 12 mixers for 2-stream reception. This large number of signal paths introduces electromagnetic crosstalk due to the many crossovers between these paths. Later, [77] unfolded a partially-overlapped beamforming-MIMO architecture capable of achieving higher beamforming and spatial multiplexing gains with a lower number of elements compared to conventional architectures. Reference [77] showed that overlapping the clusters in an N-element hybrid allows a larger number of antennas to be allocated per cluster, thereby resulting in higher beamforming gain compared to the corresponding N-element sub-array.

B. LOW-POWER DIRECT-RF-(DE)MODULATION TRANSCEIVERS
As discussed in Section II-D, research works on ultra-high-speed transceivers have not addressed an essential question: what are the power-efficient solutions for the DAC/ADC and DSP components of integrated transmitter and receiver chipsets that can handle data rates above 50 Gbps? More precisely, in prior work targeting these applications, the entire back-end and mixed-signal processing is carried out by an expensive commercial Arbitrary Waveform Generator (AWG) and a real-time oscilloscope, which generate high-power sub-channelized modulated signals off-chip to feed the transmit side, and equalize/demodulate/synchronize the RF signal and extract the baseband stream on the receive side. A front-end with an external AWG and real-time scope is certainly not practical, let alone power-efficient. The situation becomes far more severe when a single-element transceiver is extended to a multi-antenna architecture for the same ultra-high data-rate application domain. While the data rates achieved by prior work (e.g., [12]) are impressive, there is no discussion on how to implement high-speed signal generation and (de)modulation. In fact, the mixed-signal design challenges for these systems are unresolved. Specifically, while transceivers incorporating higher order modulations operate at a smaller RF bandwidth for a given data rate, they require significantly higher resolution and a DAC/ADC sampling rate higher than the signal baud rate.
References [20], [21], [23] entertained the idea of high-order (de-)modulation directly in the RF domain. Shown in Fig. 12(a) is the block diagram of two commonly used transmitter architectures, i.e., direct conversion and heterodyne. Notable in both structures is the fact that (de)modulation must be handled by the DSP. As pointed out in Section II-D, the DAC and DSP, handling most of the signal processing including pulse shaping and high-order modulation (e.g., 64QAM), should operate at 50+ GS/s for 100 Gbps wireless communication. In addition, the required DAC resolution increases with modulation order, adding complexity (and thus power dissipation). Similar issues arise if the conventional direct-conversion or heterodyne receiver schemes in Fig. 13(a) are used for high-speed wireless communication. On the other hand, delegating the modulator/demodulator function to the analog/RF domain, as indicated in Fig. 12(b), leads to a whole new generation of architectures that are amenable to higher speeds. In fact, assuming a power-efficient transmitter/receiver solution with such capability exists, great advantages readily come to fruition: (1) Power-hungry high-resolution and high-data-rate ADCs and DACs are removed. (2) The complexity of the baseband blocks is significantly relaxed.

1) DESIGN CHALLENGES: PULSE SHAPING, EQUALIZATION, CARRIER SYNCHRONIZATION
Pulse shaping is commonly employed in a wireless transmitter to mitigate the problem of excessive bandwidth. As stated in [23], pulse shaping with root-raised cosine (RRC) filters requires a multi-bit resolution DAC with a sample rate of more than twice the baud rate. Putting aside the daunting challenges of building such DACs and digital filters at ultra-high speeds, the fundamental role of pulse shaping is revisited first. Correlation detection, or equivalently matched filtering, is at the core of any communication system to maximize the SNR before the decision-making circuitry in the receiver. Any pulse shape (not limited to RRC) theoretically has the same SNR performance under the same noise power density $N_0$ [78], so long as the pulse shapes are matched in the transmitter and receiver. This degree of freedom allows us to design a pulse-shaping filter in the analog domain (cf. Fig. 12(b)).
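The statement that matched filtering yields the same SNR for any pulse shape of equal energy can be checked numerically. The sketch below is a discrete-time illustration under assumed conditions (white noise with a fixed per-sample variance, two arbitrary example pulses); it is not a model of the analog filter in Fig. 12(b).

```python
import numpy as np

def matched_filter_snr(pulse: np.ndarray, noise_var: float) -> float:
    """SNR at the sampling instant of a filter matched to `pulse`.

    Assumes additive white noise with variance `noise_var` per sample.
    """
    pulse = np.asarray(pulse, dtype=float)
    h = pulse[::-1]                                   # matched filter: time-reversed pulse
    peak = np.convolve(pulse, h)[len(pulse) - 1]      # equals the pulse energy sum(p^2)
    noise_out = noise_var * np.sum(h ** 2)            # noise power after filtering
    return peak ** 2 / noise_out

def unit_energy(p: np.ndarray) -> np.ndarray:
    """Scale a pulse so that sum(p^2) = 1."""
    return p / np.sqrt(np.sum(p ** 2))

n = np.arange(64)
rect = unit_energy(np.ones(64))                          # rectangular pulse
half_sine = unit_energy(np.sin(np.pi * (n + 0.5) / 64))  # half-sine pulse

snr_rect = matched_filter_snr(rect, noise_var=0.1)
snr_half_sine = matched_filter_snr(half_sine, noise_var=0.1)
```

Both pulses give an output SNR equal to the pulse energy divided by the noise variance, independent of the pulse shape.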
Equalization is crucial in any type of communication modality, as channel impairments severely degrade the BER. The bandwidth of the free-space channel is typically wide enough, and the bandwidth limitation primarily comes from the transmitter front-end. Therefore, circuit techniques that can achieve a very flat frequency response across a wide bandwidth are the key to pushing the limit of the data rate. In the context of the proposed direct modulation/demodulation-based transceiver architecture, analog equalizers and bit recovery/retiming circuits follow the RF demodulator, as shown in Fig. 13(b), in a similar way as in broadband wireline receivers [79]. The clock recovery circuit in Fig. 13(b) generates the clock signal for symbol synchronization and retiming. One distinction worth mentioning is that, in the context of wireless systems, complex-domain equalization techniques may be necessary to account for the asymmetry between the lower and upper sidebands induced by RF building blocks such as the PA. All in all, design considerations on equalization may not be one-dimensional when multiple practical limitations are accounted for. Similar to a wireline transceiver, we can employ both transmitter-side and receiver-side equalization. On the transmit side, a low-order feedforward equalizer can be used to equalize the channel and attenuate pre-cursor intersymbol interference (ISI). On the RF-demodulation-type receiver side handling high-order modulation, the bit extraction and recovery following I/Q downconversion will be performed on a multi-level PAM (PAM-M) signal [20]. This means the equalization should be able to improve both the horizontal and vertical eye opening, which indicates that an M-step bit-by-bit equalization can be a plausible method.

FIGURE 14. (a) Conceptual block diagram presenting RF-8PSK modulation concept and simplified direct RF-8PSK modulation transmitter [21]. (b) The signal space of 8PSK symbols partitioned using 8 LO phases, and conceptual block diagram of an RF-8PSK receiver incorporating multi-phase RF-correlation [20], sign-check comparison, carrier synchronization, and symbol recovery.

2) CASE STUDIES
We briefly go through three case studies, namely, an RF-8PSK transmitter, an RF-8PSK receiver, and an RF-QAM transmitter.
RF-8PSK transmitter: Aiming for a bits-to-RF transmitter and an RF-to-bits receiver that can overcome the aforementioned challenges, we investigate direct RF modulation and demodulation, which has so far been employed only for OOK and QPSK schemes, to construct higher order modulations. One immediate extension would be an RF-8PSK modulator and demodulator. To do so, we start with a QPSK modulator and introduce an additional level of phase modulation to the QPSK output so as to create two versions, a QPSK and an Offset-QPSK constellation, and enable one of these schemes based upon the status of the third bit (cf. Fig. 14(a)) [21]. To avert wideband RF phase shifters at THz frequencies (e.g., 170 GHz in [21]) and their significant insertion loss in the RF path, this additional phase modulation is deployed in the LO path and placed prior to the quadrature mixers, as shown in Fig. 14(a). Specifically, two switchable phase shifters (SPSs) realized by passive all-pass filters vary the phase of both I and Q signals to construct 8PSK modulation in the RF domain. The I/Q SPS phase shifts are controlled by the third input bit stream, $B_2$, to take on one of two values, 45° or 0°. Fig. 14(a) constitutes the core of a bits-to-RF transmitter presented in [21]. Here, to generate higher local oscillator power, we used two separate SPSs, although a single SPS can alternatively be employed prior to the I/Q generation circuit.
RF-8PSK receiver: Likewise, 8PSK demodulation can be carried out directly in the RF/analog domain, which avoids the use of ultra-high-speed high-resolution data converters and sophisticated digital demodulation using an off-the-shelf FPGA or back-end DSP [20]. Shown in Fig. 14(b) is the conceptual block diagram of the RF-8PSK receiver. Fundamentally based on an advanced version of a direct conversion scheme, this receiver employs an 8PSK demodulator, which is directly realized in the RF/analog domain, thereby obviating the need for ultra-fast DSP and mixed-signal building blocks. The RF demodulator employs four correlation-based detectors driven by multi-phase LO signals with 45° phase differences. The decision circuitry is comprised of ultrahigh-speed comparators and simple logic circuits only. The RF correlator is essentially a mixer followed by a lowpass filter (LPF). As shown in Fig. 14(b), the 2-D signal space is partitioned into eight angular areas, where each symbol is located in the middle and has maximum Euclidean distance from the boundaries. To achieve this optimum detection, the received signal phase reference and the LO phase are purposely offset by 22.5°. As a result, the error tolerance in detecting the symbols is maximized. Only the polarity of each LPF output is needed to determine the symbols. Assuming Gray coding for the three-bit symbols, the bits can be decoded with much simpler hardware, which enables low-complexity, yet high-speed, baseband circuits that can resolve the symbols. The three bits $B_2$, $B_1$, $B_0$ of each symbol are extracted with simple logic circuits from the re-timed sign-check comparator outputs of the RF correlators, i.e., $B_2 = \bar{Y}$, $B_1 = W$, and $B_0 = X \oplus Z$. There is no need for any explicit ADC in this design, and symbols are detected using only sign-check comparators, which essentially perform BPSK decisions, and simple logic functions [20].
The carrier synchronization is achieved using an extension of a Costas loop, whereby the downconverted signals out of the multi-phase correlators are low-pass-filtered, and the outputs are then fed to multipliers to detect the phase and appropriately adjust the voltage-controlled oscillator (VCO) phase. The symbol and clock recovery are accomplished using a clock recovery circuit and data retimers, similar to the way it is done in high-speed wireline receivers.
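The sign-only detection principle of the RF-8PSK receiver can be sketched in complex baseband. The sketch assumes ideal correlators, symbols placed at multiples of 45°, and LO phases at 22.5° plus multiples of 45°. Because the assignment of the outputs W, X, Y, Z to specific correlators is not given here, a lookup table (`SIGN_TO_SYMBOL`, a hypothetical name) stands in for the logic equations of [20], and no Gray mapping is implied.

```python
import numpy as np

# Four LO phases spaced by 45 degrees, offset by 22.5 degrees from the symbol phases.
LO_PHASES = np.deg2rad(22.5 + 45.0 * np.arange(4))

def correlator_signs(rx: complex) -> tuple:
    """Polarity (1 = positive) of the four correlator outputs for one symbol.

    A mixer followed by an LPF with LO phase phi is modeled as Re{rx * exp(-j*phi)}.
    """
    outputs = np.real(rx * np.exp(-1j * LO_PHASES))
    return tuple(int(v > 0) for v in outputs)

# Sign pattern produced by each ideal symbol k, located at an angle of k * 45 degrees.
SIGN_TO_SYMBOL = {
    correlator_signs(np.exp(1j * np.deg2rad(45.0 * k))): k for k in range(8)
}

def detect_8psk(rx: complex) -> int:
    """Return the symbol index using only the signs of the correlator outputs."""
    return SIGN_TO_SYMBOL[correlator_signs(rx)]

# Example: symbol 3 received with a 15-degree phase error and reduced amplitude.
detected = detect_8psk(0.7 * np.exp(1j * np.deg2rad(45.0 * 3 + 15.0)))
```

The eight sign patterns are distinct, so four comparators suffice, and any phase error smaller than 22.5° leaves the decision unchanged regardless of amplitude.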
RF-$4^M$QAM transmitter: QPSK was observed by prior work to be amenable to analog implementation at ultra-high data rates [18], [19]. Can we construct higher order QAM modulations using a QPSK scheme with easily realizable operations/functions in the RF domain? Starting with a simpler form of $4^M$QAM, e.g., 16QAM, we explore a modular approach to generate this constellation from a QPSK scheme. At first glance, a 16QAM constellation is clearly comprised of four QPSKs across the four quadrants of the complex plane (see Fig. 15(a)). An alternative perspective is to start with a QPSK cluster, and replicate it around four distinct origins, indicated by gray rectangles in Fig. 15(a). To construct a 16QAM constellation directly in the RF domain, only two QPSK clusters are needed, QPSK1 with a symbol spacing of 2d and QPSK2 with a symbol spacing of 4d. QPSK2 is responsible for mapping the (0, 0) origin to four origins located at (−2d, −2d), (−2d, +2d), (+2d, −2d), (+2d, +2d). A Cartesian vector summation of QPSK1 symbols with those of QPSK2 in the complex plane generates four new random symbols around each new origin in each quadrant, thereby resulting in a 16QAM constellation [23], [80]. Similarly, an RF-64QAM constellation is directly realized by replicating an RF-16QAM constellation across four quadrants and around four origins obtained by another QPSK cluster, QPSK3, with a symbol spacing of 8d. In general, to construct a high-order $4^M$QAM scheme, this procedure employs M QPSKs with symbol spacings of $2^k d$ ($k = 1, \ldots, M$). Once M QPSK clusters are generated in the RF domain, this iterative procedure only requires scaling and vector summation, which are easily implemented in the analog (or RF) domain [23]. Therefore, M QPSK signals having a constant magnitude ratio of two are combined in order to build the RF-$4^M$QAM constellation (shown in Fig. 15(b)). A conceptual block diagram of an RF-16QAM transmitter is shown in Fig. 16.
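The iterative construction above can be verified with a few lines of code. This is a minimal sketch of the constellation geometry only (ideal clusters, no noise or circuit effects); the function names are hypothetical.

```python
import numpy as np

def qpsk_cluster(spacing: float) -> np.ndarray:
    """Four QPSK points whose neighboring symbols are `spacing` apart."""
    half = spacing / 2.0
    return np.array([complex(i, q) for i in (-half, half) for q in (-half, half)])

def rf_qam_constellation(m: int, d: float = 1.0) -> np.ndarray:
    """Vector-sum M QPSK clusters with spacings 2^k * d, k = 1..M.

    Returns the 4^M resulting constellation points.
    """
    points = np.array([0j])
    for k in range(1, m + 1):
        cluster = qpsk_cluster((2 ** k) * d)
        # Every existing point is replicated around each point of the new cluster.
        points = (points[:, None] + cluster[None, :]).ravel()
    return points

qam16 = rf_qam_constellation(2)   # QPSK1 (2d) + QPSK2 (4d)
qam64 = rf_qam_constellation(3)   # adds QPSK3 (8d)
```

With d = 1, the 16 points fall on the levels ±1 and ±3 along each axis, and the 64 points on ±1, ±3, ±5, ±7, i.e., the regular square QAM grids.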
Assuming a random data stream, the error vectors in each QPSK constellation follow a two-dimensional Gaussian distribution with no correlation. Since each QPSK signal is constructed from independent random bits, the overall error vector power in an RF-$4^M$QAM constellation after addition is shown to be the weighted summation of the EVMs of the M constituent QPSK signals. In the special case where all QPSK signals exhibit the same EVM, the high-order QAM EVM is equal to the QPSK EVM [23]. This attribute brings along a number of advantages: (1) The QPSK signal can be generated using only symbol-rate timing with negligible degradation [23]. This readily relaxes the speed requirement of the mixed-signal interface by more than half compared to a conventional DAC-based transmitter. (2) With no high-resolution high-speed DAC present in this transmitter, the high-frequency linearity bottleneck is dramatically alleviated, because amplitude linearity is no longer critical in achieving a low EVM for a constant-amplitude QPSK signal. (3) A precise magnitude ratio of two between any two side-by-side QPSK signals, as indicated in Fig. 15(b), is easily obtained by fine-tuning the DC bias current of each QPSK modulator, as shown in Fig. 16. This reveals yet another advantage compared to the current-source trimming associated with a DAC circuit: only DC bias tuning is used in the RF-QAM modulation scheme to maintain the magnitude ratio of two between QPSK modulators, instead of the high-speed RF switching in a high-speed DAC within a conventional transmitter whose modulation is realized in the digital domain. (4) Processing QPSK signals imposes a much more relaxed linearity requirement compared to a $4^M$QAM signal in a conventional transmitter. This has significant implications for the design of the front-end power amplifier operating above 100 GHz. To leverage this attribute, the PA at the power combiner output in Fig. 16, which amplifies a high-PAPR $4^M$QAM signal, can be removed. Instead, each QPSK path employs a PA circuit prior to the power combiner, which now handles a low-PAPR constant-amplitude signal.
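The weighted-sum property of the EVM can be written out in a short derivation. It is a sketch under stated assumptions rather than the derivation of [23]: the EVM is taken as the rms error normalized to the rms signal amplitude, $P_k$ denotes the average power of the k-th QPSK signal, $\sigma_k^2$ its error-vector power, and the symbols and errors of different QPSK paths are independent and zero-mean, so that powers add.

```latex
\begin{align}
  \mathrm{EVM}_k^2 &= \frac{\sigma_k^2}{P_k}, \qquad k = 1, \ldots, M, \\
  \mathrm{EVM}_{\mathrm{QAM}}^2
    &= \frac{\sum_{k=1}^{M} \sigma_k^2}{\sum_{k=1}^{M} P_k}
     = \frac{\sum_{k=1}^{M} P_k \, \mathrm{EVM}_k^2}{\sum_{k=1}^{M} P_k}, \\
  \mathrm{EVM}_k = \mathrm{EVM}_0 \;\; \forall k
    \;&\Longrightarrow\; \mathrm{EVM}_{\mathrm{QAM}} = \mathrm{EVM}_0 .
\end{align}
```

With the magnitude ratio of two between adjacent QPSK signals, the weights scale as $P_k \propto 4^k$, so the largest QPSK cluster dominates the overall EVM.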
In summary, this RF-$4^M$QAM method shows elevated performance at much lower power consumption at THz frequencies compared to the digital realization of QAM in conventional transmitters, because ultra-high-speed low-EVM QPSK signals can readily be constructed [19]. Finally, efforts are currently under way to explore power-efficient RF-$4^M$QAM demodulators.

IV. CORE BUILDING BLOCKS FOR THz TRANSCEIVERS
A. THz AMPLIFIER DESIGN
Amplifiers designed for the THz frequency range should exhibit high performance at center frequencies beyond 100 GHz, while covering a 20-30% fractional bandwidth. The power gain $G_p$ of an amplifier in terms of its MAG, $G_{\max}$, is readily calculated to be:
Around $f_{\max}$ frequencies, source and load conjugate matching at the amplifier's center frequency $f_c$ is critical due to the low available gain of transistors. This leaves little or no room to realize matching networks aiming to widen the amplifier's bandwidth on the high side of the passband all the way to the upper corner frequency $f_H$, where $G_{\max}$ is dropping from its value at $f_c$. Recently, a few silicon-based THz amplifiers have been reported [16], [81], [82]. Great efforts have been made to improve the power gain of a THz amplifier by introducing various kinds of "embedding network" to the core device, as exemplified in the generic block diagram representation in Fig. 17 [83], [84]. To gain better insight into THz amplifier design, we will look into the effects of active and passive (mainly due to matching networks) components separately.
Actives: Although modern silicon technologies provide transistor devices with close to half-THz $f_{\max}$, the $G_{\max}$ limitation at around $0.5 f_{\max}$, loosely defined as near-$f_{\max}$ frequencies, is a bottleneck for THz amplifier design. Several powerful approaches to design amplifiers with power gain close to the MAG were proposed [16], [81]-[84]. Most notably, the gain-plane approach provides a graphical representation where the contours of constant power gain are plotted within the $\mathrm{Im}[U/A] - \mathrm{Re}[U/A]$ plane (U is Mason's U [16], and $A = Y_{21}/Y_{12}$ is the maximum stable gain (MSG)), as demonstrated in Fig. 18. Any embedding network surrounding the main amplifier is represented by a locus crossing these constant-gain contours [85], [86], as also shown in Fig. 18.
Our analysis in [83] proved that the maximum value of the MAG, equal to $\max[G_{\max}] = (2U - 1) + 2\sqrt{U(U - 1)}$, is achieved if and only if the imaginary part of A is zero and the device operates at the edge of the unconditional stability region, i.e., $\mathrm{Im}[A] = 0$ and $K_f = 1$. Equation (7) sets forth the necessary and sufficient condition for an RF amplifier to attain the theoretical upper limit of its power gain. Moreover, two implications can be inferred from Eq. (7). First, it corroborates the commonly known intuitive approach, which primarily maintains that pushing the device towards its instability region results in higher power gain. This is because the maximum power gain always occurs at the edge of the unconditional stability region. Second, setting $K_f = 1$ is not sufficient for the amplifier to reach its maximum power gain. This is because the imaginary part of $U/A$ must also be zeroed (U being real), meaning that the phases of $Y_{21}$ and $Y_{12}$ must be the same.
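For a quick sense of scale, the upper limit $\max[G_{\max}] = (2U - 1) + 2\sqrt{U(U - 1)}$ quoted above can be evaluated numerically. The helper names and the sample values of U below are illustrative assumptions, not device data.

```python
import math

def max_gmax(u: float) -> float:
    """Upper limit of the maximum available gain for a given Mason's U (linear units)."""
    if u < 1.0:
        raise ValueError("expression requires U >= 1")
    return (2.0 * u - 1.0) + 2.0 * math.sqrt(u * (u - 1.0))

def to_db(gain: float) -> float:
    """Convert a linear power gain to decibels."""
    return 10.0 * math.log10(gain)

# Illustrative values of U (assumed, linear scale).
for u in (1.0, 2.0, 10.0, 100.0):
    print(f"U = {to_db(u):5.1f} dB -> max[Gmax] = {to_db(max_gmax(u)):5.1f} dB")
```

For large U the expression approaches 4U, i.e., roughly 6 dB above U, whereas it collapses to unity gain as U approaches 1.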
Two basic types of embedding networks for a single device can be conceived: Y- and Z-embedding (see Figs. 19(a)-19(b)). Usually, reactive elements (e.g., inductors or capacitors) are used to realize these networks. Acting as a local shunt feedback, the Y-embedding network can readily be characterized using Y-parameters. More precisely, $Y_f$ (the admittance of the Y-embedding network) is added to both $Y_{11}$ and $Y_{22}$, while being subtracted from $Y_{12}$ and $Y_{21}$. A similar observation is made for the Z-embedding network, which acts as a local series feedback. The most widely used Y-embedding network is a pair of cross-connected capacitors, $C_n$, in a differential pair, as in Fig. 20, acting as a neutralizing network [87].
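The Y-embedding rule stated above maps directly onto a two-port calculation. The sketch below applies it to an arbitrary example Y-matrix and evaluates the standard Rollett stability factor in Y-parameter form; the numerical values are made up for illustration and do not represent any device in this paper.

```python
import numpy as np

def y_embed(y: np.ndarray, y_f: complex) -> np.ndarray:
    """Apply a shunt-feedback (Y-embedding) admittance y_f to a two-port.

    y_f is added to Y11 and Y22 and subtracted from Y12 and Y21.
    """
    y = np.asarray(y, dtype=complex)
    return y + y_f * np.array([[1, -1], [-1, 1]])

def rollett_k(y: np.ndarray) -> float:
    """Rollett stability factor K computed from two-port Y-parameters."""
    y = np.asarray(y, dtype=complex)
    y11, y12, y21, y22 = y[0, 0], y[0, 1], y[1, 0], y[1, 1]
    return (2.0 * y11.real * y22.real - (y12 * y21).real) / abs(y12 * y21)

# Hypothetical device Y-parameters in mS (illustrative only).
y_dev = np.array([[1.0 + 1.0j, -0.2j],
                  [10.0 - 2.0j, 0.5 + 0.5j]])
y_emb = y_embed(y_dev, y_f=0.1j)      # small capacitive feedback element
k_before = rollett_k(y_dev)
k_after = rollett_k(y_emb)
```

Sweeping `y_f` and recomputing K and $|Y_{21}/Y_{12}|$ is one simple way to trace the kind of locus that an embedding network draws in the gain plane.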
An important notion, which is often missed in the design of THz amplifiers, is how to maintain gain flatness and small group-delay variation across a frequency band as wide as 20-30% of the center frequency. A layout-parasitic-extracted differential pair with W/L = 32 μm/32 nm in a 32 nm CMOS SOI process was simulated under four distinct neutralization capacitors, $C_n$, to study $G_{\max}$, and the simulation result is shown in Fig. 20. For $K_f \le 1$, the simulated $G_{\max}$ is indeed equal to the MSG, A, and the corner frequency, $f_{cor}$, of the circuit of Fig. 20 is readily derived to be:
where $C_X = C_n - C_{gd}$. A study of the gain plots in Fig. 20 reveals that the gain falls faster for frequencies above $f_{cor}$. This faster gain drop implies that the frequency range of interest for the wideband amplifier should be selected below $f_{cor}$. This is because, for a transceiver handling a high-order modulation scheme, the flatness of the front-end amplifier's frequency response is important. Otherwise, a degradation of EVM due to amplitude fluctuation is to be expected. In addition, the power gain would be significantly compromised to achieve a flat frequency response above $f_{cor}$. Therefore, for THz wideband amplifier design, the upper limit, $f_H$, of the frequency response is set to be equal to $f_{cor}$, i.e., $f_H = f_{cor}$.
Assuming that $f_H = f_{cor}$ and $K_f < 1$ for $f < f_{cor}$, the embedded device becomes conditionally stable. In this case, the deployment of matching networks having frequency-dependent loss at the input and output ports of the amplification stage should increase the overall stability factor so that $K_f \to 1$. As a consequence, the core amplification stage with matching networks can be designed to operate at the boundary of unconditional and conditional stability across the amplifier's frequency band. This is shown in Fig. 21, which gives a graphical illustration of the design methodology for a wideband THz amplifier.
In practice, the input and output impedances of a THz amplifier are determined by the preceding stage as well as the antenna load in the case of a front-end power amplifier, which cannot be an ideal resistive 50 Ω across the band. Moreover, the Bode-Fano criterion also specifies fundamental limitations for wideband input/output matching networks [88]. Therefore, a reliable THz amplifier design should guarantee stability under any loading condition across the entire operating frequency range, or equivalently, $K_f \ge 1$ across the entire bandwidth. The matching network plays an important role in satisfying this condition, as will be discussed next.
Matching Network: Reference [16] introduced over-neutralization to achieve gain boosting in amplifiers designed to operate at near-$f_{\max}$ frequencies. It is, however, noteworthy that over-neutralization is essentially a narrowband technique, meaning that it helps boost the power gain while shrinking the bandwidth. The input-output and inter-stage matching networks play a critical role in widening the frequency range. Fig. 21 shows an illustration of the amplifier design methodology [89]. The input/output matching networks mainly perform two tasks, i.e., (1) they re-shape the frequency response with acceptable reflection and insertion losses, and (2) they stabilize the gain stage so that the entire amplifier becomes unconditionally stable. As mentioned before, gain flatness and stability are considered to be the key performance targets in amplifier design with 20-30% fractional bandwidth. As such, the matching networks and their associated loss are designed so as to satisfy these performance targets. Fig. 22 shows a differential amplification stage with input and output matching circuits, where $n_{in}$ and $n_{out}$ represent the input and output transformation ratios. $Y_{A,T}$ and $K_{f,T}$ are the overall network Y-parameter matrix and stability factor, respectively. To guarantee the stability and maximum gain requirements, the amplifier with matching networks is designed so that $K_{f,T} \ge 1$. The matching loss can be quantified using the network Q. The analysis of the circuit in Fig. 22 leads to the following relationship between the network Q and the core amplifier K-factor, $K_f$ [89].
Specifically, Eq. (9) provides a clear relationship between $K_f$ and the matching loss that leads to the bandwidth of interest. Two approaches can be considered to design matching networks at THz frequencies. One approach uses T-junction matching networks to accomplish impedance matching. The microstrip-based structures are easily modeled using electromagnetic (EM) simulation tools. Shown in Fig. 22 is the schematic of a neutralized amplifier with microstrip T-section input and output matching networks. The major issue with the T-section network is that it involves piecewise partial rotations of the immittance on the Smith chart, which essentially results in narrowband matching. The same figure also shows the same core amplification stage with interstage transformer matching networks. The transformer-based interstage matching network can be viewed as a double-tuned passive network. The transformer bandwidth is a function of the coupling coefficient and the loaded quality factors of the primary and secondary sides, and can increase by as much as 40% of the center frequency at the expense of larger in-band ripples. It is thus conceivable that front-end amplifiers intended for high-data-rate wideband applications will use a version of transformer-based matching.

B. THz LO GENERATION AND DISTRIBUTION
As the operation frequency increases, the implementation of low phase-noise LOs with adequate tuning range and output power becomes increasingly challenging. For a transceiver array with large array size, distribution of the LO across all the constituent transceivers adds another level of design hurdle. To address the LO distribution challenge, the core synthesizer can employ a subharmonic PLL at 1/M th of the desired LO frequency. The LO distribution network will then carry LO signal at much lower frequency which is then boosted to the desired range with the aid of local frequency multipliers [90]. We will briefly discuss the LO generation and distribution in this section.
At the core of the PLL lie the VCO and divider chain, which determine the phase noise, tuning range, and output power [52]. At THz frequencies, a multi-port oscillator circuit with multiple active devices exciting a multi-port passive structure, as a shared resonator, can produce much higher oscillation power and efficiency than current oscillators [91]. However, the use of multiple independent elements to excite a multi-port passive structure can lead to multiple stable oscillation states [92]-[94]. Inherent to such systems, this attribute calls for a careful study that can quantitatively discover the oscillation conditions that result in different oscillation states within a multi-port structure consisting of multiple active devices and passive networks. Reference [91] conducted a comprehensive study of multi-port oscillators. Among multi-port oscillators, the ones with circular geometry are considered to be a viable topology for low-noise THz oscillators [95]. Assuming an N-port circularly-symmetric passive network Y, an N-port circularly-symmetric excitation network $\tilde{Y}$ exists which can potentially generate oscillation once it is connected to this circularly-symmetric passive network. The low phase-noise attribute of the circularly-symmetric multi-port oscillator is easily understood once it is viewed as a system of N coupled oscillators. It is commonly known that a well-designed system of N coupled oscillators should have $20 \log_{10} N$ lower phase noise than a single oscillator. Another critical advantage of a multi-port circularly-symmetric oscillator within the context of a phase-locked LO design is that the next-stage frequency divider, whose performance can be as critical as that of the VCO, can be designed to employ a multi-injection architecture. One example of a multi-injection frequency divider, showcasing the MOS implementation of the circuit presented in [96], is depicted in Fig. 24, where three phases of a three-stage VCO output are injected into nodes $B_k$, $1 \le k \le 3$.
In this circuit, the amplifying stages ($T_{a2}$, $T_{b2}$, $T_{c2}$) with the transmission lines $L_{DM}$ and $L_{DB}$ form the divider's three-stage ring, and transistors $T_{a1}$, $T_{b1}$, $T_{c1}$ act as the three mixing cells. The three-phase input signals coming from the preceding three-stage VCO are fed to the gate terminals of the mixing cells, and subsequently mixed with the loop's 3rd-harmonic signals. The mixer outputs ($\frac{1}{4}f_0 \angle 0°$, $\frac{1}{4}f_0 \angle 120°$, $\frac{1}{4}f_0 \angle 240°$) flow back to the loop at three injection points.
One example of an LO distribution network and a sub-THz harmonic-based frequency tripler are shown in Fig. 25(a) and 25(b), respectively. In this example, the lower frequency, $f_0$ (e.g., 13 GHz), is generated by a PLL. The LO distribution network, also operating at this much lower frequency (e.g., 13 GHz), carries the signal to K transceivers. The LO frequency is locally boosted to the desired frequency (e.g., 117 GHz) by a cascade of local frequency multipliers (e.g., the two triplers in Fig. 25(a)). Though the in-band phase noise in this synthesizer is magnified by $20 \log_{10} M$ due to frequency multiplication, this degradation would be offset by the carrier-frequency-dependent term in Leeson's equation, leaving the improvement in Q-factor, output swing, and tuning range at low frequencies as added bonuses to the overall phase noise improvement. Due to its inherently wideband characteristic, a harmonic-based frequency multiplication approach is preferred for wideband LO generation. Following the design principle presented in [90], one example of a high-frequency harmonic-based frequency tripler is shown in Fig. 25(b). The circuit is comprised of two differential cascode stages driving broadband T-coils and an interstage matching network, where the first cascode stage is biased in the class-C region to maximize the 3rd-harmonic generation efficiency (as was shown in our prior work [97]) and the second stage acts as a wideband amplifier.
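The frequency plan and the $20 \log_{10} M$ phase-noise penalty of the multiplier chain can be checked with a small calculation. The function name `lo_plan` is hypothetical; the numbers are the example values quoted above (a 13 GHz PLL followed by two triplers).

```python
import math

def lo_plan(f_pll_hz: float, multiplier_stages: tuple) -> tuple:
    """Return (output frequency in Hz, phase-noise increase in dB) for a multiplier chain.

    The phase-noise increase is the ideal 20*log10(M) scaling, where M is the
    overall multiplication factor; added noise of the multipliers is ignored.
    """
    m = math.prod(multiplier_stages)
    return f_pll_hz * m, 20.0 * math.log10(m)

# Example from the text: 13 GHz PLL, two cascaded triplers (M = 9).
f_lo_hz, pn_increase_db = lo_plan(13e9, (3, 3))
print(f"LO = {f_lo_hz / 1e9:.0f} GHz, phase-noise scaling = {pn_increase_db:.2f} dB")
```

For M = 9 the ideal scaling is about 19.1 dB, which is the penalty to be weighed against the phase-noise benefits of running the PLL and distribution network at the lower frequency.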
In short, the LO generation scheme in Fig. 25(a) offers several advantages. Operation at 1/M th of the LO allows (1) the distribution network to be scalable to large arrays, and (2) more variety of synthesizer architectures including all-digital or fractional-N PLLs to be considered.

C. THz ANTENNA DESIGN
Antenna design beyond 100 GHz, while covering a fractional bandwidth of 30%, is quite challenging. An on-chip antenna [16], [98]-[100] greatly simplifies the antenna-transceiver interface, as it is integrated alongside the rest of the system. This means the antenna and the feedlines can be co-designed and co-optimized with the front-end modules. However, due to the limited elevation of the metal stack, the on-chip antenna suffers from poor antenna gain and efficiency. For instance, [101] designed an on-chip slot-folded dipole antenna (SFDA) with a coplanar-waveguide (CPW) feed line at W-band (Fig. 26(a)). To improve the antenna efficiency, a patterned deep trench mesh with a depth of 7-10 μm was embedded in the substrate underneath the antenna. In spite of using the deep trench lattice, this SFDA exhibited an efficiency of 16% and a peak antenna gain of −4 dBi (Fig. 26(b)).
On the other hand, [102] implemented an aperture-stacked patch (ASP) antenna on a printed circuit board (PCB) for the same frequency band. The simulated average gain was 6.3 dBi across 75 to 100 GHz, while the antenna efficiency varied from 88% to 93% [102].
The above two examples provide a clear snapshot about performance achievable by off-chip and on-chip antennas at W-band frequency range. A performance comparison between these two structures favors the off-chip antenna over its on-chip counterpart. It is, however, noteworthy that as the frequency is increased beyond 100 GHz, the antenna interface poses far greater challenge than at lower frequencies. More precisely, the transition VIA structure used by [102] to realize the interface reaches its performance limit at higher frequencies. The stringent physical requirement of the VIA structure imposed by limitation of the PCB fabrication at beyond 100 GHz undermines its use. We have extended the design of a PCB antenna to above 100 GHz frequency range [103]. This design was composed of two rectangular stacked patches fed by a rectangular slot coupled to a stripline feed. The 3-dimensional and the side views of the antenna structure are shown in Fig. 27.
In contrast to low-frequency board antennas where the via locations do not impact the performance, in this design, the via arrangement should be optimized to minimize the excitation of unwanted surface waves. The substrate layer material in this flexible printed circuit (FPC) technology has an r of 2.6, which reduces the loss associated with the dielectric material compared to on-chip counterparts. By having the larger antenna at the top, the fringing fields of both antennas will have well-defined connections to the ground layer, thereby increasing the gain and bandwidth of the radiation. The simulated antenna achieves a −10 dB bandwidth of 44.9 GHz with an average realized gain of 5.7 dBi and average efficiency of 73.9% across the bandwidth [103].
Alternative approaches, such as a copper-pillar or direct waveguide interface to the antenna structure, would be more amenable to higher frequencies. Recently, radio-on-glass technology has shown very promising performance at D-band [104].

V. CONCLUSION
This paper presented an overview of the challenges behind the design and implementation of THz integrated systems and circuits. Several key performance parameters, including bandwidth, communication range, link reliability and throughput, and modulation and spectral efficiency, were outlined. At the system level, we discussed the multi-antenna architecture as an inevitable choice to meet the performance requirements of a high-speed wireless link. Starting with conventional LO- and RF-phase-shifting phased-array schemes, we briefly discussed MIMO transceivers incorporating digital and hybrid beamforming. Next, we argued that modern transceiver architectures are fundamentally incapable of addressing the unresolved challenges of achieving 50+ Gbps data rates. Delegating (de-)modulation to the digital back-end, as is commonly done, requires high-resolution and high-speed data converters that are impossible to realize in silicon technologies. In addition, methods such as channel bonding often lead to an unacceptable amount of power dissipation. We made an argument in favor of novel transmitter and receiver architectures incorporating direct modulation and direct demodulation in the RF domain, applicable to beyond-5G communications. Three examples were presented, namely an RF-8PSK transmitter, an RF-8PSK receiver, and an RF-16QAM transmitter. At the circuit level, we briefly studied THz amplifiers, LO generation and distribution networks, and antenna designs, and made several important observations and offered design guidelines.

ACKNOWLEDGMENT
The author would like to thank all former and current students in the Nanoscale Communication Integrated Circuits (NCIC) Labs, including Peyman Nazari, Zheng Wang, Huan Wang, Hossein Mohammadnezhad, Zhiming Chen, Chung-Cheng Wang, and Zisong Wang, for their outstanding contributions to many topics of this work. The author would also like to thank Dr. Hamidreza Aghasi and Hedayatullah Maktoomi for their collaboration on the off-chip antenna design in [103].