Voltage Reference With Corner-Aware Replica Selection/Merging for 1.4-mV Accuracy in Harvested Systems Down to 3.9 pW, 0.2 V

This paper introduces a voltage reference able to operate over a wide supply voltage range from 1.8 V down to 0.2 V at pW-level power. To mitigate the effect of global variations (e.g., die-to-die, wafer-to-wafer), the proposed NMOS-only architecture introduces process sensor-driven selection/merging of circuit replicas at boot (or run) time. Since the circuit replicas are optimized for different process corners, their selection or merging fundamentally relaxes the traditionally conflicting design tradeoffs that affect the overall voltage accuracy in deep sub-threshold, without requiring any testing-time trimming or non-volatile memory process option, as suits low-cost applications. Measurements of a 180-nm test chip across 45 dice from different corner wafers demonstrate reliable operation down to 0.2 V with 3.9-pW power consumption at room temperature. The proposed process sensor-driven replica selection is shown to enable 1.6% V REF process sensitivity (i.e., σ/µ), 34.9-µV/°C (819-ppm/°C) mean temperature coefficient, and 60.7-µV/V (0.14-%/V) mean line sensitivity across process corners. The resulting 1.4-mV overall absolute accuracy of the reference voltage across dice and corner wafers (1-σ), voltage fluctuations (0.3 V) and temperature deviation (20°C) is improved by 1.9× compared to the case without replica selection, and by 3-15.4× compared to prior references with sub-nW consumption.


I. INTRODUCTION
Integrated systems relying on harvesting as the sole energy source exhibit lower cost, smaller form factor and extended lifetime compared to battery-powered counterparts [1]. At the same time, their aggressive miniaturization tightly constrains system power requirements [2], [3], [4], [5], [6], [7], [8], [9], [10]. As summarized in Fig. 1(a), energy harvesting sources typically provide a power density in the range of nW/mm² to tens of nW/mm² [1], except for a few particularly favorable sources requiring specific use cases (e.g., high temperature in industrial machines, direct sunlight). The resulting system power floor allowed by such mm-scale harvesters under fluctuating environmental conditions is in the nW range. Similarly, the harvester voltage can be rather limited at the lower end of the power extracted from the environment, e.g., below 0.3 V for solar harvesting in dim light conditions [7].
Decreasing the minimum system power P min and the minimum operating voltage V min therefore becomes crucial in harvested systems. Indeed, from Fig. 1(b), reducing P min and V min relaxes supply regulation requirements in conventional harvested systems with intermediate DC-DC conversion between the harvester and the system supply, while also extending the range of environmental conditions under which the system can operate without interruption. From Fig. 1(b), the benefits of P min and V min reductions are even more pronounced in directly-harvested systems with no intermediate DC-DC conversion, whose elimination directly reduces the system power floor for prolonged operation, the area cost and the design/integration effort. In such systems, low-voltage references target voltages in the 100-mV range and below, enabling several applications described in prior art [11], [12], [13], [14], [15], [16] and in datasheets of commercial chips generating reference voltages down to 0 V [17].
The associate editor coordinating the review of this manuscript and approving it for publication was Agustin Leobardo Herrera-May.
In this work, a global variation-aware NMOS-only voltage reference with ultra-low V min and P min is introduced to support purely-harvested operation [35], while substantially relaxing or suppressing the extra area and power penalty of voltage regulation and DC-DC conversion (e.g., 0.32 mm² in [36]). The proposed architecture is uniquely based on voltage reference replicas optimized at different process corners, instead of conventionally adopting a fixed design as a rigid compromise across process corners and other conflicting design targets. As a further contribution, an on-chip oscillator-based sensor identifies the process corner even for a single type of transistor (i.e., NMOS, instead of being limited to PMOS/NMOS ratio sensing [37], [38]), yet without any additional reference circuitry (e.g., current, time), to suit low-cost applications [8]. The process sensor determines the selection of circuit replicas for extreme corners, as well as the best merging of replicas for intermediate corners by short-circuiting those optimized for the adjoining corners. Such a corner-aware approach simultaneously improves the accuracy against PVT variations, as quantified by the process sensitivity, as well as the mean value of the line sensitivity and of the temperature coefficient. The proposed solution also tightens the impact of global process variations, as quantified by the standard deviation of line sensitivity and temperature coefficient.
Replica selection/merging involves a post-fabrication self-calibration task to be performed at boot (or run) time. This requires a dedicated time slot, as well as extra supply voltage and power for proper operation of the process sensor compared to normal operation. Self-calibration needs to be carried out only once in the chip lifetime (e.g., at the first chip boot) if the three bits configuring replica selection/merging are stored. This suppresses the related testing time and effort, reducing testing cost.
The latter is traded off with silicon area in view of the additional voltage reference replicas, and the related tradeoff is advantageous for low-cost technologies as required in the targeted applications.
Measurements of a 180-nm test chip across 45 dice from different corner wafers show operation down to 0.2 V and 3.9 pW, while achieving a competitive process sensitivity. Compared to a conventional single-replica design, replica selection/merging reduces process sensitivity by 2.5×, and improves the overall accuracy by 1.9×.
The paper is organized as follows. Section II details the proposed reference architecture. Measurement results are discussed in Section III. Section IV presents the comparison with prior art. Conclusions are finally drawn in Section V.

II. PROPOSED VOLTAGE REFERENCE: ARCHITECTURE AND DESIGN GUIDELINES
The proposed reference architecture consists of multiple voltage reference replicas respectively optimized at different corners (i.e., SS, TT and FF), and an on-chip process sensor as in Fig. 2(a). The latter selects the most suitable replica(s) for the process corner that a given die lies at, while also merging the adjoining replicas if the die belongs to an intermediate corner, as in Fig. 2(b). In detail, intermediate design tradeoffs between farther corners (e.g., SS or FF) and the TT corner are covered by short-circuiting the outputs of the two relevant replicas optimized for the two closest corners, thus averaging their output voltages as in Fig. 2(b). As a result, the selection/merging approach covers five process corners with three circuit replicas, narrowing the process spread and mitigating design conflicts across corners for improved and more consistent performance under process variations, as illustrated in Fig. 2(c).
The proposed architecture and the design approach are exemplified with the NMOS-only 8-transistor reference replicas in Fig. 3. Each replica consists of a core reference generator M1-M4, a body-bias generator M5-M6, and a deep n-well (DNW) replica bias M7-M8. The output reference voltage V REF is set by the strength ratio of M4, having zero gate-source voltage, and the diode-connected stacked transistors M1-M3, whose p-well is biased by the voltage V B provided by M5-M6.
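The selection/merging scheme of Fig. 2(b) can be illustrated with a minimal sketch. It assumes one enable bit per replica (the source only states that three configuration bits are stored, not how they are encoded) and an idealized merge in which shorted replica outputs settle at their plain average. The names `Corner`, `ENABLE_BITS` and `merged_output`, as well as the example voltages, are hypothetical.

```python
from enum import Enum


class Corner(Enum):
    """Five detected corners covered by three replicas (SS, TT, FF)."""
    SS = "SS"
    SS_TT = "SS-TT"
    TT = "TT"
    TT_FF = "TT-FF"
    FF = "FF"


# Hypothetical encoding: one enable bit per replica, ordered (SS, TT, FF).
# Extreme and typical corners select one replica; intermediate corners
# enable the two adjoining replicas, whose outputs are shorted together.
ENABLE_BITS = {
    Corner.SS: (1, 0, 0),
    Corner.SS_TT: (1, 1, 0),
    Corner.TT: (0, 1, 0),
    Corner.TT_FF: (0, 1, 1),
    Corner.FF: (0, 0, 1),
}


def merged_output(replica_outputs_mv, enable_bits):
    """Idealized V_REF in mV: plain average of the enabled replica outputs.

    Assumption: shorted replicas contribute equally to the merged node.
    """
    selected = [v for v, en in zip(replica_outputs_mv, enable_bits) if en]
    if not selected:
        raise ValueError("at least one replica must be enabled")
    return sum(selected) / len(selected)


# Illustrative (not measured) stand-alone replica outputs in mV.
example_outputs_mv = (41.0, 43.0, 45.0)
v_ss_tt = merged_output(example_outputs_mv, ENABLE_BITS[Corner.SS_TT])
v_ff = merged_output(example_outputs_mv, ENABLE_BITS[Corner.FF])
```

With these illustrative numbers, a die detected at the SS-TT intermediate corner would see the average of the SS and TT replica outputs, whereas an FF die would use the FF replica alone.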
In Fig. 3, the replica bias M7-M8 drives the DNW of M1-M3 through the voltage V DNW ≈ V B , thus imposing a zero voltage across the p-well/DNW parasitic diode D p−well−DNW . Quantitatively, Fig. 4(a) shows that the simulated voltage drop across the D p−well−DNW junction is consistently below 100 µV across temperatures, and is typically much lower. Hence, the junction current I p−well is pushed to near zero, and is several orders of magnitude smaller than the transistor sub-threshold current from Fig. 4(b). Such dramatic reduction in I p−well makes the loading effect of the junction on the deep n-well replica bias negligible. This suppresses the loading effect of the relatively large deep n-well leakage current on V REF and its exponential temperature dependence, simplifying temperature compensation as required to keep power in the pW range [34]. From Fig. 3, the M5-M6 and M7-M8 circuits mimic the structure of the core reference generator, tracking its process and environmental variations to preserve the above properties across process corners, voltages and temperatures. As only difference among these blocks, diode-connected transistors M1-M3 are stacked to increase V REF to the targeted range (more stacked transistors can be employed to further increase it).
From a circuit analysis viewpoint, the stacked transistors M1-M3 of the core reference generator in Fig. 3 can be lumped into an equivalent transistor M1-3. Considering transistor operation in the sub-threshold region, V REF can thus be expressed as in (1), where W is the channel width, L is the channel length, n is the NMOS slope factor, V TH0 is the zero-bias threshold voltage, λ BB is the V TH body coefficient, and N stack is the number of stacked diode-connected MOSFETs. In (1), the term λ BB,M1-3 V B quantifies the body biasing provided by M5-M6 to M1-3 through V B. Given transistor operation in the deep sub-threshold region, body biasing of M1-M3 fed forward through V B improves the stability of V REF against supply and temperature fluctuations, as observed in [29] and [34]. Indeed, increasing V DD induces an increase in V REF that this feed-forward body biasing is meant to counteract.
As in Fig. 2(a), three circuit replicas were differently sized and optimized at the SS, TT and FF corners, respectively. More specifically, only the sizing of the lower transistors M1-M3, M5 and M7 was differentiated across the three optimizations, while keeping the sizes of the upper transistors M4, M6 and M8 unchanged (see details in Table 1). A single (regular) threshold design approach was adopted to avoid the additional process spread coming from mistracking of different transistor flavors and types, in contrast to the dual-threshold design in [34].
The single-threshold design strategy motivates the adoption of a different core reference in Fig. 3 with respect to [34], introducing a zero-V GS transistor instead of a reverse-gate-biased device for the upper transistors M4, M6 and M8. Indeed, this avoids the need to significantly upsize the upper transistors to balance their strength against the lower transistors M1-M3, M5 and M7 (i.e., to provide the necessary current). In turn, this mitigates the area cost of replicas and the effect of layout-dependent variations coming from heavily skewed strength ratios between upper and lower transistors. The adopted sizing in Table 1 sets a PTAT behavior in V B to cancel out the effect of temperature on V REF to a first order, as shown in Figs. 5(a)-(b). The channel length of M4 was set to the maximum value allowed by the technology to minimize the DIBL effect, and hence the sensitivity to V DD fluctuations. A 1.8-pF MIM capacitor with 30 µm × 30 µm area was added at the V REF node to improve the power supply rejection ratio (PSRR, see C1 in Fig. 3), and is the constant capacitive load adopted in the following. However, given the measurement results reported in Section III.B and considering the additional parasitic capacitance associated with the bare-wire pad and the probe, the overall capacitive load was estimated to be 3.8× larger than C1.
Fig. 6 shows the architecture of the proposed process sensor, which comprises a slow oscillator (SO) whose period is used to count the cycles of a fast oscillator (FO) to determine the process corner. Strictly speaking, the process sensor is self-sufficient as it does not need any additional accurate time basis or trimming. However, in the common scenario where a system clock is available, the clock replaces SO and hence suppresses its area contribution. From Fig. 6, a 64× frequency divider is introduced at the output of SO to generate a counting window of 64 periods of SO. The resulting increase in the count difference across process corners improves the resolution at the cost of proportionally longer calibration time. Counter handling is carried out by a finite-state machine (FSM), based on the SO frequency. The output of the counter in Fig. 6 quantifies the frequency ratio f H /f L of FO with respect to SO, and feeds a decision logic generating the three configuration bits representing the detected corner and the replica selection/merging.
In the process sensor, transistor sizes in SO and FO were purposely differentiated as in Fig. 7(a) to induce a threshold voltage difference ΔV TH that monotonically depends on the corner. More quantitatively, the threshold voltage V TH of SO is higher than that of FO by 12.5 mV, 33 mV and 53.5 mV at SS, TT and FF corners, respectively (see Fig. 7(a)). As a result, the frequency ratio f H /f L (i.e., the count in Fig. 6) ultimately quantifies the corner the die lies in, as shown in Fig. 7(b).
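The counting and decision steps of Fig. 6 can be sketched as follows. This is a minimal behavioral model, not the implemented logic: the bin boundaries and the example FO frequency are hypothetical, and the bins are assumed to be ordered by ascending count from SS to FF, in line with the monotonic corner dependence stated above. Only the 64-period window and the 790-kHz SO frequency (reported in Section III) come from the text.

```python
import math
from bisect import bisect_right

CORNER_LABELS = ("SS", "SS-TT", "TT", "TT-FF", "FF")


def fo_count(f_h_hz, f_l_hz, window_periods=64):
    """Whole FO cycles counted within `window_periods` periods of SO."""
    if f_h_hz <= 0 or f_l_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.floor(window_periods * f_h_hz / f_l_hz)


def classify_corner(count, boundaries):
    """Map a count onto one of five bins split by four ascending boundaries.

    Assumption: the count grows monotonically from SS to FF.
    """
    if len(boundaries) != 4 or list(boundaries) != sorted(boundaries):
        raise ValueError("expected four ascending boundaries")
    return CORNER_LABELS[bisect_right(boundaries, count)]


# Hypothetical FO frequency and bin boundaries, for illustration only.
count = fo_count(f_h_hz=7.9e6, f_l_hz=790e3)
corner = classify_corner(count, boundaries=(500, 600, 700, 800))
```

Lengthening the window scales every count, and hence the spacing between bins, by the same factor, which is the resolution versus calibration-time tradeoff mentioned for the 64× divider.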
To preserve the NMOS-only nature of the entire architecture, ring oscillators were implemented using ratioed logic as in Fig. 8. From this figure, the SO consists of 41 stages comprising a pull-down transistor M9 and two stacked pull-up transistors M10-M11. Conversely, FO adopts a 7-stage structure where each stage comprises three stacked pull-down transistors M12-M14 and three stacked pull-up transistors M15-M17. The relative size and the number of stacked transistors were chosen to achieve similar temperature dependence of f H and f L across process corners, as required to make the count and hence corner identification robust against environmental changes. The absolute size of FO was chosen to be small enough to have an appreciable frequency difference compared to SO across corners (see Fig. 7(a)), while being large enough to keep mismatch adequately small. More precisely, among the five different size choices of FO implemented in the test chip and reported in Table 2, the design x was selected to achieve SS/TT/FF process discrimination with 2-standard deviation confidence level or better, as discussed in Section III.A.

III. MEASUREMENT RESULTS
The 180-nm test chip in Fig. 9 comprises the process sensor(s) and the reference replicas, occupying an overall area of 59,000 µm², which is reduced to 19,000 µm² in the common case where SO is replaced by the system clock. In Fig. 9, FO is sized as per the design x in Table 2.
Fig. 10 shows the measured frequency f H of FO at V DD = 1.8 V and 25 °C for the different sizes reported in Table 2 across the three different corner wafers. As expected from Fig. 7(a), the sensitivity of f H to the corner is higher at lower FO sizes. On the other hand, smaller FO size implies an increase in mismatch as per Pelgrom's law, as discussed in Section II. Figs. 11(a)-(b) show the resulting frequency ratio f H /f L histogram under the smallest considered FO size (i.e., the design x in Table 2) across 45 dice from corner wafers, and under different operating conditions (i.e., V DD in the 1.6-1.8 V range, and 0-70 °C temperature range). Considering that f H /f L monotonically increases at higher temperatures regardless of the process corner, Figs. 11(a)-(b) compare the statistical distribution of f H /f L at 0 °C and 70 °C for the TT corner against those at 70 °C for the SS corner (i.e., closest to TT across temperatures), and at 0 °C for the FF corner (again closest to TT). In other words, these figures quantify the impact of process corners and mismatch under the worst-case voltage and temperature corners, and hence quantify the worst-case discrimination capability of the process sensor. The process sensor thus identifies the process variation bin with >97.7% confidence level (i.e., >2σ, as shown by the non-overlapping ±2σ ranges) within the considered operating range, as targeted in Section II.
FO draws an average supply current of 70 µA, 85 µA and 105 µA for SS, TT and FF samples, respectively, when operating at 1.8 V and room temperature. Higher current consumption is expectedly observed in the 41-stage SO structure, respectively amounting to 3.8 mA, 4 mA and 7.3 mA at the SS, TT and FF corners. Nevertheless, such power consumption does not affect the system consumption during its normal operation, since the process sensor is activated only once and then shut down indefinitely. From simulations, the start-up time of the process sensor is ∼80 µs at the TT corner, V DD = 1.8 V and 25 °C, thus corresponding to 64 periods of SO having a frequency of 790 kHz.
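As a quick consistency check of the figures above, the counting window can be recomputed from the 64-period window and the 790-kHz SO frequency. The helper name `counting_window_s` is illustrative.

```python
def counting_window_s(so_freq_hz, periods=64):
    """Duration of a counting window spanning `periods` SO periods."""
    if so_freq_hz <= 0:
        raise ValueError("SO frequency must be positive")
    return periods / so_freq_hz


window_s = counting_window_s(790e3)
window_us = window_s * 1e6  # about 81 us, in line with the ~80 us quoted
```

The result, about 81 µs, matches the ∼80-µs start-up time quoted from simulations.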

B. CONVENTIONAL SINGLE-REFERENCE CIRCUIT ON TT WAFER
In this subsection, the conventional single-reference circuit with replica optimized at the TT corner is characterized across dice coming from the TT wafer. The V REF of a typical die sample measured across voltages (from 0.2 V to 1.8 V) and temperatures (from 0 °C to 70 °C) is plotted in Figs. 12(a)-(b). From Fig. 12(a), the single-reference circuit shows correct operation at V DD as low as V min = 0.2 V regardless of the operating temperature. At room temperature, the resulting V REF is ∼43 mV and the absolute line sensitivity is 119 µV/V (0.28 %/V in relative terms), as evaluated across the voltage range from V min up to 1.8 V. In addition, the line sensitivity exhibits a nearly negligible temperature dependence, staying in the 113-119 µV/V range (i.e., 0.26-0.28 %/V) across the temperature range. From Fig. 12(b), the absolute temperature coefficient is 43.7 µV/°C (∼1,000 ppm/°C in relative terms) at V DD = V min.
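The absolute and relative figures quoted above are related through V REF. A short sketch of the conversion follows, assuming that the relative metrics are simply the absolute ones normalized to the room-temperature V REF of about 43 mV (the function names are illustrative).

```python
def relative_ls_percent_per_v(ls_uv_per_v, vref_mv):
    """Relative line sensitivity in %/V from absolute LS in uV/V."""
    return 100.0 * (ls_uv_per_v * 1e-6) / (vref_mv * 1e-3)


def relative_tc_ppm_per_c(tc_uv_per_c, vref_mv):
    """Relative temperature coefficient in ppm/degC from absolute TC in uV/degC."""
    return 1e6 * (tc_uv_per_c * 1e-6) / (vref_mv * 1e-3)


VREF_MV = 43.0  # approximate room-temperature V_REF of the typical die

ls_rel = relative_ls_percent_per_v(119.0, VREF_MV)   # about 0.28 %/V
tc_rel = relative_tc_ppm_per_c(43.7, VREF_MV)        # about 1,016 ppm/degC
```

Because V REF is only a few tens of mV, a small absolute sensitivity still translates into a comparatively large relative one, which is why the paper later argues for the absolute metrics.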
The measured power consumption across voltages and temperatures is shown in Figs. 13(a)-(b). The overall reference consumes only 3.2 pW at V min and 25 °C. From Fig. 13(a), the power consumption has a nearly linear dependence on V DD, due to the nearly voltage-independent current drawn from the supply. More quantitatively, the power increases by 10.4× when increasing V DD from V min to 1.8 V. Conversely, the power consumption expectedly has an exponential-like dependence on the temperature, due to the transistor operation in the deep sub-threshold region, as shown in Fig. 13(b). More quantitatively, the power increases by a factor of 11.9× when the temperature is increased from room temperature to the maximum one.
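The "nearly linear" power dependence on V DD can be checked with simple arithmetic. The sketch below assumes power equals V DD times the supply current and derives the implied quantities from the reported 3.2 pW and 10.4× figures; the derived current values are not measurements from the source.

```python
P_AT_VMIN_PW = 3.2           # reported power at V_min and 25 degC
V_MIN_V = 0.2
V_MAX_V = 1.8
MEASURED_POWER_RATIO = 10.4  # reported power increase from V_min to 1.8 V

# Supply current at V_min implied by P = V_DD * I (pW / V = pA).
i_at_vmin_pa = P_AT_VMIN_PW / V_MIN_V

# Power ratio expected if the current were perfectly voltage independent.
ideal_ratio = V_MAX_V / V_MIN_V

# Implied growth of the supply current over the full V_DD range.
current_growth = MEASURED_POWER_RATIO / ideal_ratio
```

Under these assumptions, the supply current at V min is about 16 pA and grows by roughly 16% over a 9× increase in V DD, consistent with the description of a nearly voltage-independent current.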
The dynamic characterization of the single-reference circuit is reported in Figs. 14-16. The measured PSRR is plotted versus frequency in Fig. 14 at room temperature. It is worth pointing out that, to minimize the perturbative effect of the testing setup, the die was not bonded or packaged, and was tested directly on the probe station. The same figure also reports post-layout simulation results including an extra output capacitor with a value of 0, 5 and 10 pF to mimic the additional contribution of the bare-wire pad and the probe during testing. Both measured and simulated PSRR values at low frequencies are ∼−85 dB and further improve at higher frequencies (e.g., the measured PSRR reaches ∼−103 dB at 100 kHz). Simulation results show that a larger output capacitor provides better PSRR at high frequencies, as in [15]. Moreover, from the comparison between measurements and simulations, the additional parasitic capacitance due to the testing setup can be estimated to be 5 pF.
The settling time to reach 95% of the steady-state value of V REF at start-up is 2.68 s, as expected from the limited consumption and hence the small current flowing through M1-M4 in Fig. 3, which ultimately charges the large output capacitance.
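The 5-pF parasitic estimate can be cross-checked against the 3.8× overall load ratio stated in Section II, assuming the overall load is simply C1 in parallel with the estimated test-setup capacitance.

```python
C1_PF = 1.8            # on-chip MIM capacitor at the V_REF node (Section II)
C_PARASITIC_PF = 5.0   # test-setup capacitance estimated from the PSRR fit

# Overall capacitive load relative to C1 (capacitors in parallel add up).
load_ratio = (C1_PF + C_PARASITIC_PF) / C1_PF
```

The ratio evaluates to about 3.8, in agreement with the estimate given for the overall capacitive load.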
Regarding the individual effect of mismatch studied in this subsection, Figs. 17(a)-(d) illustrate the statistical distribution of the single-reference circuit optimized at the TT corner (i.e., no replica selection/merging), as measured across 15 dice from the TT wafer. At V min and 25 °C, the mean value of V REF and its standard deviation are respectively 42.7 mV and 0.6 mV, leading to a mismatch-only process sensitivity σ/µ of about 1.4% based on these values (see Fig. 17(a)). From Fig. 17(b), the absolute TC at V min has a mean value of 33.2 µV/°C (778 ppm/°C in relative terms). The standard deviation of TC is 4.1 µV/°C (96 ppm/°C), corresponding to a variability of 12.3%. From Fig. 17(c), the power consumption exhibits a mean value of 3.2 pW and a standard deviation of 0.1 pW. This leads to a power variability of only 3%, showing highly consistent consumption across local variations. Finally, Fig. 17(d) shows the measurement histogram of the absolute line sensitivity. The mean and standard deviation are respectively 72.3 µV/V and 34.7 µV/V (0.17 %/V and 0.08 %/V in relative terms), which results in a fairly pronounced variability of 48%.

C. PROPOSED REFERENCE WITH AND WITHOUT REPLICA SELECTION/MERGING INCLUDING PROCESS CORNERS
This subsection reports statistical measurements of the proposed reference across corner wafers, comparing the proposed reference in Fig. 2(a) based on replica selection/merging to the conventional single-reference circuit optimized at the TT corner (characterized in the previous subsection).
The comparative analysis is depicted in Figs. 18(a)-(h). Fig. 18(a) shows the overall process sensitivity across all corners of the conventional single-reference circuit at V min and 25 °C. As expected, the proposed corner-aware replica selection narrows the process spread of V REF across corners, as illustrated in Fig. 18(b). As a result, the process sensitivity PS is 1.6% and is hence 2.5× better than the conventional single-reference case, based on the measured µ V REF = 42.6 mV and σ V REF = 0.68 mV.
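The process sensitivity figure follows directly from its definition σ/µ. The sketch below recomputes it from the reported mean and standard deviation; the single-reference value is only the one implied by the stated 2.5× ratio, not a number taken from the measurements.

```python
def process_sensitivity_percent(sigma_mv, mean_mv):
    """Process sensitivity sigma/mu of V_REF, expressed in percent."""
    if mean_mv <= 0:
        raise ValueError("mean V_REF must be positive")
    return 100.0 * sigma_mv / mean_mv


ps_proposed = process_sensitivity_percent(sigma_mv=0.68, mean_mv=42.6)

# Implied by the reported 2.5x improvement (an inference, not measured data).
IMPROVEMENT = 2.5
ps_single_implied = ps_proposed * IMPROVEMENT
```

This gives about 1.6% with replica selection and, by implication, roughly 4% for the conventional single-reference case.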
In addition to reducing the process sensitivity of V REF, the proposed replica selection also improves the reference performance in terms of TC and LS across process corners. Indeed, the single-replica approach has a mean absolute TC of 41.7 µV/°C (i.e., 968 ppm/°C) at V min over the 0-70 °C temperature range, while its LS is evaluated at 25 °C over the 0.2-1.8 V range. The resulting process sensitivity is 2.1× better than the conventional single-reference case.
Overall, the proposed architecture enables improved as well as more consistent performance (i.e., narrowed spread) across corner wafers compared to conventional fixed design.

IV. COMPARISON WITH THE STATE OF THE ART
The proposed architecture is compared to state-of-the-art sub-nW voltage references in Table 3. For fair comparison, the table includes only measured data achieved without trimming. The tradeoff between power and V min for different reference designs is illustrated in Fig. 21. From this figure and Table 3, the 3.9-pW power consumption of the proposed reference is in line with [15] and [34], 3.1-49.2× lower than other prior art references in the same technology, and 9.3× higher than [32] in a different technology. The proposed reference exhibits the lowest V min of 0.2 V, which is 2-7× lower than [15], [24], [25], [29], [30], [31], [32], and [33], and 1.25× lower than the prior best [34]. Such a particularly favorable and unique combination of pW power and ultra-low V min makes the proposed reference well suited for low-cost purely-harvested sensor nodes.
Regarding the overall reference performance, Table 3 reports the overall absolute and relative accuracy of V REF, as evaluated from process variations at 1-σ V REF across all 45 dice and corner wafers, 0.3-V harvested voltage fluctuation (e.g., a solar cell from low to intense light), and temperature deviation of 20 °C. It is worth pointing out that the absolute accuracy is a fairer and more relevant metric in applications targeting ultra-low V min and power. Indeed, their baseline voltage is very different from other classes of references (e.g., bandgap), and the relative inaccuracy in the transistor current is actually set by the absolute voltage inaccuracy in the sub-threshold region [34] (rather than the relative voltage inaccuracy). Accordingly, the main results below refer to the absolute accuracy, and are complemented with relative accuracy data only for completeness.
Regarding the process sensitivity of V REF, Table 3 shows that the proposed circuit achieves the lowest absolute standard deviation σ V REF of 0.68 mV including global variations, which is 5.4-27.6× better than prior art demonstrations reporting measurements across corner wafers [25], [29]. Compared to the latter, the achieved relative PS of 1.6% is better than [29] and worse than [25], while it is expectedly higher than other works ignoring global variations [15], [24], [30], [31], [34]. The absolute TC of 34.9 µV/°C is 2.4× lower than [29] and [32], and 1.2-3.2× higher than other work reported in Table 3. The less relevant TC in relative terms is expectedly worse than other prior art references, owing to the low targeted V REF and the resulting deep sub-threshold transistor operation. From the same table, the absolute LS of 60.7 µV/V is comparable to [15], and 2.3-64× better than other prior art designs. Even considering the LS in relative terms, the achieved value of 0.14 %/V is still competitive, being better than [24], [29], [30], [31], [32], [33], and [34], and only worse than [25] and [15].
When simultaneously considering all PVT variations across corner wafers, the resulting absolute accuracy of V REF is 1.4 mV, which represents a 3-15.4× improvement compared to references whose evaluation includes global variations [25], [29], and a 1.9× improvement compared to the single-replica version of the proposed design. Owing to the ultra-low voltage operation of the proposed circuit, the overall relative accuracy of V REF is 3.3% and expectedly worse than [25] and [29], which nonetheless consume 29.2-49.2× higher power and have a significantly higher V min of ∼1 V.
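The 1.4-mV and 3.3% figures can be reproduced from the individual contributions quoted above. The sketch below assumes that the process, voltage and temperature terms add linearly; the source does not state the combination rule, but a linear sum of the reported values matches the reported totals.

```python
SIGMA_VREF_MV = 0.68        # 1-sigma process spread across dice and corners
LS_MV_PER_V = 60.7e-3       # mean line sensitivity, 60.7 uV/V
TC_MV_PER_C = 34.9e-3       # mean temperature coefficient, 34.9 uV/degC
MEAN_VREF_MV = 42.6         # mean V_REF with replica selection

DELTA_V_V = 0.3             # harvested voltage fluctuation
DELTA_T_C = 20.0            # temperature deviation

process_mv = SIGMA_VREF_MV
voltage_mv = LS_MV_PER_V * DELTA_V_V
temperature_mv = TC_MV_PER_C * DELTA_T_C

# Assumed combination rule: plain sum of the three contributions.
absolute_accuracy_mv = process_mv + voltage_mv + temperature_mv
relative_accuracy_percent = 100.0 * absolute_accuracy_mv / MEAN_VREF_MV
```

Under this assumption, process spread and temperature contribute roughly 0.7 mV each, while the 0.3-V supply fluctuation adds only about 0.02 mV, so the line sensitivity has a minor weight in the overall budget.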

V. CONCLUSION
In this paper, an NMOS-only voltage reference able to operate at pW-range power and across a wide supply voltage range from 1.8 V down to 0.2 V has been introduced for purely-harvested systems, including directly-harvested systems without intermediate DC-DC conversion. Instead of adopting a fixed design as a rigid compromise across conflicting sensitivity to global variations, voltage and temperature, the proposed architecture employs circuit replicas optimized at different process corners.
An on-chip oscillator-based process sensor determines selection or merging of circuit replicas, tightening the process spread across corners. To this aim, a post-fabrication self-calibration task is carried out at boot (or run) time, which requires a dedicated time slot, and extra supply voltage and power for the process sensor. Self-calibration needs to be performed only once in the chip lifecycle if the resulting configuration bits are stored on chip, while avoiding any testing time and cost. The tradeoff with the increased silicon area due to voltage reference replicas is advantageous in low-cost technologies as required in the targeted applications.
The proposed concept has been demonstrated through measurements of a 180-nm test chip across 45 dice from different corner wafers from SS to FF. Measurement results show reliable operation of the reference circuit down to 0.2 V at 3.9-pW power consumption. The process sensitivity is reduced by 2.5× compared to the conventional case without replica selection/merging, and is competitive compared to prior art designs evaluated across corner wafers. In addition, the relaxed design tradeoffs across process corners allow a competitive mean temperature coefficient of 34.9 µV/°C (819 ppm/°C) and mean line sensitivity of 60.7 µV/V (0.14 %/V). The resulting 1.4-mV absolute V REF accuracy against PVT variations including wafer-to-wafer and die-to-die process sensitivity is 1.9× better than the case without replica selection/merging, and 3-15.4× better than prior art.
IC Group at NUS for fruitful discussion and feedback. (Luigi Fassio and Longyang Lin contributed equally to this work.)