Yunshuang Yuan - IEEE Xplore Author Profile

Showing 1-25 of 69 results

Hardware atomic instructions are the building blocks of the synchronization algorithms. Historically, to guarantee atomicity and consistency, they were implemented using memory fences, committing older memory instructions, and draining the store buffer before initiating the execution of atomics. Unfortunately, the use of such memory fences entails huge performance penalties as it implies execution...
Many applications need to perform operations that involve reading a value from memory, modifying it, and then writing it back. Multiple architectures provide hardware support for these operations via read-modify-write (RMW) instructions. The primary benefit is that the read can request a cacheline with write permissions, reducing coherence protocol overhead since the write will find the cacheline ...
Speculative execution and the emergence of Spectre attacks have forced architects to rethink how microprocessors are designed. Several approaches aim to close this security vulnerability while trying to minimize performance degradation, often involving complex and sophisticated mechanisms. These strategies typically entail substantial modifications to the processor core and memory hierarchy, which...
As energy consumption becomes a primary concern for deep learning acceleration, the need to optimize not only data movement but also compute is becoming important. The basic element of compute, the Multiply-Accumulate (MAC) unit, performs the operation X · Y+Z, comprises the compute cores of systolic arrays such as Google’s TPU or Nvidia’s Tensor Cores, and it is found in practically every deep ne...
Memory integrity protection is intended for secure execution, and it is typically associated with programs running on a single core. However, with the emergence of multi-processor systems-on-chip and chiplets, extending memory integrity protection to cache-coherent multiprocessors becomes essential. In this work, we explore for the first time the design space for maintaining coherence in fine-grai...
In a speculative side-channel attack, a secret is improperly accessed and then leaked by passing it to a transmitter instruction. Several proposed defenses effectively close this security hole by either delaying the secret from being loaded or propagated, or by delaying dependent transmitters (e.g., loads) from executing when fed with tainted input derived from an earlier speculative load. This re...
This work uses Dynamic Information Flow Tracking (DIFT) to characterize how memory addresses are made by studying the transformation of data values into memory addresses. We show that in SPEC CPU 2017 benchmarks, a high proportion of values in memory are transformed into memory addresses. The majority of the transformations are done directly without explicit arithmetic instructions. Most of the ad...
The cornerstone for the performance evaluation of computer systems is the benchmark suite. Among the many benchmark suites used in high-performance computing and multicore research, Splash-2 has been instrumental in advancing knowledge for both academia and industry. Published in 1995 and with over 5276 citations and counting, this benchmark suite is still in use to evaluate novel architectural pr...
Although the cache has been a known side-channel for years, it has gained renewed notoriety with the introduction of speculative side-channel attacks such as Spectre, which were able to use caches to not just observe a victim, but to leak secrets. Because the cache continues to be one of the most exploitable side channels, it is often the primary target to safeguard in secure speculative execution...
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs perform...
Speculative side-channel attacks consist of two parts: The speculative instructions that abuse speculative execution to gain illegal access to sensitive data and the side-channel instructions that leak the sensitive data. Typically, the side-channel instructions are assumed to follow the speculative instructions and be dependent on them. Speculative side-channel defenses have taken advantage of th...
Speculative side-channel attacks access sensitive data and use transmitters to leak the data during wrong-path execution. Various defenses have been proposed to prevent such information leakage. However, not all speculatively executed instructions are unsafe: Recent work demonstrates that speculation invariant instructions are independent of speculative control-flow paths and are guaranteed to even...
Over the past three decades, the parallel applications of the Splash-2 benchmark suite have been instrumental in advancing multiprocessor research. Recently, the Splash-3 benchmarks eliminated performance bugs, data races, and improper synchronization that plagued Splash-2 benchmarks after the definition of the C memory model. In this work, we revisit the Splash-3 benchmarks and adapt them for con...
We propose a novel approach for hardware-based strict TSO persistency, called TSOPER. We allow a TSO persistency model to freely coalesce values in the caches, by forming atomic groups of cachelines to be persisted. A group persist is initiated for an atomic group if any of its newly written values are exposed to the outside world. A key difference with prior work is that our architecture is based...
Virtually all processors today employ a store buffer (SB) to hide store latency. However, when the store buffer is full, store latency is exposed to the processor causing pipeline stalls. The default strategies to mitigate these stalls are to issue prefetch for ownership requests when store instructions commit and to continuously increase the store buffer size. While these strategies considerably ...
Various memory consistency model implementations (e.g., x86, SPARC) willfully allow a core to see its own stores while they are in limbo, i.e., executed (and perhaps retired) but not yet inserted in memory order. This is known as store-to-load forwarding and it is a necessity to safeguard the local thread's sequential program semantics while achieving high performance. However, this can lead to co...
Since the introduction of Meltdown and Spectre, the research community has been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need to be general in nature, covering a wide array of s...
Flexible instruction scheduling is essential for performance in out-of-order processors. This is typically achieved by using CAM-based Instruction Queues (IQs) that provide complete flexibility in choosing ready instructions for execution, but at the cost of significant scheduling energy. In this work we seek to reduce the instruction scheduling energy by reducing the depth and width of the IQ. We...
Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic...
Speculative execution, the base on which modern high-performance general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: During speculative execution, the architectural state of the system is kept unmodified, until the speculation can be verified. In the event that a misspeculation occurs, the...
The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or depth of the instruction queue is very costly due to the content-addressable logic needed to wakeup and select instructions out-of-order. This work makes the ob...
Way-predictors have long been used to reduce dynamic cache energy without the performance loss of serial caches. However, they produce variable-latency hits, as incorrect predictions increase load-to-use latency. While the performance impact of these extra cycles has been well-studied, the need to replay subsequent instructions in the pipeline due to the load latency increase has been ignored. In ...
In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are responsible for ensuring: i) correct forwarding of stores to loads and ii) correct ordering among loads (with respect to external stores). The first requirement safeguards the sequential semantics of program execution and applies to both serial and parallel code; the second requirement safeguards the ...
We present a non-speculative solution for a coalescing store buffer in total store order (TSO) consistency. Coalescing violates TSO with respect to both conflicting loads and conflicting stores, if partial state is exposed to the memory system. Proposed solutions for coalescing in TSO resort to speculation-and-rollback or centralized arbitration to guarantee atomicity for the set of stores whose o...
Load reordering is important for performance. It allows a core to continue performing accesses to the memory system even when there are older, in-program-order, unperformed accesses (for example, due to long latency misses). The only known solution to allow such reordering in a strong consistency model such as total store ordering (TSO) has been to reorder speculatively and squash-and-re-execute i...