Zixuan Wang - IEEE Xplore Author Profile

Showing 1-25 of 53 results

Toward a Universal Cryptographic Accelerator

Srinivas Devadas;Daniel Sanchez

Computer
Year: 2025 | Volume: 58, Issue: 1 | Magazine Article
Cyberphysical systems have disseminated devices that can be untrustworthy or compromised. Nevertheless, the privacy and integrity of computation and data can be guaranteed through cryptographic protocols. We address the computational burden posed by cryptography, and argue for a synergistic approach of designing programmable hardware accelerators for cryptography, followed by tailoring cryptograph...

Sparse data structures like hash tables, trees, or compressed tensors are ubiquitous, but operations on these structures are expensive and inefficient on current systems. Prior work has proposed hardware acceleration for these operations, but these techniques have two key shortcomings: they limit the types of data structures they support, and they focus on reads but do not support fine-grained upd...
Solving sparse systems of linear equations is a fundamental primitive in many numeric algorithms. Iterative solvers provide an efficient way of solving large, highly sparse systems. However, iterative solvers are inefficient on existing architectures because they perform computations with (1) poor short-term reuse, which causes frequent off-chip memory traffic; and (2) challenging data dependences...
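To make the iterative approach concrete, here is a minimal Jacobi solver sketch in Python. This is a generic textbook method, not the solver the abstract targets; the matrix and tolerance are illustrative. Note how every iteration sweeps the whole matrix, which is the poor short-term reuse the abstract refers to.

```python
import numpy as np

def jacobi(A, b, tol=1e-8, max_iter=10_000):
    """Iteratively solve A x = b; converges when A is diagonally dominant."""
    D = np.diag(A)               # diagonal of A
    R = A - np.diagflat(D)       # off-diagonal remainder
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        # One sweep: x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii.
        # Each sweep streams every nonzero of A once, so reuse is poor.
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])    # small, diagonally dominant example
b = np.array([1.0, 2.0, 3.0])
print(jacobi(A, b))                # matches np.linalg.solve(A, b)
```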
Zero-Knowledge Proofs (ZKPs) are a cryptographic tool that enables one party (a prover) to prove to another (a verifier) that a statement is true, without requiring the prover to disclose any data to the verifier. ZKPs have many use cases, such as letting clients delegate computation to servers with cryptographic correctness guarantees, while enabling the server to use secret data in these computa...
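To make the prover/verifier roles concrete, here is a toy Schnorr identification protocol in Python, a classic zero-knowledge proof of knowledge of a discrete log. The parameters are tiny and insecure, chosen only for illustration, and this is not the ZKP scheme the abstract's accelerator targets.

```python
import secrets

# Toy group: p = 23, q = 11 (q divides p-1), and g = 2 has order q mod p.
p, q, g = 23, 11, 2

x = secrets.randbelow(q - 1) + 1    # prover's secret
y = pow(g, x, p)                    # public key; prover claims to know x = log_g(y)

# Commit: prover picks a random nonce r and sends t = g^r.
r = secrets.randbelow(q)
t = pow(g, r, p)

# Challenge: verifier sends a random c.
c = secrets.randbelow(q)

# Response: prover sends s = r + c*x mod q; s alone reveals nothing about x.
s = (r + c * x) % q

# Verify: g^s == t * y^c (mod p), since g^(r + c*x) = g^r * (g^x)^c.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("statement verified without revealing x")
```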
Accelerating matrix multiplication is crucial to achieve high performance in many application domains, including neural networks, graph analytics, and scientific computing. These applications process matrices with a wide range of sparsities, from completely dense to highly sparse. Ideally, a single accelerator should handle matrices of all sparsity levels well. However, prior matrix multiplication...
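One widely used formulation that copes with a range of sparsities is Gustavson's row-wise algorithm. A minimal Python sketch over dict-of-rows matrices follows; the representation and names are illustrative, not the accelerator's design.

```python
def spgemm(A, B):
    """C = A @ B where each matrix maps row -> {col: value} (nonzeros only)."""
    C = {}
    for i, row in A.items():
        acc = {}                                   # sparse accumulator for row i of C
        for k, a_ik in row.items():                # each nonzero A[i][k] ...
            for j, b_kj in B.get(k, {}).items():   # ... scales row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 2, 2: 1}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {2: 6}}
print(spgemm(A, B))   # {0: {1: 8, 2: 6}, 1: {0: 15}}
```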
Fast simulation of digital circuits is crucial to build modern chips. But RTL (Register-Transfer-Level) simulators are slow, as they cannot exploit multicores well. Slow simulation lengthens chip design time and makes bugs more frequent. We present ASH, a parallel architecture tailored to simulation workloads. ASH consists of a tightly codesigned hardware architecture and compiler for RTL simulatio...
Solving sparse systems of linear equations is a crucial component in many science and engineering problems, like simulating physical systems. Sparse matrix factorization dominates a large class of these solvers. Efficient factorization algorithms have two key properties that make them challenging for existing architectures: they consist of small tasks that are structured and compute-intensive, and...
Sparse CNNs dramatically reduce computation and storage costs over dense ones. But sparsity also makes CNNs more data-intensive, as each value is reused fewer times. Thus, current sparse CNN accelerators, which process one layer at a time, are bottlenecked by memory traffic. We present ISOSceles, a new sparse CNN accelerator that dramatically reduces data movement through inter-layer pipelining: ov...
Irregular applications are increasingly common in diverse domains, like graph analytics and sparse linear algebra. Accelerating these applications is challenging because of their unpredictable data reuse and control flow. Recent work has proposed hardware support for fine-grain pipeline parallelism, hiding long latencies by decoupling irregular applications into pipeline stages. However, this prio...
Benchmarks that closely match the behavior of production workloads are crucial to design and provision computer systems. However, current approaches fall short: First, open-source benchmarks use public datasets that cause different behavior from production workloads. Second, blackbox workload cloning techniques generate synthetic code that imitates the target workload, but the resulting program fa...
Fully homomorphic encryption (FHE) allows computing on encrypted data, enabling secure offloading of computation to untrusted servers. Though it provides ideal security, FHE is prohibitively expensive when executed in software. These overheads are a major barrier to FHE's widespread adoption. We present F1, the first FHE accelerator that is capable of executing full FHE programs. F1 builds on an i...
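FHE itself is far too involved for a short sketch, but the core idea of computing on ciphertexts can be illustrated with the additively homomorphic Paillier scheme. The sketch below uses toy, insecure parameters and is offered for intuition only; it is not F1's scheme (F1 targets full FHE, which also supports multiplication). Requires Python 3.9+ for math.lcm and modular-inverse pow.

```python
import math, secrets

# Toy Paillier keypair (insecure sizes, for illustration only).
p_, q_ = 293, 433
n = p_ * q_
n2 = n * n
g = n + 1
lam = math.lcm(p_ - 1, q_ - 1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse, Python 3.8+

def enc(m):
    while True:                        # pick r coprime to n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = enc(7), enc(35)
c_sum = (c1 * c2) % n2                 # multiplying ciphertexts adds plaintexts
assert dec(c_sum) == 42
print("7 + 35 computed under encryption:", dec(c_sum))
```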
We live in a new Cambrian Explosion of hardware devices. The end of conventional processor scaling has driven research and industry practice to explore a new generation of approaches. The old DNA of architecture design, including vectors, threads, shared or private memories, coherence or message passing, dataflow or von Neumann execution, is hybridized together in new and exciting ways. Each new ...
Irregular applications, such as graph analytics and sparse linear algebra, exhibit frequent indirect, data-dependent accesses to single or short sequences of elements that cause high main memory traffic and limit performance. Data compression is a promising way to accelerate irregular applications by reducing memory traffic. However, software compression adds substantial overheads, and prior hardw...
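For intuition, one common way to compress the sorted index arrays that irregular applications stream (e.g., neighbor lists) is delta encoding plus fixed-width bit-packing. A small Python sketch follows; the format and names are illustrative, not the paper's hardware scheme.

```python
def delta_bitpack(sorted_ids):
    """Compress a sorted index array as a base value plus fixed-width deltas."""
    base = sorted_ids[0]
    deltas = [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    width = max(max(deltas, default=0).bit_length(), 1)   # bits per delta
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return base, width, len(deltas), packed

def delta_unpack(base, width, count, packed):
    ids, cur, mask = [base], base, (1 << width) - 1
    for i in range(count):
        cur += (packed >> (i * width)) & mask
        ids.append(cur)
    return ids

neighbors = [1024, 1026, 1031, 1032, 1040]   # e.g., a vertex's sorted neighbor list
assert delta_unpack(*delta_bitpack(neighbors)) == neighbors
```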
We offer the first security analysis of cache compression, a promising architectural technique that is likely to appear in future mainstream processors. We find that cache compression has novel security implications because the compressibility of a cache line reveals information about its contents. Compressed caches introduce a new side channel that is especially insidious, as simply storing data ...
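The root cause is easy to demonstrate in software: compressed size is a function of content, so an observer who learns sizes (or compression latency) learns something about the data. A toy Python illustration follows; it uses zlib rather than the paper's cache-level compression, purely to show the principle.

```python
import os, zlib

secret_redundant = b"A" * 64      # highly compressible "cache line"
secret_random = os.urandom(64)    # incompressible one

# The two 64-byte secrets produce very different compressed sizes,
# so size alone distinguishes them: a content-dependent side channel.
print(len(zlib.compress(secret_redundant)))   # small: redundancy compresses away
print(len(zlib.compress(secret_random)))      # near or above 64: incompressible
```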
Applications with irregular memory accesses and control flow, such as graph algorithms and sparse linear algebra, use high-performance cores very poorly and suffer from dismal IPC. Instruction latencies are so large that even SMT cores running multiple data-parallel threads suffer poor utilization. We find that irregular applications have abundant pipeline parallelism that can be used to boost util...
Multicores are now ubiquitous, but programmers still write sequential code. Speculative parallelization is an enticing approach to parallelize code while retaining the ease of sequential programming, making parallelism pervasive. However, prior speculative parallelizing compilers and architectures achieved limited speedups due to high costs of recovering from misspeculation and hardware scalabilit...
Graph processing is increasingly bottlenecked by main memory accesses. On-chip caches are of little help because the irregular structure of graphs causes seemingly random memory references. However, most real-world graphs offer significant potential locality; it is just hard to predict ahead of time. In practice, graphs have well-connected regions where relatively few vertices share edges with many...
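A software analogue of exploiting that latent structure is vertex reordering: relabeling vertices so that neighbors get nearby IDs, turning scattered references into nearly sequential ones. A minimal BFS-reordering sketch in Python follows; it is illustrative only, not the paper's mechanism, and the gap metric is a crude locality proxy.

```python
import random
from collections import deque

def bfs_order(adj, start):
    """Relabel vertices in BFS discovery order, giving neighbors nearby new IDs."""
    order, seen, dq = {}, {start}, deque([start])
    while dq:
        v = dq.popleft()
        order[v] = len(order)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                dq.append(w)
    return order

def avg_neighbor_gap(adj, order):
    """Average |ID(v) - ID(w)| over all edges: a crude proxy for access locality."""
    gaps = [abs(order[v] - order[w]) for v in adj for w in adj[v]]
    return sum(gaps) / len(gaps)

# A ring graph with randomly shuffled vertex IDs: the structure has locality,
# but the original ID assignment hides it.
n = 64
ids = list(range(n))
random.shuffle(ids)
adj = {ids[i]: [ids[(i - 1) % n], ids[(i + 1) % n]] for i in range(n)}

print(avg_neighbor_gap(adj, {v: v for v in adj}))      # large: roughly n/3
print(avg_neighbor_gap(adj, bfs_order(adj, ids[0])))   # small: about 2
```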
Conventional multicores rely on deep cache hierarchies to reduce data movement. Recent advances in die stacking have enabled near-data processing (NDP) systems that reduce data movement by placing cores close to memory. NDP cores enjoy cheaper memory accesses and are more area-constrained, so they use shallow cache hierarchies instead. Since neither shallow nor deep hierarchies work well for all a...
Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., parallel I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) systems support s...
We present Hotpads, a new memory hierarchy designed from the ground up for modern, memory-safe languages like Java, Go, and Rust. Memory-safe languages hide the memory layout from the programmer. This prevents memory corruption bugs and enables automatic memory management. Hotpads extends the same insight to the memory hierarchy: it hides the memory layout from software and takes control over it, ...
Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We presen...
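To see why way-partitioning is coarse, consider this toy set-associative cache simulator in Python (a generic sketch, not the paper's design): partitions come only in whole-way units, so an A-way cache supports at most A partitions, and each partition's associativity shrinks as more are carved out.

```python
class WayPartitionedCache:
    """Set-associative LRU cache whose ways are statically split among partitions."""
    def __init__(self, num_sets, ways_per_partition):
        self.num_sets = num_sets
        self.ways = ways_per_partition   # e.g., {"A": 6, "B": 2} in an 8-way cache
        # One LRU list per (set, partition); lists hold at most that partition's ways.
        self.sets = {(s, p): [] for s in range(num_sets) for p in ways_per_partition}

    def access(self, partition, addr):
        s = addr % self.num_sets
        lru = self.sets[(s, partition)]
        if addr in lru:                          # hit: move to MRU position
            lru.remove(addr)
            lru.append(addr)
            return "hit"
        if len(lru) == self.ways[partition]:     # miss: evict this partition's LRU line
            lru.pop(0)
        lru.append(addr)
        return "miss"

cache = WayPartitionedCache(num_sets=64, ways_per_partition={"A": 6, "B": 2})
print(cache.access("A", 0x1040), cache.access("A", 0x1040))  # miss hit
```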
Memoization improves performance and saves energy by caching and reusing the outputs of repetitive computations. Prior work has proposed software and hardware memoization techniques, but both have significant drawbacks. Software memoization suffers from high runtime overheads, and is thus limited to long computations. Conventional hardware memoization techniques achieve low overheads and can memoiz...
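Software memoization in its simplest form, for contrast with the overheads the abstract mentions, is a one-line decorator in standard Python:

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # cache maps argument tuples -> previously computed results
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(200))   # fast: each subproblem is computed once, then reused
# The dictionary lookup on every call is exactly the software overhead that
# makes this technique pay off only when the memoized computation is long enough.
```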
Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods. In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We ...
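One classic lens for this kind of analysis is reuse distance: the number of distinct addresses touched between consecutive uses of the same address, which predicts hit rates across cache sizes. A small (quadratic-time) Python sketch follows; it is a generic illustration, not the paper's profiler.

```python
def reuse_distances(trace):
    """For each access, count distinct addresses since its previous use (None = cold)."""
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            window = trace[last_seen[addr] + 1 : i]   # accesses since the last use
            dists.append(len(set(window)))
        else:
            dists.append(None)                         # first touch: cold access
        last_seen[addr] = i
    return dists

print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
# [None, None, None, 2, 2, 0]
```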
The following topics are dealt with: microprocessor chips; cache storage; multiprocessing systems; DRAM chips; storage management; graphics processing units; power aware computing; parallel architectures; memory architecture; and integrated circuit design.
Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy eve...