• ### Accurate Parallel Floating-Point Accumulation

Publication Year: 2016, Page(s):3224 - 3238
Using parallel associative reduction, iterative refinement, and conservative early termination detection, we show how to use tree-reduce parallelism to compute correctly rounded floating-point sums in $O(\log N)$ depth.
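
As a rough illustration of the tree-reduce idea, the sketch below pairs a pairwise (tree) reduction with Knuth's TwoSum error-free transformation. It is not the paper's algorithm: it omits the iterative refinement and early termination detection mentioned in the abstract, and it does not guarantee a correctly rounded result in general. The names `two_sum` and `tree_sum` are hypothetical, and the levels are run sequentially here, although the pairs within each level are independent and could be combined in parallel.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def tree_sum(values):
    """Pairwise reduction over (sum, accumulated error) pairs.

    Assumption for illustration: rounding errors are collected in a single
    float per node and added back once at the root.
    """
    pairs = [(float(v), 0.0) for v in values]
    if not pairs:
        return 0.0
    while len(pairs) > 1:
        nxt = []
        # Each pair in this level is independent, so a parallel
        # implementation could process the whole level at once.
        for i in range(0, len(pairs) - 1, 2):
            (s1, e1), (s2, e2) = pairs[i], pairs[i + 1]
            s, e = two_sum(s1, s2)
            nxt.append((s, e + e1 + e2))
        if len(pairs) % 2:
            nxt.append(pairs[-1])  # odd element carries over to the next level
        pairs = nxt
    s, e = pairs[0]
    return s + e
```

With $N$ inputs the loop runs about $\log_2 N$ levels, which is where the logarithmic depth comes from. For the input `[1e16, 1.0, -1e16, 1.0]`, plain left-to-right addition loses one of the two small terms, while the compensated tree keeps both.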

• ### A Matrix Decomposition Method for Optimal Normal Basis Multiplication

Publication Year: 2016, Page(s):3239 - 3250
We introduce a matrix decomposition method and prove that multiplication in GF$(2^k)$ with a Type 1 optimal normal basis can be performed using $k^2-1$

• ### A Partial Carry-Save On-the-Fly Correction Multispeculative Multiplier

Publication Year: 2016, Page(s):3251 - 3264
Functional Units that are designed to receive inputs and produce outputs using a non-redundant format typically exhibit an inferior performance. In order to overcome this limitation, the carry-save and partial carry-save formats have been proposed. Both approaches are very suitable when implementing addition trees. Nevertheless, if there are multiplications in the datapath, the inputs to the multiplier must be converted to a non-redundant format.

• ### A Resilient Routing Algorithm with Formal Reliability Analysis for Partially Connected 3D-NoCs

Publication Year: 2016, Page(s):3265 - 3279
3D ICs can take advantage of a scalable communication platform, commonly referred to as the Networks-on-Chip (NoC). In the basic form of 3D-NoC, all routers are vertically connected. Partially connected 3D-NoC has emerged because of physical limitations of using vertical links. Routing is of great importance in such partially connected architectures. A high-performance, fault-tolerant and adaptive routing algorithm is proposed for partially connected 3D-NoCs.

• ### Bio-Inspired Load-Balancing Framework for Loosely Coupled Heterogeneous Server Systems

Publication Year: 2016, Page(s):3280 - 3292
Balancing load among servers is an important research challenge for a large-scale loosely coupled heterogeneous server system (LCHSS), to improve both the total throughput of the system and the quality of service experienced by clients. In practical terms, a load-balancing method for an LCHSS has to drive servers to underloaded states without unnecessary load migrations among servers.

• ### Compressed Signal Processing on Nyquist-Sampled Signals

Publication Year: 2016, Page(s):3293 - 3303
Pattern-recognition algorithms from the domain of machine learning play a prominent role in embedded sensing systems, in order to derive inferences from sensor data. Very often, such systems face severe energy constraints. The focus of this work is to mitigate the computational energy by exploiting a form of compression which preserves a similarity metric widely used for pattern recognition.

• ### Dynamic Resource Allocation for MapReduce with Partitioning Skew

Publication Year: 2016, Page(s):3304 - 3317
MapReduce has become a prevalent programming model for building data processing applications in the cloud. While being widely used, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. Existing solutions follow a similar principle that repartitions workload among reduce tasks.

• ### ELmD: A Pipelineable Authenticated Encryption and Its Hardware Implementation

Publication Year: 2016, Page(s):3318 - 3331
Authenticated encryption schemes which resist misuse of nonce at some desired level of privacy are two-pass or Mac-then-Encrypt constructions (inherently inefficient but provide full privacy) and online constructions like McOE, sponge-type authenticated encryptions (such as duplex) and COPA. Only the last one is almost parallelizable except that for associated data processing, the final block-cipher call depends on all previous block-cipher calls.

• ### Hardware-Based Malware Detection Using Low-Level Architectural Features

Publication Year: 2016, Page(s):3332 - 3344
Security exploits and ensuant malware pose an increasing challenge to computing systems as the variety and complexity of attacks continue to increase. In response, software-based malware detection tools have grown in complexity, thus making it computationally difficult to use them to protect systems in real-time. Therefore, software detectors are applied selectively and at a low frequency, creating opportunities for malware to inflict damage.

• ### Improving Bit Flip Reduction for Biased and Random Data

Publication Year: 2016, Page(s):3345 - 3356
Nonvolatile memory technologies such as Spin-Transfer Torque Random Access Memory (STT-RAM) and Phase Change Memory (PCM) are emerging as promising replacements for DRAM. Before STT-RAM and PCM can be deployed in functional systems, a number of remaining challenges must be addressed. Specifically, both require relatively high write energy, STT-RAM suffers from high bit error rates, and PCM suffers from limited write endurance.

• ### Multicore-Aware Virtual Machine Placement in Cloud Data Centers

Publication Year: 2016, Page(s):3357 - 3369
Finding the best way to map virtual machines (VMs) to physical machines (PMs) in a cloud data center is an important optimization problem, with significant impact on costs, performance, and energy consumption. In most situations, the computational capacity of PMs and the computational load of VMs are a vital aspect to consider in the VM-to-PM mapping. Previous work modeled computational capacity as a single dimension.

• ### Parallel Algorithms for Generating Harmonised State Identifiers and Characterising Sets

Publication Year: 2016, Page(s):3370 - 3383
Many automated finite state machine (FSM) based test generation algorithms require that a characterising set or a set of harmonised state identifiers is first produced. The only previously published algorithms for partial FSMs were brute-force algorithms with exponential worst case time complexity. This paper presents polynomial time algorithms and also massively parallel implementations of both the characterising set and harmonised state identifiers generation problems.

• ### Reducing the Memory Bandwidth Overheads of Hardware Security Support for Multi-Core Processors

Publication Year: 2016, Page(s):3384 - 3397
To prevent physical attacks on systems, secure processors have been proposed to reduce the trusted computing base to the processor itself. In a secure processor, all off-chip data are encrypted and their integrity is protected. This paper investigates how the limited memory bandwidth of multi-core processors affects the design of secure processors. Although the performance of a single-core secure processor is not significantly affected by the limited memory bandwidth, the performance of a multi-core secure processor is significantly affected.

• ### Scalable Power Management for On-Chip Systems with Malleable Applications

Publication Year: 2016, Page(s):3398 - 3412
We present a scalable Dynamic Power Management (DPM) scheme where malleable applications may change their degree of parallelism at run time depending upon the workload and performance constraints. We employ a per-application predictive power manager that autonomously controls the power states of the cores with the goal of energy efficiency. Furthermore, our DPM allows the applications to lend their idle cores to other applications.

• ### Secure and Private RFID-Enabled Third-Party Supply Chain Systems

Publication Year: 2016, Page(s):3413 - 3426
Radio Frequency Identification (RFID) is a key emerging technology for supply chain systems. By attaching RFID tags to various products, product-related data can be efficiently indexed, retrieved and shared among multiple participants involved in an RFID-enabled supply chain. The flexible data access property, however, raises security and privacy concerns. In this paper, we target security and privacy issues in third-party RFID-enabled supply chain systems.

• ### Statistical Cache Bypassing for Non-Volatile Memory

Publication Year: 2016, Page(s):3427 - 3440
With the increasing data throughput requirement, non-volatile memories, such as STT-RAM, PCM and RRAM, have become very competitive designs as on-chip caches in chip-multi-processors (CMPs). Since the write operations are more expensive in an asymmetric-access cache, it is more valuable to justify the data allocation. However, the asymmetric-access property of non-volatile memory is not well addressed in existing cache bypassing techniques.

• ### Task Mapping for Redundant Multithreading in Multi-Cores with Reliability and Performance Heterogeneity

Publication Year: 2016, Page(s):3441 - 3455
Due to the architectural design, process variations and aging, individual cores in many-core systems exhibit heterogeneous performance. In many-core systems, a commonly adopted soft error mitigation technique is Redundant Multithreading (RMT) that achieves error detection and recovery through redundant thread execution on different cores for an application. However, task mapping approaches for RMT in heterogeneous many-core systems have not been well studied.

• ### TransMap: Transformation Based Remapping and Parallelism for High Utilization and Energy Efficiency in CGRAs

Publication Year: 2016, Page(s):3456 - 3469
In the era of platforms hosting multiple applications with arbitrary inter-application communication and computation patterns, compile-time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., power, performance).

• ### Versatile Direct and Transpose Matrix Multiplication with Chained Operations: An Optimized Architecture Using Circulant Matrices

Publication Year: 2016, Page(s):3470 - 3479
With growing demands in real-time control, classification or prediction, algorithms become more complex while low power and small size devices are required. Matrix multiplication (direct or transpose) is common for such computation algorithms. In numerous algorithms, it is also required to perform matrix multiplication repeatedly, where the result of a multiplication is further multiplied again. This paper proposes an optimized architecture for direct and transpose matrix multiplication with chained operations using circulant matrices.

• ### Workload Adaptive Shared Memory Management for High Performance Network I/O in Virtualized Cloud

Publication Year: 2016, Page(s):3480 - 3494
This paper presents the design and implementation of MemPipe, a dynamic shared memory management system for high performance network I/O among virtual machines (VMs) located on the same host. MemPipe delivers efficient inter-VM communication with three unique features. First, MemPipe employs an inter-VM shared memory pipe to enable high throughput data delivery for both TCP and UDP workloads among VMs.

• ### Binary-Ternary Plus-Minus Modular Inversion in RNS

Publication Year: 2016, Page(s):3495 - 3501
A fast RNS modular inversion for finite field arithmetic was published at the CHES 2013 conference. It is based on the binary version of the plus-minus Euclidean algorithm. In the context of elliptic curve cryptography (i.e., 160-550 bit finite fields), it significantly speeds up modular inversions. In this paper, we propose an improved version based on both radix 2 and radix 3. This new algorithm is faster than the previous one.
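
For orientation, the sketch below shows a plain textbook binary extended Euclidean inversion on ordinary integers, using only halvings, additions, and subtractions. It is a simpler stand-in, not the plus-minus variant and not the RNS or radix-3 algorithm the abstract describes. The function name `binary_mod_inverse` is hypothetical, and an odd modulus is assumed.

```python
import math


def binary_mod_inverse(a, p):
    """Return a^{-1} mod p for odd p > 1, using halvings and subtractions only.

    Loop invariants: x1 * a ≡ u (mod p) and x2 * a ≡ v (mod p).
    """
    if p <= 1 or p % 2 == 0:
        raise ValueError("modulus must be odd and greater than 1")
    a %= p
    if math.gcd(a, p) != 1:
        raise ValueError("a is not invertible modulo p")
    u, v = a, p
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            # Halve x1 modulo p: add p first if x1 is odd (p is odd).
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return (x1 if u == 1 else x2) % p
```

A quick sanity check is that `(a * binary_mod_inverse(a, p)) % p` equals 1 for every `a` coprime to `p`.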

• ### Health Status Assessment and Failure Prediction for Hard Drives with Recurrent Neural Networks

Publication Year: 2016, Page(s):3502 - 3508
Recently, in order to improve reactive fault tolerance techniques in large scale storage systems, researchers have proposed various statistical and machine learning methods based on SMART attributes. Most of these studies have focused on predicting failures of hard drives, i.e., labeling the status of a hard drive as "good" or not. However, in real-world storage systems, hard drives may experience periods of degraded health status before complete failure.

