IEEE Transactions on Computers

Issue 1 • January 2010

Contents (19 items)
  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • A Novel Weighted-Graph-Based Grouping Algorithm for Metadata Prefetching

    Page(s): 1 - 15

    Although data prefetching algorithms have been extensively studied for years, no counterpart research has addressed metadata access performance. Existing data prefetching algorithms either lack emphasis on group prefetching or carry a high level of computational complexity, and therefore do not work well for metadata prefetching. An efficient, accurate, and distributed metadata-oriented prefetching scheme is thus critical to improving overall performance in large distributed storage systems. In this paper, we present a novel weighted-graph-based prefetching technique, built on both direct and indirect successor relationships, to reap performance benefits from prefetching specifically for clustered metadata servers, an arrangement envisioned as necessary for petabyte-scale distributed storage systems. Extensive trace-driven simulations show that by adopting our new metadata prefetching algorithm, the miss rate for metadata accesses on the client side can be effectively reduced, while the average response time of metadata operations can be cut by up to 67 percent, compared with the legacy LRU caching algorithm and existing state-of-the-art prefetching algorithms.
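    The core idea of a weighted successor graph can be sketched as follows: edges record how often one object is accessed shortly after another (direct and indirect successors), and the heaviest outgoing edges become prefetch candidates. This is a minimal illustrative toy, not the paper's actual algorithm; the class name, window size, and weighting rule are all assumptions.

```python
from collections import defaultdict

class SuccessorGraphPrefetcher:
    """Toy weighted successor graph: edge weights count how often one object
    is accessed shortly after another (illustrative sketch only)."""

    def __init__(self, window=2):
        self.window = window             # how far back successor links reach
        self.weights = defaultdict(int)  # (predecessor, successor) -> weight
        self.history = []

    def record(self, obj):
        # Link obj to the last `window` accesses (direct + indirect successors);
        # nearer predecessors get heavier edges.
        for dist, pred in enumerate(reversed(self.history[-self.window:]), 1):
            self.weights[(pred, obj)] += self.window - dist + 1
        self.history.append(obj)

    def prefetch(self, obj, k=2):
        # Return the k highest-weight successors of obj as prefetch candidates.
        succ = [(w, s) for (p, s), w in self.weights.items() if p == obj]
        return [s for w, s in sorted(succ, reverse=True)[:k]]

pf = SuccessorGraphPrefetcher()
for o in ["a", "b", "c", "a", "b", "d", "a", "b", "c"]:
    pf.record(o)
print(pf.prefetch("a"))  # ['b', 'c'] -- "b" almost always follows "a"
```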

  • Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs

    Page(s): 16 - 28

    Continuous improvements in integration scale have led major microprocessor vendors to designs that integrate several processing cores on the same chip. Chip multiprocessors (CMPs) constitute a good alternative to traditional monolithic designs for several reasons, among them better performance, scalability, and performance/energy ratio. On the other hand, higher clock frequencies and increasing transistor density have made power dissipation and temperature critical design issues in current and future architectures. Previous studies have shown that the interconnection network of a CMP has a significant impact on both overall performance and energy consumption. Moreover, the wires used in such an interconnect can be designed with varying latency, bandwidth, and power characteristics. In this work, we show how messages can be efficiently managed, from the point of view of both performance and energy, in tiled CMPs using a heterogeneous interconnect. Our proposal consists of two approaches. The first is Reply Partitioning, a technique that splits a reply carrying data into a short Partial Reply message, which carries the subblock of the cache line containing the word requested by the processor, plus an Ordinary Reply with the full cache line. This technique allows all messages used to ensure coherence between the L1 caches of a CMP to be classified into two groups: critical and short, and noncritical and long. The second is the use of a heterogeneous interconnection network composed of low-latency wires for critical messages and low-energy wires for noncritical ones. Detailed simulations of 8- and 16-core CMPs show that our proposal obtains average savings of 7 percent in execution time and 70 percent in the Energy-Delay squared Product (ED2P) metric of the interconnect over previous works (24 to 30 percent average ED2P improvement for the full CMP). Additionally, a sensitivity analysis shows that although execution time is minimized with 16-byte subblocks, the best choice for the ED2P metric is the 4-byte subblock configuration, which yields an additional 2 percent improvement over the 16-byte configuration in the ED2P metric of the full CMP.
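    The Reply Partitioning split described above can be sketched in a few lines: the critical, short message carries only the subblock containing the requested word, while the full line travels as a noncritical, long message, and wire class is chosen by criticality. Line and subblock sizes and the message fields are illustrative assumptions, not the paper's exact formats.

```python
LINE_SIZE = 64   # cache line size in bytes (assumed)
SUBBLOCK = 16    # bytes carried by the critical Partial Reply (assumed)

def split_reply(line: bytes, requested_offset: int):
    """Split one data reply into a critical Partial Reply (short) and a
    noncritical Ordinary Reply (long), per the Reply Partitioning idea."""
    assert len(line) == LINE_SIZE
    start = (requested_offset // SUBBLOCK) * SUBBLOCK
    partial = {"kind": "partial_reply",          # critical + short
               "payload": line[start:start + SUBBLOCK]}
    ordinary = {"kind": "ordinary_reply",        # noncritical + long
                "payload": line}
    return partial, ordinary

def route(msg):
    # Heterogeneous interconnect: pick the wire class by message criticality.
    return "low-latency wires" if msg["kind"] == "partial_reply" else "low-energy wires"

line = bytes(range(64))
partial, ordinary = split_reply(line, requested_offset=37)
print(route(partial), len(partial["payload"]))   # low-latency wires 16
print(route(ordinary), len(ordinary["payload"])) # low-energy wires 64
```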

  • Network-on-Chip Hardware Accelerators for Biological Sequence Alignment

    Page(s): 29 - 41

    The most pervasive compute operation in almost all bioinformatics applications is pairwise sequence homology detection (or sequence alignment). Because sequence databases are growing exponentially, computing this operation at a large scale is becoming expensive. An effective approach to speeding up this operation is to integrate a very large number of processing elements in a single chip so that the massive fine-grain parallelism inherent in several bioinformatics applications can be exploited efficiently. Network-on-chip (NoC) is a very efficient method to achieve such large-scale integration. In this work, we propose to bridge the gap between data generation and processing in bioinformatics applications by designing NoC architectures for the sequence alignment operation. Specifically, we 1) propose optimized NoC architectures for different sequence alignment algorithms that were originally designed for distributed-memory parallel computers and 2) provide a thorough comparative evaluation of their respective performance and energy dissipation. While accelerators using other hardware architectures such as FPGAs, general-purpose graphics processing units (GPUs), and the Cell Broadband Engine (CBE) have previously been designed for sequence alignment, the NoC paradigm enables integration of a much larger number of processing elements on a single chip and also offers a higher degree of flexibility in placing them along the die to suit the underlying algorithm. The results show that our NoC-based implementations can provide 10^2- to 10^3-fold speedups over other hardware accelerators and above 10^4-fold speedups over traditional CPU architectures. This is significant because it will drastically reduce the time required to perform the millions of alignment operations that are typical in large-scale bioinformatics projects. To the best of our knowledge, this work embodies the first attempt to accelerate a bioinformatics application using NoC.

  • Communication-Aware Load Balancing for Parallel Applications on Clusters

    Page(s): 42 - 52

    Cluster computing has emerged as a primary and cost-effective platform for running parallel applications, including communication-intensive applications that transfer a large amount of data among the nodes of a cluster via the interconnection network. Conventional load balancers have proven effective in increasing the utilization of CPU, memory, and disk I/O resources in a cluster. However, most existing load-balancing schemes ignore network resources, leaving an opportunity to improve the effective bandwidth of networks on clusters running parallel applications. For this reason, we propose a communication-aware load-balancing technique that improves the performance of communication-intensive applications by increasing the effective utilization of networks in cluster environments. To facilitate the proposed load-balancing scheme, we introduce a behavior model for parallel applications with large network, CPU, memory, and disk I/O resource requirements. Our load-balancing scheme can make full use of this model to quickly and accurately determine the load induced by a variety of parallel applications. Simulation results generated from a diverse set of both synthetic bulk-synchronous and real parallel applications on a cluster show that our scheme significantly improves performance, in terms of slowdown and turnaround time, over existing schemes by up to 206 percent (with an average of 74 percent) and 235 percent (with an average of 82 percent), respectively.
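    The placement decision at the heart of such a scheme can be sketched as a composite load metric that, unlike CPU-only balancers, also weighs network demand. The weights and node figures below are made-up illustrative values, not the paper's behavior model.

```python
def composite_load(node, w_cpu=1.0, w_mem=1.0, w_io=1.0, w_net=2.0):
    """Composite load combining CPU, memory, disk I/O, and network demand.
    Weighting the network term higher reflects a communication-intensive
    workload; real weights would come from the application behavior model."""
    return (w_cpu * node["cpu"] + w_mem * node["mem"]
            + w_io * node["io"] + w_net * node["net"])

def pick_node(nodes):
    # Place the next task on the node with the lowest composite load.
    return min(nodes, key=composite_load)["name"]

nodes = [
    {"name": "n1", "cpu": 0.9, "mem": 0.4, "io": 0.2, "net": 0.1},
    {"name": "n2", "cpu": 0.3, "mem": 0.3, "io": 0.3, "net": 0.8},
    {"name": "n3", "cpu": 0.5, "mem": 0.5, "io": 0.1, "net": 0.2},
]
print(pick_node(nodes))  # n3: moderate CPU but little network pressure
```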

  • Improving Flash Wear-Leveling by Proactively Moving Static Data

    Page(s): 53 - 65

    Motivated by the strong demand for flash memory with enhanced reliability, this work aims to improve flash-memory endurance without substantially increasing overhead and without excessively modifying popular designs such as the flash translation layer (FTL), the NAND flash translation layer (NFTL), and the block-level flash translation layer (BL). A wear-leveling mechanism that moves data that are rarely updated is proposed to distribute wear-leveling actions over the entire physical address space, so that static or rarely updated data can be proactively moved and memory-space requirements can be minimized. The properties of the mechanism are then explored under various implementation considerations. A series of experiments based on a realistic trace demonstrates significantly improved endurance for FTL, NFTL, and BL with limited system overhead.
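    The proactive-movement idea can be sketched simply: when the erase-count gap between the most-worn and least-worn blocks exceeds a threshold, the (presumed static) data in the least-worn block is moved into the worn block, so future writes land on the younger one. The threshold, data structures, and accounting are illustrative assumptions, not the paper's mechanism.

```python
def level_wear(erase_counts, data, threshold=4):
    """erase_counts: per-block erase counts; data: per-block payloads.
    Returns True if a proactive static-data move was performed."""
    hot = max(range(len(erase_counts)), key=erase_counts.__getitem__)
    cold = min(range(len(erase_counts)), key=erase_counts.__getitem__)
    if erase_counts[hot] - erase_counts[cold] <= threshold:
        return False
    # Move the static (cold) data into the worn block: one extra erase now,
    # but the worn block then holds rarely updated data and stops wearing,
    # while future hot writes hit the young block instead.
    data[hot], data[cold] = data[cold], data[hot]
    erase_counts[hot] += 1
    return True

counts = [12, 3, 7]
blocks = ["hot-data", "static-data", "warm-data"]
print(level_wear(counts, blocks, threshold=4))  # True: gap 12 - 3 > 4
print(blocks, counts)
```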

  • Model-Driven System Capacity Planning under Workload Burstiness

    Page(s): 66 - 80

    In this paper, we define and study a new class of capacity planning models called MAP queueing networks. MAP queueing networks provide the first analytical methodology to accurately describe and predict the performance of complex systems operating under bursty workloads, such as multitier architectures or storage arrays. Burstiness is a feature that significantly degrades system performance and that cannot be captured explicitly by existing capacity planning models. MAP queueing networks address this limitation by describing computer systems as closed networks of servers whose service times are Markovian Arrival Processes (MAPs), a class of Markov-modulated point processes that can model general distributions and burstiness. In this paper, we show that MAP queueing networks provide reliable performance predictions even if the service processes are bursty. We propose a methodology to solve MAP queueing networks by two state-space transformations, which we call Linear Reduction (LR) and Quadratic Reduction (QR). These transformations dramatically decrease the number of states in the underlying Markov chain of the queueing network model. From these reduced state spaces, we obtain two classes of bounds on arbitrary performance indexes, e.g., throughput, response time, and utilization. Numerical experiments show that the LR and QR bounds achieve good accuracy. We also illustrate the high effectiveness of the LR and QR bounds in the performance analysis of a real multitier architecture subject to TPC-W workloads that are characterized as bursty. These results promote MAP queueing networks as a new class of robust capacity planning models.

  • Conversion Algorithms and Implementations for Koblitz Curve Cryptography

    Page(s): 81 - 92

    In this paper, we discuss conversions between integers and tau-adic expansions, and we provide efficient algorithms and hardware architectures for these conversions. The results are significant for elliptic curve cryptography using Koblitz curves, a family of elliptic curves offering faster computation than general elliptic curves. In order to enable these faster computations, however, scalars need to be reduced and represented using a special base-tau expansion, so efficient conversion algorithms and implementations are necessary. Existing conversion algorithms require several complicated operations, such as multiprecision multiplications and computations with large rationals, resulting in slow and large implementations in hardware and on microcontrollers with limited instruction sets. Our algorithms are designed to use only simple operations, such as additions and shifts, which are easily implementable on practically all platforms. We demonstrate the practicality of the new algorithms by implementing them on Altera Stratix II FPGAs. The implementations considerably improve both computation speed and area requirements compared to existing solutions.
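    To make the base-tau expansion concrete: the classic tau-adic NAF recoding (Solinas) converts an integer to digits in {-1, 0, +1} over the base tau with tau^2 = mu*tau - 2, using only additions, shifts, and parity tests, exactly the kind of simple operations the abstract refers to. This is a sketch of that standard textbook recoding, not the paper's specific algorithms or hardware formulation.

```python
def tnaf(n, mu=1):
    """tau-adic NAF digits (least-significant first) of integer n for a
    Koblitz curve with tau^2 = mu*tau - 2, mu in {+1, -1}. Uses only
    additions, shifts, and parity tests."""
    n0, n1 = n, 0            # represent the remaining value as n0 + n1*tau
    digits = []
    while n0 != 0 or n1 != 0:
        if n0 & 1:           # odd: emit a nonzero digit in {-1, +1}
            u = 2 - ((n0 - 2 * n1) % 4)
            n0 -= u
        else:
            u = 0
        digits.append(u)
        # Divide (n0 + n1*tau) by tau; n0 is even here, so >> 1 is exact.
        n0, n1 = n1 + mu * (n0 >> 1), -(n0 >> 1)
    return digits

def eval_tnaf(digits, mu=1):
    """Horner evaluation back in Z[tau]: multiply by tau, then add the digit.
    A correct expansion of n round-trips to (n, 0)."""
    a0, a1 = 0, 0
    for u in reversed(digits):
        a0, a1 = -2 * a1 + u, a0 + mu * a1
    return a0, a1

d = tnaf(9)
print(d, eval_tnaf(d))   # [1, 0, 0, -1, 0, 1] round-trips to (9, 0)
```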

  • Independent Spanning Trees on Multidimensional Torus Networks

    Page(s): 93 - 102

    Two spanning trees rooted at vertex r in a graph G are called independent spanning trees (ISTs) if, for each vertex v in G with v ≠ r, the paths from v to r in the two trees are internally disjoint. If the connectivity of G is k, the IST problem is to construct k ISTs rooted at each vertex. The IST problem has found applications in fault-tolerant broadcasting, but it remains open for general graphs with connectivity greater than four. In this paper, we propose a very simple algorithm for solving the IST problem on multidimensional torus networks. In our algorithm, every vertex can determine its parent in a specific independent spanning tree depending only on its own label. Thus, our algorithm can also be implemented very easily in parallel or distributed systems.

  • Scalable Node-Level Computation Kernels for Parallel Exact Inference

    Page(s): 103 - 115

    In this paper, we investigate data parallelism in exact inference with respect to arbitrary junction trees. Exact inference is a key problem in exploring probabilistic graphical models, where the computational complexity increases dramatically with clique width and the number of states of the random variables. We study potential table representation and scalable algorithms for node-level primitives. Based on these node-level primitives, we propose computation kernels for evidence collection and evidence distribution. A data-parallel algorithm for exact inference is presented using the proposed computation kernels. We analyze the scalability of the node-level primitives, the computation kernels, and the exact inference algorithm using the coarse-grained multicomputer (CGM) model. According to the analysis, we achieve O(N d_C w_C ∏_{j=1}^{w_C} r_{C,j} / P) local computation time and O(N) global communication rounds using P processors, 1 ≤ P ≤ max_C ∏_{j=1}^{w_C} r_{C,j}, where N is the number of cliques in the junction tree; d_C is the clique degree; r_{C,j} is the number of states of the jth random variable in clique C; w_C is the clique width; and w_S is the separator width. We implemented the proposed algorithm on state-of-the-art clusters. Experimental results show that the proposed algorithm exhibits almost linear scalability over a wide range.

  • Integrating Evolutionary Computation with Abstraction Refinement for Model Checking

    Page(s): 116 - 126

    Model checking for large-scale systems is extremely difficult due to the state explosion problem. Creating useful abstractions for the model checking task is challenging and often involves many iterations of refinement. In this paper, we consider techniques for model checking within the counterexample-guided abstraction refinement framework. The state separation problem is one popular approach in counterexample-guided abstraction refinement, and it poses the main hurdle during the refinement process. To effectively minimize the separation set, we present a novel probabilistic learning approach based on sample learning, an evolutionary algorithm, and effective heuristics. We integrate it with the abstraction refinement framework in the VIS model checker. We include experimental results comparing our new approach to recently published techniques. The benchmark results show that our approach achieves an overall speedup of more than 56 percent over previous techniques. Our work is the first successful integration of evolutionary algorithms with abstraction refinement for model checking.

  • Predictive Temperature-Aware DVFS

    Page(s): 127 - 133

    In this paper, we propose predictive temperature-aware dynamic voltage and frequency scaling (DVFS) using the performance counters that are already embedded in commercial microprocessors. Using the performance counters and simple regression analysis, we can predict the localized temperature and efficiently scale the voltage and frequency. When localized thermal problems that were not detected by thermal sensors are found after layout (or fabrication), they can be avoided by the proposed software solution without delaying time-to-market. The evaluation results show that in a Linux-based laptop with an Intel Core 2 Duo processor, DVFS using the performance counters performs comparably to DVFS using the thermal sensor.

  • Proofs of Correctness and Properties of Integer Adder Circuits

    Page(s): 134 - 136

    Adder circuits have been extensively studied. Their formal properties are well known, but the proofs are either incomplete or difficult to find. This short contribution integrates all formal proofs related to adders in a single place and adds details where necessary. The presentation is accessible to the general VLSI designer. Another goal of this study is to gather relevant material for further formal studies in computer arithmetic. The presentation is kept as concise as possible.
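    The kind of correctness property such proofs establish can be illustrated executably: a bit-level ripple-carry adder, built from full adders, agrees with integer addition on every input of a given width. This exhaustive check is a sketch of the property being proved, not the paper's formal proofs.

```python
def full_adder(a, b, cin):
    # One-bit full adder: sum and carry-out from two bits plus carry-in.
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(x, y, width):
    """Ripple-carry adder: chain full adders from LSB to MSB.
    Returns (sum modulo 2**width, final carry-out)."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry

# The correctness property, checked exhaustively for 4-bit operands:
ok = all(ripple_add(x, y, 4) == ((x + y) % 16, (x + y) // 16)
         for x in range(16) for y in range(16))
print(ok)  # True: ripple-carry matches integer addition on all 4-bit inputs
```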

  • Comment on "Performability Analysis: A New Algorithm"

    Page(s): 137 - 138

    The paper "Performability Analysis: A New Algorithm" describes an algorithm for computing the complementary distribution of the accumulated reward over an interval of time in a homogeneous Markov process. In this comment, we show that in two particular cases, one of which is quite frequent, small modifications of the algorithm may significantly reduce its storage complexity.

  • 2009 Reviewers List

    Page(s): 139 - 143
    Freely Available from IEEE
  • Call for Papers: Special Section on Dependable Computer Architecture

    Page(s): 144
    Freely Available from IEEE
  • TC Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Albert Y. Zomaya
School of Information Technologies
Building J12
The University of Sydney
Sydney, NSW 2006, Australia
http://www.cs.usyd.edu.au/~zomaya
albert.zomaya@sydney.edu.au