Computers, IEEE Transactions on

Issue 7 • Date July 2009

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • Process-Variation-Aware Adaptive Cache Architecture and Management

    Page(s): 865 - 877

    Fabricating circuits that employ ever-smaller transistors leads to dramatic variations in critical process parameters. This in turn results in large variations in execution/access latencies of different hardware components. This situation is even more severe for memory components due to minimum-sized transistors used in their design. Current design methodologies that are tuned for the worst case scenarios are becoming increasingly pessimistic from the performance angle, and thus, may not be a viable option at all for future designs. This paper makes two contributions targeting on-chip data caches. First, it presents an adaptive cache management policy based on nonuniform cache access. Second, it proposes a latency compensation approach that employs several circuit-level techniques to change the access latency of select cache lines based on the criticalities of the load instructions that access them. Our experiments reveal that both these techniques can recover a significant amount of the performance lost due to worst case designs.

  • Efficient Software-Based Encoding and Decoding of BCH Codes

    Page(s): 878 - 889

    Error correction software for Bose-Chaudhuri-Hocquenghem (BCH) codes is optimized for general purpose processors that are not equipped with hardware for Galois field arithmetic. The developed software applies parallelization with a table lookup method to reduce the number of iterations, and maximum parallelization under a cache size limitation is sought for a high throughput implementation. Since this method minimizes the number of lookup tables for encoding and decoding processes, a large parallel factor can be chosen for a given cache size. The native word length of a general purpose CPU is used as a whole by employing the developed mask elimination method. The tradeoff of the algorithm complexity and the regularity is examined for several syndrome generation methods, which leads to a simple error detection scheme that reuses the encoder and a simplified syndrome generation method requiring only a small number of Galois field multiplications. The parallel factor for Chien search is greatly increased by transforming the error locator polynomial so that it contains symmetric exponents of positive and negative signs. The experimental results demonstrate that the developed software can not only provide sufficient throughput for real-time error correction of NAND flash memory in embedded systems but also enhance the reliability of file systems in general purpose computers.

  • A Floating-Point Unit for 4D Vector Inner Product with Reduced Latency

    Page(s): 890 - 901

    This paper presents the algorithm and implementation of a new high-performance functional unit for floating-point four-dimensional vector inner product (4D dot product; DP4), which is most frequently performed in 3D graphics applications. The proposed IEEE-compliant DP4 unit computes Z = AB + CD + EF + GH in one path and keeps the intermediate rounding by IEEE-754 rounding to nearest even. The intermediate rounding is merged with shift alignment, and intermediate carry-propagated addition and normalization are omitted to reduce latency in the proposed architecture. The proposed DP4 unit is implemented with 0.18-μm CMOS technology and has 12.8-ns critical path delay, which is reduced by 45.5 percent compared to a previous DP4 implementation using discrete multipliers and adders. The proposed DP4 unit also reduces the cycle time of 3D graphics applications by 12.4 percent on average compared to the usual 3D graphics FPU based on four-way multiply-add-fused units.

  • Decimal Floating-Point Multiplication

    Page(s): 902 - 916

    Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. This paper presents the design of two decimal floating-point multipliers: one whose partial product accumulation strategy employs decimal carry-save addition and one that employs binary carry-save addition. The multiplier based on decimal carry-save addition favors a nonpipelined iterative implementation. The multiplier utilizing binary carry-save addition allows for an efficient pipelined implementation when latency and throughput are considered more important than area. Both designs comply with specifications for decimal multiplication given in the IEEE 754 standard for floating-point arithmetic (IEEE 754-2008). The multipliers extend previously published decimal fixed-point multipliers by adding several features, including exponent generation, sticky bit generation, shifting of the intermediate product, rounding, and exception detection and handling. Novel features of the multipliers include support for decimal floating-point numbers, on-the-fly generation of the sticky bit in the iterative design, early estimation of the shift amount, and efficient decimal rounding. Iterative and parallel decimal fixed-point and floating-point multipliers are compared in terms of their area, delay, latency, and throughput based on verified Verilog register-transfer-level models.

  • High-Performance Hardware Architectures for Galois Counter Mode

    Page(s): 917 - 930

    Various high-performance hardware architectures for Galois counter mode (GCM) in conjunction with various advanced encryption standard (AES) circuits and multiplier-adders are proposed. A total of 17 GCM-AES circuits were synthesized by using a 130-nm CMOS standard cell library, and the trade-offs between speed and hardware resources were evaluated. Our flexible architectures achieved a wide variety of performances from compact (2.56 Gbps with 34.5 Kgates) to high speed (62.6 Gbps with 979.3 Kgates). All of our architectures support key sizes of 128, 192, and 256 bits, while only one previous approach does. Even with variable-length key support, our architecture also achieved the highest hardware efficiency (defined as throughput per gate) among the designs using the same generation of process technology.

  • A Process Algebraic View of Latency-Insensitive Systems

    Page(s): 931 - 944

    Latency-insensitive (LI) systems are those that can function correctly in spite of delays along their connecting wires. This delay is assumed to be a multiple of the clock period. The paper presents a single-clock process algebraic model for such systems. It gives the definitions for LI computational blocks and LI connectors. Important properties for these are shown to be satisfied. Composition of such modules can be done by the parallel composition operator of the process algebra. Conditions are given to check for liveness and deadlock freedom of LI systems. Comparison of latency equivalence between streams of events can be done using the model and this leads to a method of proving latency-equivalent modules. The paper is a step toward high-level specification and verification of such systems. The work can be extended to address more complex interconnections by modeling the underlying finite-state machines.

  • A Homogeneous Architecture for Power Policy Integration in Operating Systems

    Page(s): 945 - 955

    A significant volume of research has concentrated on operating system (OS)-directed power management. The primary focus of previous research has been the development of better policies. In this paper, we provide evidence that one policy may outperform another under different conditions. Hence, it is difficult, or even impossible, to design the "best" policy for all computers. We explain how to select the best policies at runtime without user or administrator intervention by using a software framework called the homogeneous architecture for power policy integration (HAPPI). This architecture is portable across different platforms running Linux. HAPPI specifies common requirements for policies and provides an interface to simplify the implementation of policies in a commodity OS. Our approach allows these policies to be compared simultaneously to select the best policy among a set of distinct policies at runtime. Experimental results indicate that HAPPI achieves energy savings within 4 percent of the best individual policy for each device in several computing systems without a priori knowledge of workloads.

  • Frame-Based Packet-Mode Scheduling for Input-Queued Switches

    Page(s): 956 - 969

    Most packet scheduling algorithms for input-queued switches operate on fixed-sized packets known as cells. In reality, communication traffic in many systems such as the Internet runs on variable-sized packets. Motivated by potential savings of segmentation and reassembly, there has been increasing interest in scheduling variable-sized packets in a nonpreemptive manner known as packet-mode scheduling. This paper studies frame-based packet-mode scheduling for better scalability. It first shows that the admissible condition is no longer sufficient for packet-mode scheduling. Then, a relation between the frame size and packet sizes is derived that classifies under what conditions the packet-mode scheduling problem is polynomially solvable or is NP-hard. This relation reveals an interesting result that under various packet size distributions, it may be polynomially solvable even if many different packet sizes occur in the packet set, whereas it may be NP-hard with just two packet sizes present. Finally, as a practical solution, this paper studies how a speedup can help packet-mode scheduling. It is shown that the admissible condition becomes sufficient also when a speedup of two is used. A simple algorithm with a speedup of two is presented.

  • Collusive Piracy Prevention in P2P Content Delivery Networks

    Page(s): 970 - 983

    Collusive piracy is the main source of intellectual property violations within the boundary of a P2P network. Paid clients (colluders) may illegally share copyrighted content files with unpaid clients (pirates). Such online piracy has hindered the use of open P2P networks for commercial content delivery. We propose a proactive content poisoning scheme to stop colluders and pirates from alleged copyright infringements in P2P file sharing. The basic idea is to detect pirates in a timely manner with identity-based signatures and time-stamped tokens. The scheme stops collusive piracy without hurting legitimate P2P clients by targeting poisoning exclusively on detected violators. We developed a new peer authorization protocol (PAP) to distinguish pirates from legitimate clients. Detected pirates will receive poisoned chunks in their repeated attempts. Pirates are thus severely penalized with no chance to download successfully in tolerable time. Based on simulation results, we find a 99.9 percent prevention rate in Gnutella, KaZaA, and Freenet. We achieved an 85-98 percent prevention rate on eMule, eDonkey, Morpheus, etc. The scheme is shown to be less effective in protecting some poison-resilient networks like BitTorrent and Azureus. Our work opens up the low-cost P2P technology for copyrighted content delivery. The advantage lies mainly in minimum delivery cost, higher content availability, and copyright compliance in exploring P2P network resources.

  • Hardware Architecture for High-Performance Regular Expression Matching

    Page(s): 984 - 993

    This paper presents a bitmap-based hardware architecture for the Glushkov nondeterministic finite automaton (G-NFA), which recognizes a given regular expression. We show that the inductions of the functions needed to construct the G-NFA can be generalized to include other special symbols commonly used in extended regular expressions such as the POSIX 1003.2 format. Our proposed implementation can detect the ending positions of all substrings of an input string T, which start at arbitrary positions of T and belong to the language defined by the given regular expression. To achieve high performance, the implementation is generalized to the NFA, which processes K symbols in each operation cycle. We provide an efficient solution for the boundary condition when the length of the input string is not an integral multiple of K. Compared with previous designs, our proposed architecture is more flexible and programmable because the pattern matching engine uses memory rather than logic.

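    As a rough software analogue of bitmap-driven NFA simulation, the sketch below uses the classic Shift-And technique to report the ending positions of all matches of a literal pattern. This is an illustrative assumption, not the paper's architecture: it handles only a plain string (the simplest Glushkov automaton), processes one symbol per step rather than K, and the function name is hypothetical.

```python
def shift_and_end_positions(pattern, text):
    """Return the end index of every occurrence of `pattern` in `text`.

    Each NFA state is one bit of an integer; the per-symbol bitmaps in
    `masks` play the role of the transition memory.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")

    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (len(pattern) - 1)  # bit of the final state
    state = 0
    ends = []
    for pos, ch in enumerate(text):
        # Advance every active state by one position, restart a match
        # at this position (the "| 1"), then keep only states whose
        # expected symbol equals the current input symbol.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and_end_positions("aba", "ababa"))
```

    Because a new match attempt starts at every input position, overlapping occurrences are reported as well, which mirrors the "start at arbitrary positions of T" behavior described in the abstract.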
  • Accurate Floating-Point Product and Exponentiation

    Page(s): 994 - 1000

    Several different techniques and software packages aim to improve the accuracy of results computed in a fixed finite precision. Here, we focus on a method to improve the accuracy of the product of floating-point numbers. We show that the computed result is as accurate as if computed in twice the working precision. The algorithm is simple since it only requires addition, subtraction, and multiplication of floating-point numbers in the same working precision as the given data. Such an algorithm can be useful, for example, to compute the determinant of a triangular matrix and to evaluate a polynomial when represented by the root product form. It can also be used to compute the integer power of a floating-point number.

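    A compensated product of the kind the abstract describes can be sketched with the well-known Veltkamp/Dekker error-free transformation, which uses only addition, subtraction, and multiplication in working precision. Whether this matches the paper's algorithm in every detail is an assumption; the function names are illustrative, and the code assumes IEEE 754 double precision with round-to-nearest and no overflow or underflow.

```python
from fractions import Fraction


def split(a):
    """Veltkamp split: return (hi, lo) with a == hi + lo exactly."""
    c = 134217729.0 * a  # 2**27 + 1, the splitting constant for doubles
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_product(a, b):
    """Dekker's TwoProduct: p = fl(a*b) and p + e == a*b exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, e


def comp_prod(xs):
    """Product of xs with a running first-order rounding-error correction."""
    p = xs[0]
    e = 0.0
    for x in xs[1:]:
        p, err = two_product(p, x)
        e = e * x + err  # propagate the old error and add the new one
    return p + e


def exact_product(xs):
    """Reference value computed in exact rational arithmetic."""
    result = Fraction(1)
    for x in xs:
        result *= Fraction(x)
    return result


xs = [1.1, 1.2, 1.3, 1.7, 0.9, 3.3]
print(comp_prod(xs), float(exact_product(xs)))
```

    An integer power, as mentioned at the end of the abstract, can be obtained from this sketch by passing the same value n times, for example `comp_prod([x] * n)`.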
  • Efficient Bit-Parallel GF(2^m) Multiplier for a Large Class of Irreducible Pentanomials

    Page(s): 1001 - 1008

    This work studies efficient bit-parallel multiplication in GF(2^m) for irreducible pentanomials, based on the so-called shifted polynomial bases (SPBs). We derive a closed expression of the reduced SPB product for a class of polynomials $x^m + x^{k_s} + x^{k_{s-1}} + \ldots + x^{k_1} + 1$, with $k_s - k_1 \le (m+1)/2$. Then, we apply the above formulation to the case of pentanomials. The resulting multiplier outperforms, or is as efficient as, the best proposals in the technical literature, but it is suitable for a much larger class of pentanomials than those studied so far. Unlike previous works, this property enables the choice of pentanomials optimizing different field operations (for example, inversion), yet preserving an optimal implementation of field multiplication, as discussed and quantitatively proved in the last part of the paper.

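    For readers unfamiliar with the operation being optimized, the sketch below multiplies two field elements in GF(2^m) modulo an irreducible pentanomial using plain shift-and-add in the ordinary polynomial basis. It is not the paper's bit-parallel SPB multiplier; the function name and defaults are assumptions chosen for illustration. The default modulus is the AES pentanomial x^8 + x^4 + x^3 + x + 1 (0x11B), used here only because its products are easy to check.

```python
def gf2m_mul(a, b, m=8, poly=0x11B):
    """Multiply a and b in GF(2^m), reducing modulo `poly`.

    Elements are integers below 2**m whose bits are polynomial
    coefficients; `poly` includes the leading x^m term.
    """
    if not (0 <= a < (1 << m) and 0 <= b < (1 << m)):
        raise ValueError("operands must be field elements below 2**m")

    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a      # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if a & (1 << m):
            a ^= poly        # reduce when the degree reaches m
    return result


print(hex(gf2m_mul(0x57, 0x83)))
```

    A hardware bit-parallel multiplier computes all result bits combinationally in one step, whereas this loop takes m iterations; the arithmetic being performed is the same.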
  • TC Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field.

Meet Our Editors

Editor-in-Chief
Albert Y. Zomaya
School of Information Technologies
Building J12
The University of Sydney
Sydney, NSW 2006, Australia
http://www.cs.usyd.edu.au/~zomaya
albert.zomaya@sydney.edu.au