
Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA 2008)

Popular Articles (November 2014)

Includes the top 50 most frequently downloaded documents for this publication according to the most recent monthly usage statistics.
  • 1. Massively parallelized Quasi-Monte Carlo financial simulation on a FPGA supercomputer

    Page(s): 1 - 8

    Quasi-Monte Carlo simulation is a specialized Monte Carlo method that uses quasi-random, or low-discrepancy, numbers as the stochastic parameters. In many applications it has proved advantageous over traditional Monte Carlo simulation, which uses pseudo-random numbers, as it converges faster and to a better level of accuracy. We implemented a massively parallelized quasi-Monte Carlo simulation engine on an FPGA-based supercomputer called Maxwell, developed at the University of Edinburgh. Maxwell consists of 32 IBM blades, each hosting an Intel Xeon processor and two Virtex-4 FPGA nodes attached through a PCI-X interface. The hardware implementation of our FPGA-based quasi-Monte Carlo engine on the Maxwell machine outperforms equivalent software implementations running on the Xeon processors by three orders of magnitude, with the speed-up scaling linearly with the number of processing nodes. The paper presents the detailed design and implementation of our quasi-Monte Carlo engine in the context of financial derivatives pricing.

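    To illustrate the method itself (not the paper's hardware design), a minimal Python sketch follows: it prices a European call with quasi-Monte Carlo using a base-2 van der Corput/Halton low-discrepancy sequence and compares it against ordinary pseudo-random Monte Carlo. All parameter values and function names are illustrative, not taken from the paper.

        import math, random
        from statistics import NormalDist

        def halton(i, base=2):
            # i-th point of the base-2 van der Corput (Halton) sequence in (0, 1)
            f, r = 1.0, 0.0
            while i > 0:
                f /= base
                r += f * (i % base)
                i //= base
            return r

        def price_call(uniforms, s0=100.0, k=105.0, r=0.05, sigma=0.2, t=1.0):
            # discounted mean payoff of a European call under geometric Brownian motion
            inv = NormalDist().inv_cdf
            payoffs = []
            for u in uniforms:
                z = inv(u)  # map a uniform draw to a standard normal variate
                st = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)
                payoffs.append(max(st - k, 0.0))
            return math.exp(-r * t) * sum(payoffs) / len(payoffs)

        n = 1 << 14
        qmc = price_call(halton(i + 1) for i in range(n))    # low-discrepancy draws
        mc = price_call(random.random() for _ in range(n))   # pseudo-random draws
        print(f"quasi-MC estimate: {qmc:.4f}   pseudo-MC estimate: {mc:.4f}")

    With the same sample budget, the quasi-random estimate typically sits closer to the analytic value, which is the convergence advantage the abstract refers to.
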
  • 2. Virtualizing and sharing reconfigurable resources in High-Performance Reconfigurable Computing systems

    Page(s): 1 - 8

    High-performance reconfigurable computers (HPRCs) are parallel computers augmented with FPGA chips. Examples of such systems are the Cray XT5h and Cray XD1, the SRC-7 and SRC-6, and the SGI Altix/RASC. The execution of parallel applications on HPRCs mainly follows the single-program multiple-data (SPMD) model, as is largely the case in traditional high-performance computers (HPCs), and the prevailing usage of FPGAs in such systems has been as co-processors. The overall system resources, however, are often underutilized because of the asymmetric distribution of the reconfigurable processors relative to the conventional processors, and this asymmetry is a challenge for using the SPMD programming model on these systems. In this work, we propose a resource virtualization solution based on partial run-time reconfiguration (PRTR), which allows the reconfigurable processors to be shared among the microprocessors. We present our virtualization infrastructure together with an analytical investigation and verify the proposed concepts with experimental implementations using the Cray XD1 as a testbed. The approach proves promising, allowing full exploitation of the system resources with fair sharing of the reconfigurable processors among the microprocessors. It is general and can be applied to any of the available HPRC systems.

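    As a toy model of the sharing idea only (the paper's actual mechanism is partial run-time reconfiguration on the Cray XD1), the Python sketch below dispatches hardware-function requests from several microprocessors onto a smaller pool of FPGAs and shows that each processor receives a comparable share of service. All counts and names are invented for illustration.

        from collections import deque

        NUM_CPUS, NUM_FPGAS = 8, 2            # asymmetric node ratio, illustrative only

        # each CPU submits hardware-function requests (cpu_id, job_id, duration)
        requests = deque((cpu, job, 1 + (cpu + job) % 3)
                         for job in range(3) for cpu in range(NUM_CPUS))

        fpga_free_at = [0] * NUM_FPGAS        # next free time of each FPGA
        cpu_service = [0] * NUM_CPUS          # total service time granted per CPU

        while requests:
            cpu, job, duration = requests.popleft()
            fpga = min(range(NUM_FPGAS), key=lambda f: fpga_free_at[f])  # earliest-free FPGA
            start = fpga_free_at[fpga]
            fpga_free_at[fpga] = start + duration
            cpu_service[cpu] += duration
            print(f"CPU {cpu}, job {job} -> FPGA {fpga}, t={start}..{start + duration}")

        print("service time per CPU:", cpu_service)   # roughly even, i.e. fair sharing
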
  • 3. Performance potential of molecular dynamics simulations on high performance reconfigurable computing systems

    Page(s): 1 - 10

    The acceleration of molecular dynamics (MD) simulations using high-performance reconfigurable computing (HPRC) has been much studied. Given the intense competition from multicore processors and from other types of accelerators, there is now a question of whether MD on HPRC can remain competitive. We concentrate here on the MD kernel computation, determining the force between short-range particle pairs, and examine it in detail to find the performance limits under current technology and methods. We systematically explore the design space of the force pipeline with respect to arithmetic algorithm, arithmetic mode, precision, and various other optimizations; we examine simplifications that are possible if the end user is willing to trade off simulation quality for performance; and we use the new Altera floating-point cores and compiler to further optimize the designs. We find that for the Stratix III, with the best (as yet unoptimized) single-precision designs, 11 pipelines running at 250 MHz can fit on the FPGA. If a significant fraction of this potential performance can be maintained in a full implementation, then HPRC MD should be highly competitive.

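    The kernel in question is the short-range pairwise force evaluation. As a point of reference only (the paper's contribution is the FPGA force pipeline, not this loop), a minimal software version with an assumed Lennard-Jones 12-6 potential and cutoff radius looks like this:

        import math

        def pair_forces(positions, cutoff=2.5, epsilon=1.0, sigma=1.0):
            # accumulate Lennard-Jones forces for all particle pairs within the cutoff
            n = len(positions)
            forces = [[0.0, 0.0, 0.0] for _ in range(n)]
            cutoff2 = cutoff * cutoff
            for i in range(n):
                for j in range(i + 1, n):
                    dx = [positions[i][d] - positions[j][d] for d in range(3)]
                    r2 = sum(c * c for c in dx)
                    if r2 >= cutoff2 or r2 == 0.0:
                        continue                      # outside the short-range cutoff
                    sr6 = ((sigma * sigma) / r2) ** 3
                    # (dU/dr)/r for the 12-6 potential, applied along the pair vector
                    f_over_r = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
                    for d in range(3):
                        forces[i][d] += f_over_r * dx[d]
                        forces[j][d] -= f_over_r * dx[d]
            return forces

        print(pair_forces([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.0, 1.1, 0.0]]))

    The design-space choices the paper explores (arithmetic algorithm, arithmetic mode, precision) all live inside the body of this inner loop.
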
  • 4. Hardware task scheduling optimizations for reconfigurable computing

    Page(s): 1 - 10

    Reconfigurable computers (RCs) can provide significant performance improvements for domain applications. However, wide acceptance of today's RCs among domain scientists is hindered by the complexity of design tools and the required hardware design experience. Recent developments in hardware/software co-design methodologies for these systems provide ease of use, but they are not comparable in performance to manual co-design. This paper aims at improving the overall performance of hardware tasks assigned to the FPGA. In particular, analysis of inter-task communication and of data dependencies among tasks is used to reduce the number of configurations and to minimize communication overhead and task processing time. This work leverages algorithms developed in the RC and reconfigurable hardware (RH) domains for efficient use of hardware resources and proposes two algorithms, weight-based scheduling (WBS) and highest priority first-next fit (HPF-NF). Traditional resource-based scheduling alone, however, is not sufficient to remove the performance bottleneck, so a more comprehensive algorithm is necessary: the reduced data movement scheduling (RDMS) algorithm is proposed to address dependency analysis and inter-task communication optimization. Simulation shows that, compared to WBS and HPF-NF, RDMS reduces the number of FPGA configurations needed to schedule randomly generated graphs with heavyweight nodes by 30% and 11%, respectively. Additionally, a proof-of-concept implementation of a complex 13-node task graph on the SGI RC100 reconfigurable computer shows that RDMS not only trims the number of necessary configurations from 6 to 4 but also reduces communication overhead by 48% and hardware processing time by 33%.

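    As a rough illustration of the resource-based scheduling baseline (a sketch in the spirit of highest priority first-next fit, not the RDMS algorithm itself), the snippet below packs hardware tasks into as few FPGA configurations as an assumed area budget allows; all task names, priorities, and sizes are invented.

        def hpf_next_fit(tasks, capacity):
            # pack (name, priority, area) tasks into configurations, highest priority first,
            # placing each task in the current configuration if it fits, else opening a new one
            configs, current, used = [], [], 0
            for name, priority, area in sorted(tasks, key=lambda t: -t[1]):
                if used + area > capacity:    # next-fit: never revisit earlier configurations
                    configs.append(current)
                    current, used = [], 0
                current.append(name)
                used += area
            if current:
                configs.append(current)
            return configs

        tasks = [("fft", 5, 40), ("filter", 3, 25), ("matmul", 4, 50),
                 ("crc", 1, 10), ("sort", 2, 30)]
        for i, cfg in enumerate(hpf_next_fit(tasks, capacity=80)):
            print(f"configuration {i}: {cfg}")

    RDMS goes further by also weighing data dependencies and inter-task communication when deciding which tasks share a configuration, which is where the reported reductions in configurations and communication overhead come from.
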
  • 5. Scalable FPGA-array for high-performance and power-efficient computation based on difference schemes

    Page(s): 1 - 9

    For numerical computations requiring a relatively high ratio of data access to operations, the scalability of memory bandwidth is key to performance improvement. In this paper, we propose a scalable FPGA-array to realize custom computing machines for high-performance and power-efficient scientific simulations based on difference schemes. On the FPGA-array, we construct a systolic computational-memory array (SCMA) by homogeneously partitioning it among multiple tightly coupled FPGAs. A large SCMA implemented across many FPGAs achieves high-performance computation with memory bandwidth and arithmetic performance that scale with the array size. For a feasibility demonstration and quantitative evaluation, we design and implement an SCMA of 192 processing elements spanning two Altera Stratix II FPGAs. The implemented SCMA running at 106 MHz achieves sustained performance of 32.8 to 36.5 GFlops in single precision for three benchmark computations, against a peak performance of 40.7 GFlops. In comparison with a 3.4 GHz Pentium 4 processor, the SCMAs consume 70% to 87% of the power and only 3% to 7% of the energy for the same computations. Based on a requirement model for inter-FPGA bandwidth, we show that SCMAs are completely scalable across currently available high-end to low-end FPGAs, and the two-FPGA SCMA demonstrates double the performance of the single-FPGA SCMA.

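    The computations the SCMA targets are difference schemes, where every grid point repeatedly combines its nearest neighbours; in software one timestep reduces to a stencil sweep such as this explicit 2-D heat-equation update (grid size and coefficient are illustrative, and this is a reference model, not the SCMA design):

        import numpy as np

        def heat_step(u, alpha=0.1):
            # one explicit finite-difference timestep of the 2-D heat equation,
            # i.e. the nearest-neighbour stencil each processing element would evaluate
            v = u.copy()
            v[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
                u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:] - 4.0 * u[1:-1, 1:-1]
            )
            return v

        u = np.zeros((64, 64))
        u[32, 32] = 100.0                     # point heat source
        for _ in range(100):
            u = heat_step(u)
        print("centre value after 100 steps:", round(float(u[32, 32]), 3))

    Because each update reads several neighbouring values per arithmetic operation, memory bandwidth rather than arithmetic throughput limits performance, which is the property the computational-memory array is built to scale.
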
  • 6. Evaluating FPGAs for floating-point performance

    Page(s): 1 - 6

    Field-programmable gate arrays (FPGAs) have been available for more than 25 years. Initially they were used to simplify embedded processing circuits, then expanded into simulating application-specific integrated circuit (ASIC) designs. In the past few years they have grown in density and speed to replace ASICs in some applications and to assist microprocessors as attached accelerators. This paper calculates the peak floating-point performance of three types of FPGAs using 64-bit, 32-bit, and 24-bit word lengths and compares it with a reference quad-core microprocessor. These calculations are further refined to estimate the actual floating-point performance of the FPGAs, compared with the microprocessor both at and away from its optimal design point. Lastly, the paper explores the nature of floating-point calculations and looks at examples where the same algorithmic accuracy can be achieved with non-floating-point calculations.

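    The peak-performance side of such an evaluation reduces to simple arithmetic: the number of floating-point units that fit on the device, times operations per unit per cycle, times clock rate. A back-of-envelope sketch follows; every device figure in it is a placeholder, not a number from the paper.

        def peak_gflops(fp_units, ops_per_unit_per_cycle, clock_mhz):
            # peak floating-point rate = units x ops/cycle x clock
            return fp_units * ops_per_unit_per_cycle * clock_mhz / 1e3

        # hypothetical FPGA: single-precision adders/multipliers that fit on the fabric,
        # each completing one operation per cycle at the achievable clock
        fpga = peak_gflops(fp_units=200, ops_per_unit_per_cycle=1, clock_mhz=250)

        # hypothetical quad-core CPU: 4 cores x 4-wide SIMD, one add and one multiply
        # per lane per cycle, at 2.5 GHz
        cpu = peak_gflops(fp_units=4 * 4, ops_per_unit_per_cycle=2, clock_mhz=2500)

        print(f"FPGA peak ~{fpga:.0f} GFLOPS, CPU peak ~{cpu:.0f} GFLOPS")

    The paper then refines such peak figures toward actual achievable performance and compares the FPGA against the microprocessor both at and away from the processor's optimal design point.
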
  • 7. Floating point based Cellular Automata simulations using a dual FPGA-enabled system

    Page(s): 1 - 8

    With the recent emergence of multicore architectures, the age of multicore computing may have already dawned, a shift that pushes the von Neumann architecture toward a parallel processing paradigm. Cellular automata, the inherently decentralized, spatially extended systems consisting of large numbers of simple, identical components with local connectivity also proposed by von Neumann in the 1950s, are a potential candidate among the parallel processing alternatives. The spatial parallelism available on field-programmable gate arrays makes them an ideal platform to investigate cellular automata as a parallel processing paradigm for multicore architectures. The authors have been experimenting with this idea for some time and report their progress from a single-FPGA to a dual-FPGA cellular automata accelerator implementation. For a D2Q9 Lattice Boltzmann method implementation, we achieved an overall speed-up of 2.3 by moving our Fortran implementation to our single-FPGA implementation, and a further speed-up of close to 1.8 with the dual-FPGA implementation relative to the single-FPGA one.

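    For concreteness, one D2Q9 Lattice Boltzmann timestep is a local BGK collision followed by streaming of the nine distribution values to neighbouring cells; the NumPy sketch below shows one periodic update step as a software reference (it is not the authors' Fortran or FPGA code, and the relaxation time and grid are illustrative).

        import numpy as np

        # D2Q9 lattice: velocity set and weights
        E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
                      [1, 1], [-1, 1], [-1, -1], [1, -1]])
        W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

        def lbm_step(f, tau=0.6):
            # one D2Q9 BGK collision + periodic streaming step; f has shape (9, ny, nx)
            rho = f.sum(axis=0)                                   # macroscopic density
            ux = (f * E[:, 0, None, None]).sum(axis=0) / rho      # macroscopic velocity
            uy = (f * E[:, 1, None, None]).sum(axis=0) / rho
            usq = ux * ux + uy * uy
            for i in range(9):
                eu = E[i, 0] * ux + E[i, 1] * uy
                feq = W[i] * rho * (1 + 3 * eu + 4.5 * eu * eu - 1.5 * usq)
                f[i] += (feq - f[i]) / tau                        # BGK collision
            for i in range(9):                                    # streaming to neighbours
                f[i] = np.roll(np.roll(f[i], E[i, 0], axis=1), E[i, 1], axis=0)
            return f

        ny, nx = 32, 32
        f = np.tile(W[:, None, None], (1, ny, nx))                # uniform fluid at rest
        f[1, ny // 2, nx // 2] += 0.01                            # small perturbation
        for _ in range(100):
            f = lbm_step(f)
        print("total mass after 100 steps:", round(float(f.sum()), 6))

    Every cell performs the same local update using only neighbour data, which is why the method maps naturally onto the spatial parallelism of an FPGA fabric.
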
  • 8. Implementing phase unwrapping using Field Programmable Gate Arrays or Graphics Processing Units: A comparison

    Page(s): 1 - 10

    Phase unwrapping is the process of converting discontinuous phase data into a continuous image. This procedure is required by any imaging technology that uses phase data, such as MRI, SAR, or OQM microscopy. Such algorithms often take a significant amount of time to run on a general-purpose computer, making it difficult to process large quantities of data. This paper compares implementations of a specific phase unwrapping algorithm, minimum Lp-norm unwrapping, on a field-programmable gate array (FPGA) and on a graphics processing unit (GPU) for the purpose of acceleration. The computation involves a matrix preconditioner based on the discrete cosine transform (DCT) and a conjugate gradient calculation, along with a few other matrix operations. These functions are partitioned to run on the host or the accelerator depending on the capabilities of the accelerator. The trade-offs between the two platforms are analyzed and compared to a general-purpose processor (GPP) in terms of performance, power, and cost.

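    For context, the unweighted least-squares core of this family of methods, which the minimum Lp-norm approach uses via the DCT as the preconditioning step inside its conjugate-gradient loop, can be sketched in a few lines. This is a generic textbook formulation using SciPy's DCT routines, not the paper's FPGA or GPU implementation.

        import numpy as np
        from scipy.fft import dctn, idctn

        def wrap(a):
            # wrap phase values into (-pi, pi]
            return np.angle(np.exp(1j * a))

        def unwrap_least_squares(psi):
            # unweighted least-squares phase unwrapping via a DCT-based Poisson solve
            m, n = psi.shape
            dx = np.zeros_like(psi); dx[:, :-1] = wrap(np.diff(psi, axis=1))
            dy = np.zeros_like(psi); dy[:-1, :] = wrap(np.diff(psi, axis=0))
            dxm = np.zeros_like(dx); dxm[:, 1:] = dx[:, :-1]
            dym = np.zeros_like(dy); dym[1:, :] = dy[:-1, :]
            rho = (dx - dxm) + (dy - dym)          # divergence of the wrapped gradients
            rho_hat = dctn(rho, type=2, norm='ortho')
            i, j = np.meshgrid(np.arange(m), np.arange(n), indexing='ij')
            denom = 2.0 * (np.cos(np.pi * i / m) + np.cos(np.pi * j / n) - 2.0)
            denom[0, 0] = 1.0                      # avoid division by zero; DC term is arbitrary
            phi_hat = rho_hat / denom
            phi_hat[0, 0] = 0.0
            return idctn(phi_hat, type=2, norm='ortho')

        true_phase = np.fromfunction(lambda i, j: 0.2 * i + 0.1 * j, (64, 64))
        recovered = unwrap_least_squares(wrap(true_phase))
        print("deviation from a constant offset:", round(float(np.ptp(recovered - true_phase)), 6))

    The DCT solve and the repeated matrix-vector work in the conjugate-gradient iterations are the pieces the paper partitions between the host and the FPGA or GPU accelerator.
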
  • 9. Architecture of a vertically stacked reconfigurable computer

    Page(s): 1 - 6

    This paper describes a scalable reconfigurable computing system built with a silicon circuit board (SiCB) design approach, in which unpackaged FPGA and memory die are attached to wafer-scale silicon substrates. The resulting reconfigurable computing system offers substantial improvements in density, cost, power consumption, and performance relative to other known integration or construction methods. A description of the SiCB technology and its detailed advantages is provided, along with the reconfigurable computer architecture that results from the use of SiCBs.

  • 10. MPI as an abstraction for software-hardware interaction for HPRCs

    Page(s): 1 - 10

    High-performance reconfigurable computers (HPRCs) consist of one or more standard microprocessors tightly coupled with one or more reconfigurable FPGAs. HPRCs have been shown to provide good speedups and good cost/performance ratios, but not necessarily ease of use, leading to slow acceptance of this technology. HPRCs introduce new design challenges, such as the lack of portability across platforms, incompatibilities with legacy code, user reluctance to change existing code bases, a prolonged learning curve, and the need for a system-level hardware/software co-design development flow. This paper presents the evolution of and current work on TMD-MPI, which started as an MPI-based programming model for multiprocessor systems-on-chip implemented in FPGAs and has since evolved to include multiple x86 processors. TMD-MPI is shown to address current design challenges in HPRC usage, suggesting that the MPI standard has enough syntax and semantics to program these new types of parallel architectures. Also presented is the TMD-MPI ecosystem, the research projects and tools developed around TMD-MPI to further improve HPRC usability.

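    To make the abstraction concrete: under a message-passing model of this kind, an FPGA compute engine is addressed like any other rank, so host code hands it work and collects results with ordinary point-to-point calls. The sketch below uses Python with mpi4py and lets rank 1 merely stand in for the hardware engine; TMD-MPI itself targets C software and hardware ranks, so everything here is purely illustrative.

        # run with: mpiexec -n 2 python mpi_offload_sketch.py
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        HOST, ENGINE = 0, 1          # rank 1 plays the role of the FPGA compute engine

        if rank == HOST:
            data = np.arange(8, dtype=np.float64)
            comm.Send(data, dest=ENGINE, tag=0)        # ship operands to the "engine"
            result = np.empty_like(data)
            comm.Recv(result, source=ENGINE, tag=1)    # collect results
            print("host received:", result)
        elif rank == ENGINE:
            buf = np.empty(8, dtype=np.float64)
            comm.Recv(buf, source=HOST, tag=0)
            comm.Send(buf * buf, dest=HOST, tag=1)     # stand-in for the hardware kernel

    Because the host never names a device driver or a bus, the same send/receive pair works whether the peer rank is a processor or a hardware engine, which is essentially the portability argument the paper makes.
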