
2011 IEEE 9th Symposium on Application Specific Processors (SASP)

Date: 5-6 June 2011


Displaying results 1-23 of 23
  • [Title page]

    Publication Year: 2011 , Page(s): i - vii
    Freely Available from IEEE
  • How sensitive is processor customization to the workload's input datasets?

    Publication Year: 2011 , Page(s): 1 - 7
    Cited by:  Papers (1)

    Hardware customization is an effective approach for meeting application performance requirements while achieving high levels of energy efficiency. Application-specific processors achieve high performance at low energy by tailoring their designs towards a specific workload, i.e., an application or application domain of interest. A fundamental question that has so far remained unanswered, however, is to what extent processor customization is sensitive to the training workload's input datasets. Current practice is to consider a single or only a few input datasets per workload during the processor design cycle - the reason being that simulation is so time-consuming that it precludes considering a large number of datasets. This paper addresses this fundamental question for the first time. In order to perform the large number of runs required to address this question in a reasonable amount of time, we first propose a mechanistic analytical model, built from first principles, that is accurate within 3.6% on average across a broad design space. The analytical model is at least 4 orders of magnitude faster than detailed cycle-accurate simulation for design space exploration. Using the model, we study the sensitivity of the optimum customized processor architecture to the workload's input dataset. Considering MiBench benchmarks and 1000 datasets per benchmark, we conclude that processor customization is largely dataset-insensitive. This has an important implication in practice: a single or only a few datasets are sufficient for determining the optimum processor architecture when designing application-specific processors.

  • TARCAD: A template architecture for reconfigurable accelerator designs

    Publication Year: 2011 , Page(s): 8 - 15

    In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built using reconfigurable fabric, such as FPGAs, have tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes portability and reusability across designs difficult. In addition, generation of highly customized circuits does not integrate nicely with high-level synthesis tools. In this work, we introduce TARCAD, a template architecture for designing reconfigurable accelerators. TARCAD enables high customization in the data management and compute engines while retaining a programming model based on generic programming principles. The template provides generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels: MxM, Acoustic Wave Equation, FFT, SpMV and Smith-Waterman. TARCAD is compared with other High Level Synthesis models and is evaluated against GPUs, a well-known architecture that is far less customizable and, therefore, also easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.

  • Customized MPSoC synthesis for task sequence

    Publication Year: 2011 , Page(s): 16 - 21
    Cited by:  Papers (4)

    Multiprocessor System-on-Chip (MPSoC) platforms have become increasingly popular for high-performance embedded applications. Each processing element (PE) on such platforms can be tuned to match the computational demands of the tasks executing on it, creating a heterogeneous multiprocessor system. Extensible processor cores, where the base instruction-set architecture can be augmented with application-specific custom instructions, have recently emerged as flexible building blocks for heterogeneous MPSoC platforms. However, the customization of the different PEs has to be carried out in a synergistic manner so as to create an optimal system. In this work, we propose a pseudo-polynomial time algorithm to design the most resource-efficient customized MPSoC platform for mapping linear task graphs representing streaming applications, under deadline constraints. Experimental validation with MP3 encoder and MPEG-2 encoder applications confirms the efficiency of our approach.
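The knapsack-like structure of such a pseudo-polynomial mapping algorithm can be illustrated with a small dynamic program: for a linear chain of tasks, pick one (area, latency) customization per task so that total area is minimized under an end-to-end deadline. This is a hedged sketch with an invented cost model, not the algorithm from the paper:

```python
def min_area_mapping(task_configs, deadline):
    """task_configs: one list per task of (area, latency) options, integer latencies.
    Returns the minimum total area meeting the deadline, or None if infeasible."""
    INF = float("inf")
    # dp[t] = min total area over assignments whose accumulated latency is exactly t
    dp = [INF] * (deadline + 1)
    dp[0] = 0
    for options in task_configs:
        ndp = [INF] * (deadline + 1)
        for t in range(deadline + 1):
            if dp[t] == INF:
                continue
            for area, lat in options:
                if t + lat <= deadline:
                    ndp[t + lat] = min(ndp[t + lat], dp[t] + area)
        dp = ndp
    best = min(dp)
    return None if best == INF else best
```

The table size is (number of tasks) x (deadline), which is what makes the running time pseudo-polynomial: it depends on the numeric value of the deadline, not just the input length.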

  • Integrating formal verification and high-level processor pipeline synthesis

    Publication Year: 2011 , Page(s): 22 - 29

    When a processor implementation is synthesized from a specification using an automatic framework, the implementation should still be verified against its specification to ensure that the framework introduced no errors. This paper presents our effort in integrating fully automated formal verification with a high-level processor pipeline synthesis framework. As an integral part of the pipeline synthesis, our framework also emits SMV models for checking the functional equivalence between the output pipelined processor implementation and its input non-pipelined specification. Well-known compositional model checking techniques are automatically applied to curtail state explosion during model checking. The paper reports case studies of applying this integrated framework to synthesize and formally verify pipelined RISC and CISC processors.

  • USHA: Unified software and hardware architecture for video decoding

    Publication Year: 2011 , Page(s): 30 - 37

    Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding that achieves scalable performance with flexibility. The lightweight processor tiles and the reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process-network-oriented compilation flow achieves realization-agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi-hardware and full-hardware implementations of a video decoder. An application quality-of-service-aware scheduler monitors and controls the operation of the entire system. We prove the concept through a prototype of the architecture on an off-the-shelf FPGA. The FPGA prototype shows performance scaling from QCIF to 1080p resolutions in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to another without any frame loss.

  • Modular high-throughput and low-latency sorting units for FPGAs in the Large Hadron Collider

    Publication Year: 2011 , Page(s): 38 - 45
    Cited by:  Papers (1)

    This paper presents efficient techniques for designing high-throughput, low-latency sorting units for FPGA implementation. Our sorting units use modular design techniques that hierarchically construct large sorting units from smaller building blocks. They are optimized for situations in which only the M largest numbers from N inputs are needed; this situation commonly occurs in high-energy physics experiments and other forms of digital signal processing. Based on these techniques, we design parameterized, pipelined sorting units. A detailed analysis indicates that their resource requirements scale linearly with the number of inputs, latencies scale logarithmically with the number of inputs, and frequencies remain fairly constant. Synthesis results indicate that a single pipelined 256-to-4 sorting unit with 19 stages can perform 200 million sorts per second with a latency of about 95 ns per sort on a Virtex-5 FPGA.
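The modular construction the abstract describes (small sorting blocks combined hierarchically, with only the M largest values propagating upward) can be modeled in software. This is a behavioral sketch only; the paper's contribution is the pipelined hardware realization:

```python
import heapq

def top_m(block_a, block_b, m):
    # Merge two partial results, keeping only the m largest values;
    # in hardware this is a small merge unit that discards the rest.
    return heapq.nlargest(m, block_a + block_b)

def hierarchical_top_m(values, m, leaf_size=4):
    """Leaves sort small blocks fully; internal nodes merge and truncate to m,
    so wires and comparators never carry more than m candidates per block."""
    blocks = [sorted(values[i:i + leaf_size], reverse=True)[:m]
              for i in range(0, len(values), leaf_size)]
    while len(blocks) > 1:
        blocks = [top_m(blocks[i], blocks[i + 1], m) if i + 1 < len(blocks) else blocks[i]
                  for i in range(0, len(blocks), 2)]
    return blocks[0]
```

Truncating to M at every merge level is what keeps the resource growth linear in N: each internal node handles at most 2M candidates regardless of how many raw inputs feed the tree.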

  • Memory-efficient volume ray tracing on GPU for radiotherapy

    Publication Year: 2011 , Page(s): 46 - 51
    Cited by:  Patents (1)

    Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many radiation dose calculation methods in radiotherapy. Recent advances in graphics processing units (GPUs) have brought real-time dose calculation within reach. However, the known GPU methods for volume ray tracing are all memory-throughput bound, which leads to inefficient usage of the GPU's computational capacity. This paper introduces a simple yet effective ray tracing technique aiming to improve the memory bandwidth utilization of the GPU when processing a massive number of rays. The idea is to exploit the coherence between rays and match the ray tracing behavior with the underlying characteristics of the GPU memory system. The proposed method has been evaluated on four phantom setups using randomly generated rays. The collapsed-cone convolution/superposition (CCCS) dose calculation method is also implemented with and without the proposed approach to verify the feasibility of our method. Compared with the direct GPU implementation of the popular 3DDDA algorithm, the new method provides a speedup in the range of 1.8-2.7X for the given phantom settings. Major performance factors such as ray origins, phantom sizes, and pyramid sizes are also analyzed. The proposed technique also leads to a speedup of 1.3-1.6X over the original GPU implementation of the CCCS algorithm.
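For reference, the 3DDDA baseline the authors compare against is the classic uniform-grid voxel walk (in the style of Amanatides and Woo). The sketch below is a software illustration with unit-sized voxels assumed, not the paper's GPU code:

```python
import math

def dda_traverse(origin, direction, grid_size, max_steps=1000):
    """Enumerate voxel indices along a ray through a uniform grid (3D-DDA).
    Assumes unit-sized voxels; stops when the ray leaves the grid."""
    voxel = [int(math.floor(o)) for o in origin]
    step, t_max, t_delta = [], [], []
    for o, d, v in zip(origin, direction, voxel):
        if d > 0:
            step.append(1); t_max.append((v + 1 - o) / d); t_delta.append(1 / d)
        elif d < 0:
            step.append(-1); t_max.append((v - o) / d); t_delta.append(-1 / d)
        else:
            step.append(0); t_max.append(math.inf); t_delta.append(math.inf)
    visited = []
    for _ in range(max_steps):
        if not all(0 <= voxel[i] < grid_size[i] for i in range(3)):
            break
        visited.append(tuple(voxel))
        axis = min(range(3), key=lambda i: t_max[i])  # nearest voxel boundary
        voxel[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return visited
```

Each ray advances one voxel per iteration along whichever axis boundary is closest; it is this per-ray, data-dependent memory walk that makes the GPU implementation bandwidth-bound and motivates exploiting inter-ray coherence.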

  • System integration of Elliptic Curve Cryptography on an OMAP platform

    Publication Year: 2011 , Page(s): 52 - 57
    Cited by:  Papers (1)

    Elliptic Curve Cryptography (ECC) is popular for digital signatures and other public-key crypto-applications in embedded contexts. However, ECC is computationally intensive, and in particular the performance of the underlying modular arithmetic remains a concern. We investigate the design space of ECC on TI's OMAP 3530 platform, with a focus on using OMAP's DSP core to accelerate ECC computations for the ARM Cortex A8 core. We examine the opportunities of the heterogeneous platform for efficient ECC, including the efficient implementation of the underlying field multiplication on the DSP, and the design partitioning to minimize the communications overhead between ARM and DSP. By migrating the computations to the DSP, we demonstrate a significant speedup for the underlying modular arithmetic with up to 9.24x reduction in execution time, compared to the implementation executing on the ARM Cortex processor. Prototype measurements show an energy reduction of up to 5.3 times. We conclude that a heterogeneous platform offers substantial improvements in performance and energy, but we also point out that the cost of inter-processor communication cannot be ignored.
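As context for where the cycles go: ECC's top-level operation is scalar point multiplication, a long chain of modular field multiplications and inversions, and it is this kind of arithmetic the paper offloads to the DSP. A toy-parameter sketch with a small textbook curve (not the paper's implementation or parameters):

```python
def ec_scalar_mult(k, P, a, p):
    """Double-and-add scalar multiplication on y^2 = x^3 + a*x + b over F_p.
    Toy illustration: affine coordinates, tiny prime; real ECC uses large
    primes and faster coordinate systems."""
    def inv(x):
        return pow(x, p - 2, p)          # Fermat inverse, p prime
    def add(P, Q):
        if P is None: return Q           # None encodes the point at infinity
        if Q is None: return P
        (x1, y1), (x2, y2) = P, Q
        if x1 == x2 and (y1 + y2) % p == 0:
            return None
        if P == Q:
            lam = (3 * x1 * x1 + a) * inv(2 * y1) % p   # tangent slope
        else:
            lam = (y2 - y1) * inv(x2 - x1) % p          # chord slope
        x3 = (lam * lam - x1 - x2) % p
        return (x3, (lam * (x1 - x3) - y1) % p)
    R = None
    while k:                              # scan scalar bits, LSB first
        if k & 1:
            R = add(R, P)
        P = add(P, P)
        k >>= 1
    return R
```

Every `add` costs several field multiplications plus an inversion, which is why accelerating the underlying field multiplication, as the paper does on the DSP, pays off directly at the protocol level.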

  • ISIS: An accelerator for Sphinx speech recognition

    Publication Year: 2011 , Page(s): 58 - 61
    Cited by:  Papers (1)  |  Patents (3)

    The ability to naturally interact with devices is becoming increasingly important. Speech recognition is one well-known solution providing easy, hands-free user-device interaction. However, speech recognition has significant computation and memory bandwidth requirements, making it challenging to deliver high-performance, real-time recognition at ultra-low power on handheld devices. In this paper, we present a speech recognition accelerator called ISIS. We show the overall execution flow of the accelerated speech recognition solution along with optimizations and the key metrics of performance, area and power.

  • Dynamically reconfigurable architecture for a driver assistant system

    Publication Year: 2011 , Page(s): 62 - 65
    Cited by:  Papers (2)

    Application-specific programmable processors are increasingly being replaced by FPGAs, which offer high levels of logic density, rich sets of embedded hardware blocks, and a high degree of customizability and reconfigurability. New FPGA features such as Dynamic Partial Reconfiguration (DPR) can be leveraged to reduce resource utilization and power consumption while still providing high levels of performance. In this paper, we describe our implementation of a dynamically reconfigurable multiple-target tracking (MTT) module for an automotive driver assistance system. Our module implements a dynamically reconfigurable filtering block that changes with changing driving conditions.

  • FPGA based parallel architecture implementation of Stacked Error Diffusion algorithm

    Publication Year: 2011 , Page(s): 66 - 69

    Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is needed because printers have a limited set of inks and cannot reproduce all the intensities of a continuous-tone image. Error Diffusion is a halftoning algorithm that iteratively quantizes pixels in a neighborhood-dependent fashion. This manuscript focuses on the design of a scalable parallel hardware architecture for high-performance implementation of a high-quality Stacked Error Diffusion algorithm, and on its functional and performance validation through Hardware Description Language (HDL) simulation. A CMYK printer using this high-quality error diffusion algorithm would need to execute error diffusion 16 times per pixel, at a potentially high computational cost. The algorithm, originally described in `C', requires significant processing time when implemented on a conventional single-CPU computer system. We therefore develop a new scalable, high-performance parallel hardware architecture for the algorithm and implement and test it on a single Field Programmable Gate Array (FPGA) chip. The run time of the algorithm on the proposed parallel FPGA architecture decreases significantly compared to execution on a single-CPU system.
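Plain (non-stacked) error diffusion, the neighborhood-dependent quantization the abstract refers to, works as in the classic Floyd-Steinberg scheme sketched below. The paper's Stacked Error Diffusion variant differs, but it shares this serial data-dependency pattern, which is exactly what makes a parallel hardware implementation challenging:

```python
def floyd_steinberg(image):
    """Binarize a 2D grayscale image (values 0..255) by classic
    Floyd-Steinberg error diffusion. Illustrative only; not the paper's
    Stacked Error Diffusion algorithm."""
    h, w = len(image), len(image[0])
    img = [row[:] for row in image]        # working copy; input not mutated
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = img[y][x]
            new = 255 if old >= 128 else 0  # quantize to black or white
            out[y][x] = new
            err = old - new
            # Push the quantization error onto not-yet-processed neighbors.
            for dx, dy, wgt in ((1, 0, 7/16), (-1, 1, 3/16), (0, 1, 5/16), (1, 1, 1/16)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    img[ny][nx] += err * wgt
    return out
```

Because each pixel's decision depends on errors diffused from its already-processed neighbors, naive parallelization is impossible; hardware architectures instead exploit wavefront-style parallelism across rows.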

  • 3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications

    Publication Year: 2011 , Page(s): 70 - 73
    Cited by:  Papers (1)

    GPU devices typically have higher off-chip bandwidth than FPGA-based systems, so GPUs should typically perform better for bandwidth-bound, massively parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic cost (2 MADDs per dimension). While this application is clearly bandwidth bound, the differences in the memory subsystems translate into different bandwidth optimization techniques. Our implementations on the GPU and FPGA platforms show 26X and 33X speedups respectively over optimized single-thread code on the CPU.
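The recursive (IIR) structure, a forward and a backward multiply-accumulate pass applied separably along each of the three dimensions, can be sketched as follows. The first-order filter and the `alpha` coefficient here are placeholders for illustration; the paper's actual Gaussian IIR coefficients are not reproduced:

```python
import copy

def recursive_smooth_1d(signal, alpha=0.5):
    """One forward and one backward first-order recursive pass: a single
    multiply-add each way, loosely matching the '2 MADDs per dimension'
    cost structure (placeholder filter, not the paper's coefficients)."""
    n = len(signal)
    fwd = [0.0] * n
    acc = signal[0]
    for i in range(n):                      # causal pass
        acc = alpha * signal[i] + (1 - alpha) * acc
        fwd[i] = acc
    out = [0.0] * n
    acc = fwd[-1]
    for i in range(n - 1, -1, -1):          # anti-causal pass
        acc = alpha * fwd[i] + (1 - alpha) * acc
        out[i] = acc
    return out

def recursive_smooth_3d(volume, alpha=0.5):
    # Separable: filter along x, then y, then z of a nested-list volume.
    v = copy.deepcopy(volume)
    Z, Y, X = len(v), len(v[0]), len(v[0][0])
    for z in range(Z):
        for y in range(Y):
            v[z][y] = recursive_smooth_1d(v[z][y], alpha)
    for z in range(Z):
        for x in range(X):
            col = recursive_smooth_1d([v[z][y][x] for y in range(Y)], alpha)
            for y in range(Y):
                v[z][y][x] = col[y]
    for y in range(Y):
        for x in range(X):
            col = recursive_smooth_1d([v[z][y][x] for z in range(Z)], alpha)
            for z in range(Z):
                v[z][y][x] = col[z]
    return v
```

The recurrences make each pass serial along its axis, so the arithmetic is tiny and every element is touched six times: this is why the kernel is bandwidth-bound on every platform.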

  • A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching

    Publication Year: 2011 , Page(s): 74 - 77
    Cited by:  Papers (2)

    The availability of huge amounts of nucleotide sequences catalyzes the development of fast algorithms for approximate DNA and RNA string matching. However, most existing online algorithms can only handle small-scale problems; when querying large genomes, their performance becomes unacceptable. Offline algorithms such as Bowtie and BWA require building indexes, and their memory requirements are high. We have developed a fast CUDA implementation of the agrep algorithm for approximate nucleotide sequence matching by exploiting the huge computational power of modern GPU hardware. Our CUDA program is capable of searching large genomes for patterns of length up to 64 with edit distance up to 9. For example, it is able to search the entire human genome (3.10 Gbp in 24 chromosomes) for patterns of lengths 30 and 60 with edit distances of 3 and 6 within 371 and 1,188 milliseconds respectively on one NVIDIA GeForce GTX285 graphics card, achieving 70-fold and 36-fold speedups over a multithreaded quad-core CPU counterpart. Our program employs an online approach and does not require building indexes of any kind, so it can be applied in real time. Using a two-bits-per-character binary representation, its memory requirement is merely one fourth of the original genome size, so it is possible to load multiple genomes simultaneously. The x86 and x64 executables for Linux and Windows, C++ source code, documentation, user manual, and an AJAX MVC website for online real-time searching are available at http://agrep.cse.cuhk.edu.hk. Users can also send emails to CUDAagrep@gmail.com to queue up for a job.
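The core of agrep is the Wu-Manber bit-parallel ("bitap") recurrence, which tracks all pattern prefixes matched with up to k errors in k+1 machine words, one bit per prefix length. A minimal sketch (the CUDA version parallelizes many such searches across threads):

```python
def bitap_search(text, pattern, k):
    """Approximate matching in the style of agrep (Wu-Manber shift-and):
    return end indices in `text` where `pattern` matches within edit
    distance k. Pattern length must fit in a machine word."""
    m = len(pattern)
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    # R[d]: bit i set <=> pattern[0..i] matches a suffix of the text read
    # so far with at most d errors.
    R = [(1 << d) - 1 for d in range(k + 1)]
    accept = 1 << (m - 1)
    ends = []
    for pos, ch in enumerate(text):
        cm = mask.get(ch, 0)
        prev = R[0]
        R[0] = ((R[0] << 1) | 1) & cm
        for d in range(1, k + 1):
            cur = R[d]
            R[d] = ((((cur << 1) | 1) & cm)   # match current character
                    | (prev << 1) | prev      # substitution / insertion
                    | (R[d - 1] << 1) | 1)    # deletion (uses updated R[d-1])
            prev = cur
        if R[k] & accept:
            ends.append(pos)
    return ends
```

Each text character costs a handful of shifts, ANDs and ORs per error level, independent of the pattern content, which is what makes the algorithm map so well to GPU threads.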

  • Frameworks for GPU Accelerators: A comprehensive evaluation using 2D/3D image registration

    Publication Year: 2011 , Page(s): 78 - 81
    Cited by:  Papers (3)

    In the last decade, there has been dramatic growth in research and development of massively parallel many-core architectures like graphics hardware, both in academia and industry. This has also changed how programs are written in order to leverage the processing power of a multitude of cores on the same hardware. In the beginning, programmers had to use special graphics programming interfaces to express general-purpose computations on graphics hardware. Today, several frameworks exist to relieve the programmer of such tasks. In this paper, we evaluate five frameworks for parallelization on GPU accelerators, namely RapidMind, PGI Accelerator, HMPP Workbench, OpenCL, and CUDA. To evaluate these frameworks, a real-world application from medical imaging is investigated: 2D/3D image registration.

  • A massively parallel implementation of QC-LDPC decoder on GPU

    Publication Year: 2011 , Page(s): 82 - 85
    Cited by:  Papers (11)

    The graphics processing unit (GPU) provides a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map real-world applications to the GPU and fully utilize its computational power. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper describes how we map the algorithm onto the massively parallel architecture of the GPU and fully utilize its computational resources to significantly boost performance. Moreover, several efficient data structures are proposed to reduce the memory access latency and the memory bandwidth requirement. Experimental results show that the proposed GPU-based LDPC decoding accelerator takes advantage of the multi-core computational power of the GPU and achieves throughput of up to 100.3 Mbps.

  • ARTE: An Application-specific Run-Time management framework for multi-core systems

    Publication Year: 2011 , Page(s): 86 - 93
    Cited by:  Papers (3)

    Programmable multi-core and many-core platforms exponentially increase the challenge of task mapping and scheduling, provided that enough task parallelism exists in each application. This problem worsens when dealing with small ecosystems such as embedded systems-on-chip: in this case, a traditional operating system is out of the question, given the memory required to satisfy the run-time footprint of such a configuration.

  • A hardware acceleration technique for gradient descent and conjugate gradient

    Publication Year: 2011 , Page(s): 94 - 101

    Application Robustification, a promising approach for reducing processor power, converts applications into numerical optimization problems and solves them using gradient descent and conjugate gradient algorithms. The improvement in robustness, however, comes at the expense of performance when compared to the baseline non-iterative versions of these applications. To mitigate the performance loss from robustification, we present the design of a hardware accelerator and corresponding software support that accelerate gradient descent and conjugate gradient based iterative implementations of applications. Unlike traditional accelerators, our design accelerates different types of linear algebra operations found in many algorithms and is capable of efficiently handling the sparse matrices that arise in applications such as graph matching. We show that the proposed accelerator can provide significant speedups for iterative versions of several applications and that for some applications, such as least squares, it can substantially improve the computation time compared to the baseline non-iterative implementation.
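The conjugate gradient loop such an accelerator targets is dominated by exactly the linear algebra primitives mentioned: matrix-vector products, dot products, and vector updates. A minimal dense-matrix sketch (real workloads would use the sparse formats the accelerator supports):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Plain conjugate gradient for a symmetric positive definite matrix A
    (nested lists); illustrates the kernel mix, not the accelerator design."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    x = [0.0] * n
    r = b[:]                  # residual b - A*x, with x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)                 # the dominant cost per iteration
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Every iteration is one matrix-vector product, two dot products, and three AXPY-style vector updates, which is why an accelerator covering those few primitives speeds up the whole family of robustified applications.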

  • A multi-threaded coarse-grained array processor for wireless baseband

    Publication Year: 2011 , Page(s): 102 - 107
    Cited by:  Papers (4)

    The throughput of wireless communication standards keeps increasing, and the computational requirements of systems implementing those standards increase even faster. On battery-operated devices, low power is as important as high performance, and achieving both is only possible by exploiting parallelism at all levels. The ADRES processor is an embedded coarse-grained reconfigurable baseband processor that already exploits Data Level Parallelism (DLP) and Instruction Level Parallelism (ILP) efficiently. In this paper we present extensions to ADRES that also exploit Task Level Parallelism (TLP) efficiently. We show how we reduce the communication and synchronization overhead between tasks and demonstrate this on a mapping of the 300 Mbps 802.11n standard.

  • Hardware/software co-designed accelerator for vector graphics applications

    Publication Year: 2011 , Page(s): 108 - 114
    Cited by:  Papers (1)

    This paper proposes a new hardware accelerator to speed up the performance of vector graphics applications on complex embedded systems. The resulting hardware accelerator is synthesized on a field-programmable gate array (FPGA) and integrated with software components. The paper also introduces a hardware/software co-verification environment which provides in-system at-speed functional verification and performance evaluation to verify the hardware/software integrated architecture. The experimental results demonstrate that the integrated hardware accelerator is fifty times faster than a compiler-optimized software component and it enables vector graphics applications to run nearly two times faster.

  • Scalable object detection accelerators on FPGAs using custom design space exploration

    Publication Year: 2011 , Page(s): 115 - 121
    Cited by:  Papers (3)

    We discuss FPGA implementations of object (such as face) detectors in video streams using the accurate Haar-feature based algorithm. Rather than creating one implementation for one FPGA, we develop a method to generate a series of implementations that have different size and performance to target different FPGA devices. The automatic generation was enabled by custom design space exploration on a particular design problem relating to the communication architecture used to support different numbers of image classifiers. The exploration algorithm uses content information in each feature set to optimize and generate a scalable communication architecture. We generated fully-working implementations for Xilinx Virtex5 LX50T, LX110T, and LX155T FPGA devices, using various amounts of available device capacity, leading to speedups ranging from 0.6x to 25x compared to a 3.0 GHz Pentium 4 desktop machine. Automated generators that include custom design space exploration may become more necessary when creating hardware accelerators intended for use across a wide range of existing and future FPGA devices.
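Haar-feature evaluation rests on the integral image (summed-area table), which reduces any rectangle sum to four lookups. A minimal sketch of the primitive the classifiers build on (a software illustration, not the generated hardware):

```python
def integral_image(img):
    """Summed-area table with one extra row/column of zeros:
    ii[y][x] = sum of img over rows < y and cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum over the w-by-h rectangle at (x, y): four lookups, O(1).
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_edge_feature(ii, x, y, w, h):
    # Two-rectangle (left minus right) Haar feature, the basic primitive;
    # real classifiers threshold weighted combinations of such responses.
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

The constant-time rectangle sums are what let a hardware cascade evaluate many classifiers per pixel per cycle, and the number of parallel classifiers is precisely the knob the paper's design space exploration tunes per device.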

  • A parallel accelerator for semantic search

    Publication Year: 2011 , Page(s): 122 - 128

    Semantic text analysis is a technique used in advertisement placement, cognitive databases and search engines. With increasing amounts of data and stringent response-time requirements, improving the underlying implementation of semantic analysis becomes critical. To this end, we look at Supervised Semantic Indexing (SSI), a recently proposed algorithm for semantic analysis. SSI ranks a large number of documents based on their semantic similarity to a text query. For each query, it computes millions of dot products on unstructured data, generates a large intermediate result, and then performs ranking. SSI underperforms on both state-of-the-art multi-cores and GPUs. Its performance scalability on multi-cores is hampered by their limited support for fine-grained data parallelism. GPUs, though they beat multi-cores by running thousands of threads, cannot handle the large intermediate data because of their small on-chip memory. Motivated by this, we present an FPGA-based hardware accelerator for semantic analysis. As a key feature, the accelerator combines hundreds of simple processing elements together with in-memory processing to simultaneously generate and process (consume) the large intermediate data. It also supports “dynamic parallelism” - a feature that configures the PEs differently for full utilization of the available processing logic after the FPGA is programmed. Our FPGA prototype is 10-13x faster than a 2.5 GHz quad-core Xeon, and 1.5-5x faster than a 240-core 1.3 GHz Tesla GPU, despite operating at a modest frequency of 125 MHz.
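The per-query workload is easy to sketch: one dot product per document, then top-k selection over the large stream of scores. The sketch below consumes scores on the fly with a bounded heap, loosely analogous to how the accelerator generates and consumes the intermediate data simultaneously instead of materializing it (function and parameter names are invented; the real SSI model learns the embeddings):

```python
import heapq

def ssi_rank(query_vec, doc_vecs, topk):
    """Score every document by a dot product with the query embedding and
    keep the top-k. Embeddings are assumed given; this is a workload
    sketch, not the SSI training or the FPGA design."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # Lazily generated (score, doc_id) stream: the "large intermediate
    # result" is never stored in full, only the k best survive.
    scores = ((dot(query_vec, d), i) for i, d in enumerate(doc_vecs))
    best = heapq.nlargest(topk, scores)
    return [(i, s) for s, i in best]
```

Keeping only a k-sized selection structure next to the score producers is the same generate-and-consume idea the accelerator implements with in-memory processing.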

  • A novel parallel Tier-1 coder for JPEG2000 using GPUs

    Publication Year: 2011 , Page(s): 129 - 136
    Cited by:  Papers (4)

    The JPEG2000 image compression standard provides superior features to the popular JPEG standard; however, the slow performance of software implementations of JPEG2000 has kept it from being widely adopted. More than 80% of the execution time for JPEG2000 is spent in the Tier-1 coding engine. While much effort over the past decade has been devoted to optimizing this component, its performance remains slow, chiefly because the Tier-1 coder consists of highly serial operations, each operating on individual bits in every single bit plane of the image samples. In addition, until recently there was no efficient hardware platform providing massively parallel acceleration for Tier-1. The recent growth of general-purpose GPU (GPGPU) computing provides a great opportunity to solve the problem with thousands of parallel processing threads. In this paper, we examine the computation steps in JPEG2000, particularly in Tier-1, and develop novel, GPGPU-compatible, parallel processing methods for the sample-level coding of images. The GPGPU-based parallel engine allows for significant speedup in execution time compared to the JasPer JPEG2000 compression software. Running on a single Nvidia GTX 480 GPU, the parallel wavelet engine achieves 100× speedup, the parallel bit plane coder achieves more than 30× speedup, and the overall Tier-1 coder achieves up to 17× speedup.
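The bit-plane view that makes Tier-1 so serial can be sketched directly: each coefficient's magnitude is scanned one bit plane at a time, most significant bit first. This shows only the decomposition; the context modeling and the three coding passes of the real Tier-1 coder are omitted:

```python
def bit_planes(samples, nbits):
    """Decompose integer wavelet coefficients into bit planes, MSB first,
    the per-bit view the Tier-1 coder scans. Sign handling and the actual
    context-modeled arithmetic coding passes are omitted."""
    planes = []
    for b in range(nbits - 1, -1, -1):          # MSB plane down to LSB plane
        planes.append([(abs(s) >> b) & 1 for s in samples])
    return planes
```

Because every coefficient contributes one bit to every plane, and the coding decision for each bit depends on neighbors' states in earlier passes, the work per sample multiplies by the bit depth, which is why Tier-1 dominates the execution time.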
