Design and Architectures for Signal and Image Processing (DASIP), 2012 Conference on

Date 23-25 Oct. 2012

  • [Front cover]

    Publication Year: 2012 , Page(s): c1
    Freely Available from IEEE
  • [Title page]

    Publication Year: 2012 , Page(s): 1
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2012 , Page(s): 1 - 5
    Freely Available from IEEE
  • Welcome to DASIP 2012

    Publication Year: 2012 , Page(s): 1 - 5
    Freely Available from IEEE
  • Session 1: Definition and implementation of image and signal processing algorithms (chair: Bertrand Granado, LIP6)

    Publication Year: 2012 , Page(s): 1
    Freely Available from IEEE
  • A low energy adaptive motion estimation hardware for H.264 Multiview Video Coding

    Publication Year: 2012 , Page(s): 1 - 6
    Cited by:  Papers (1)

    Multiview Video Coding (MVC) is the process of efficiently compressing stereo (2-view) or multiview video signals. The improved compression efficiency achieved by H.264 MVC comes with a significant increase in computational complexity. Temporal prediction and inter-view prediction are the most computationally intensive parts of H.264 MVC. Therefore, in this paper, we propose novel techniques that significantly reduce the amount of computation performed by temporal and inter-view prediction in H.264 MVC, with very small PSNR loss and bitrate increase. We also propose a low energy adaptive H.264 MVC motion estimation hardware for implementing the temporal and inter-view predictions, including the proposed computation reduction techniques. The proposed hardware is implemented in Verilog HDL and mapped to a Xilinx Virtex-6 FPGA. The FPGA implementation is capable of processing 30×8 = 240 frames per second of a CIF (352×288) 8-view video sequence or 30×2 = 60 frames per second of a VGA (640×480) stereo (2-view) video sequence. The proposed techniques reduce the energy consumption of this hardware significantly.
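    As a rough illustration of the two operating points quoted in this abstract, the sketch below converts them into raw pixel rates. Only the resolutions, frame rates, and view counts come from the abstract; the helper name `pixels_per_second` is a hypothetical choice made for this example, and the calculation says nothing about how the hardware itself schedules its work.

```python
# Pixel throughput implied by the two operating points quoted in the abstract.
# The helper name and structure are illustrative assumptions, not from the paper.

def pixels_per_second(width: int, height: int, fps_per_view: int, views: int) -> int:
    """Total pixels processed per second across all views."""
    return width * height * fps_per_view * views

cif_8_view = pixels_per_second(352, 288, 30, 8)  # 30 x 8 = 240 frames/s in total
vga_stereo = pixels_per_second(640, 480, 30, 2)  # 30 x 2 = 60 frames/s in total

print(cif_8_view)  # 24330240
print(vga_stereo)  # 18432000
```

    Under these figures, the 8-view CIF case carries the higher raw pixel rate of the two.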

  • A high performance and low energy intra prediction hardware for HEVC video decoding

    Publication Year: 2012 , Page(s): 1 - 8
    Cited by:  Papers (4)

    The intra prediction algorithm in the recently developed High Efficiency Video Coding (HEVC) standard has very high computational complexity. Therefore, in this paper, we propose novel techniques for reducing the amount of computation performed by the intra prediction algorithm in an HEVC decoder, thereby reducing the energy consumption of the intra prediction hardware in the decoder. The proposed techniques significantly reduce the amount of computation performed by 4×4 and 8×8 luminance prediction modes with a small comparison overhead and without any PSNR or bit rate loss. We also designed and implemented a high performance intra prediction hardware for 4×4 and 8×8 angular prediction modes, including the proposed techniques, for HEVC video decoding using Verilog HDL, and mapped it to a Xilinx Virtex-6 FPGA. The proposed techniques significantly reduce the energy consumption of the proposed hardware on this FPGA.

  • Design of fixed-point rounding operators for the VHDL-2008 standard

    Publication Year: 2012 , Page(s): 1 - 8

    The contemporary design of sophisticated digital signal processing platforms involves the use of specifications at an increasingly high abstraction level. This scheme is dictated by the ever growing divide between available circuit complexity and developer productivity. Algorithm developers tend to use very high-level programming languages such as MATLAB in order to rapidly and seamlessly generate low-level design facets such as ANSI C reference implementations and synthesizable HDL code. In this paper, a generic and parameterized implementation of fixed-point rounding operators in the VHDL hardware description language is introduced. Most hardware compilation frameworks either lack support for these operators or provide specialized and non-portable implementations. Further, this is the first proposed implementation of these operators that can take advantage of features only present in the VHDL-2008 standard. For the case of signed arithmetic, the proposed combinatorial designs achieve about 30% lower timing than rival designs, with similar area demands, when realized on FPGAs.

  • Investigating performance variations of an optimized GPU-ported granulometry algorithm

    Publication Year: 2012 , Page(s): 1 - 6

    In this article, we present an optimized GPU implementation of a granulometry algorithm that is widely used in the study of materials. The main contribution to this algorithm is the binarization of the input data, which increases throughput while reducing the memory space allocated to data. Also, the optimized GPU implementation brings an order of magnitude speedup compared to a multi-threaded CPU implementation. Furthermore, we investigate the reasons why GPU performance drops for different input data dimensions. Three main factors are exposed: under-exploited threads, thread blocks and streaming multiprocessors. This study should help the reader understand the tight relation that exists between the CUDA programming paradigm and the GPU architecture, as well as some main bottlenecks.

  • Session 2 - Special session: Reconfigurable and adaptive architectures for image and signal processing (co-chairs: Diana Göhringer, KIT, and Sebastien Pillement, University of Rennes 1)

    Publication Year: 2012 , Page(s): 1
    Freely Available from IEEE
  • Energy-efficient heterogeneous reconfigurable sensor node for distributed structural health monitoring

    Publication Year: 2012 , Page(s): 1 - 8
    Cited by:  Papers (2)

    Distributed structural health monitoring (SHM) using wireless sensor nodes (WSN) requires frugal spending of the limited energy budget. We propose a reconfigurable heterogeneous architecture, combining a low-power micro-controller (MCU) with a Field-Programmable Gate Array (FPGA), as a means for energy-efficient in-sensor processing. Details covered include a generic communication interface between both computing units and several clock-management schemes for energy efficiency. We evaluate the architecture on the use-case of a Random Decrement (RD) algorithm and also consider additional pre-filtering to reduce the volume of wirelessly transmitted data. Compared to conventional low-power sensor nodes, we can reduce the energy required for data processing by up to 81%.

  • Implementing large-kernel 2-D filters using Impulse CoDeveloper

    Publication Year: 2012 , Page(s): 1 - 8

    Bidimensional convolution is a low-level processing algorithm which is of great interest in many areas, but its high computational cost limits the size of the kernels, especially in real-time embedded systems. This work describes the process of designing 2-D filters with large kernels using the Impulse CoDeveloper™ electronic system-level tool by Impulse Accelerated Technologies. The proposed design includes an efficient management of the operations at the borders of the input array. Several kernel sizes have been tested, ranging from 20×20 coefficients to 50×50 coefficients, with different bits-per-pixel configurations for both the kernel coefficients and the input data. In each case, performance is reported in terms of area utilization and minimum clock period.

  • Middleware based executive for embedded reconfigurable platforms

    Publication Year: 2012 , Page(s): 1 - 6

    This paper presents a method to virtualize communications within a distributed heterogeneous embedded Multiprocessor System-on-Chip (MPSoC) platform containing reconfigurable hardware computing units. We propose a new concept of middleware, implemented in software and in hardware, to provide the designer with a single programming interface. The middleware offers mechanisms such as access to distant operating system (OS) services and interprocess communication. It abstracts both implementation and mapping. The embedded application then executes regardless of where or how processes are implemented. We are currently validating the concept on a real-time image processing application.

  • Partitioning and context switching for a reconfigurable FPGA-based DAB receiver

    Publication Year: 2012 , Page(s): 1 - 8
    Cited by:  Papers (3)

    The sequential execution of processing elements by time-multiplexing FPGA resources using single-island partial reconfiguration allows for resource-efficient designs in comparison to static FPGA implementations. Designing a processing chain for such a system requires the chain to be partitioned into reconfigurable modules, which can be sequentially executed. For this task, we will present an approach to partition an existing digital signal processing chain into separate modules with the goal of obtaining a balanced logic occupation. Furthermore, we will show how the overhead of context switching can be reduced by frame-aware data processing, and we will introduce a context-annotation scheme for synchronous data flow graphs. After applying our findings to a reconfigurable digital audio broadcasting receiver and quantifying the benefits and drawbacks of time-multiplexed execution, we will finally show that the time-multiplexed execution of receiver components decreases the resource consumption compared to the static design.

  • Session 3: Application-specific processor and co-processors for image and signal processing (chair: Francesca Palumbo, University of Cagliari)

    Publication Year: 2012 , Page(s): 1
    Freely Available from IEEE
  • An evaluation on using GPU coprocessing for software radios on a low-cost platform

    Publication Year: 2012 , Page(s): 1 - 8
    Cited by:  Papers (1)

    The presented study explores the characteristics of signal processing for software radio on heterogeneous x86/GPU system architectures. Special attention is paid to the question of whether the use of the GPU as a signal coprocessor helps to reduce the actual load of the x86 host processor. We focus on low-cost platforms with a chipset-integrated GPU next to the application processor, since they come close to embedded needs. As a case study, we evaluate a complete software defined radio capable of standard-compliant, real-time on-air radio reception of Digital Audio Broadcasting. We present the benchmark results obtained for the compute kernels which were ported to the GPU subsystem, and also compare them to an implementation optimized solely for the x86 host processor. When outsourcing the DAB SDR kernels to the GPU coprocessor, it becomes apparent that GPU housekeeping overhead today introduces more load on the host CPU than the CPU would spend actually computing the SDR kernels by itself. This is verified by a detailed system-wide analysis, which also treats all use case related aspects besides the actual signal processing kernels.

  • Application-specific instruction processor for extracting local binary patterns

    Publication Year: 2012 , Page(s): 1 - 8

    Local Binary Pattern (LBP) is a texture operator used in preprocessing for object detection, tracking, face recognition and fingerprint matching. Many of these applications are performed on embedded devices, which poses limitations on the implementation complexity and power consumption. As LBP features are computed pixelwise, high performance is required for real time extraction of LBP features from high resolution video. This paper presents an application-specific instruction processor for LBP extraction. The compact, yet powerful processor is capable of extracting LBP features from 720p (1280×720, 30 fps) video with a reasonable 304 MHz clock rate. With a low power consumption and an area of less than 16k gates, the processor is suitable for embedded devices. Experiments present resource and power consumption measured on an FPGA board, along with processor synthesis results. In terms of latency, our processor requires 17.5× fewer clock cycles per LBP feature than a workstation implementation and only 2.0× more than a hardwired ASIC.
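    For readers unfamiliar with the operator, the sketch below shows the textbook 8-neighbour LBP on a 3×3 neighbourhood in plain Python, followed by the per-pixel cycle budget implied by the resolution, frame rate, and clock figures in this abstract. It is an illustrative software sketch only: the function names, the neighbour ordering, and the sample patch are assumptions made here, and the abstract does not state which LBP variant or instruction set the processor uses.

```python
# Basic 8-neighbour LBP on a 3x3 neighbourhood (the textbook operator).
# Illustrative software sketch; not the instruction set of the paper's processor.

# Neighbour offsets (row, col), clockwise from the top-left pixel. The ordering
# is an assumption: any fixed ordering yields a valid 8-bit code.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(image, row, col):
    """Return the 8-bit LBP code of the interior pixel at (row, col)."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # Set the bit when the neighbour is at least as bright as the centre.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code

def lbp_image(image):
    """LBP codes for all interior pixels of a 2-D list of grey levels."""
    rows, cols = len(image), len(image[0])
    return [[lbp_code(image, r, c) for c in range(1, cols - 1)]
            for r in range(1, rows - 1)]

# Hypothetical 3x3 grey-level patch used only to exercise the functions.
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch, 1, 1))  # 241

# Cycle budget per pixel implied by the abstract's figures:
# 304 MHz clock, 1280x720 pixels per frame, 30 frames per second.
cycles_per_pixel = 304_000_000 / (1280 * 720 * 30)
print(round(cycles_per_pixel, 2))  # 11.0
```

    The budget of roughly 11 clock cycles per pixel follows directly from the quoted numbers and ignores image borders and any blanking intervals.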

  • Consumption analysis and estimation in the design of GStreamer based multimedia applications

    Publication Year: 2012 , Page(s): 1 - 7
    Cited by:  Papers (1)

    Multimedia application development on embedded systems requires highly sophisticated hardware/software frameworks. These frameworks make heavy use of available hardware resources and therefore the impact on the overall consumption is far from negligible. Being able to determine this impact is therefore essential. In order to tackle this challenge, this paper proposes a methodology for modeling the power and energy consumption of multimedia applications, which allows a rapid performance analysis of combined signal and image processing on complex embedded systems. This methodology is applied to the GStreamer multimedia framework running on the Linux operating system for the OMAP3530 heterogeneous MPSoC. Our approach is based on measurements of the power consumption during the rendering of compressed audio/video streams. The resulting power models consider features of the source stream such as the duration, video resolution, audio sampling frequency and bitrate. The precision of our models is presented and finally compared to the actual measurements for two full multimedia clips.

  • Flexible front-end processing for software defined radio applications using application specific instruction-set processors

    Publication Year: 2012 , Page(s): 1 - 8

    The high computational demands of today's wireless communication standards require the design of highly flexible Software Defined Radio (SDR) platforms like the OpenAirInterface ExpressMIMO platform. A DSP engine of major importance is the Front-End Processor (FEP), which deals with the different air-interface operations at the transceiver side. In this paper we propose an Application Specific Instruction-Set Processor (ASIP) architecture for front-end processing and compare it to a programmable DSP engine as well as to other ASIP solutions. For design comparison we mainly focus on architectural differences and the runtime performance in terms of processing time. The synthesis results are provided for different target technologies.

  • Session 4: Multi-processor architectures for image and signal processing (chair: Jean François Nezan, INSA Rennes)

    Publication Year: 2012 , Page(s): 1
    Freely Available from IEEE
  • Architectural decomposition of video decoders for many core architectures

    Publication Year: 2012 , Page(s): 1 - 8

    The microprocessor industry trend towards many-core architectures has introduced the necessity of devising appropriately scalable applications. In video decoding, the main challenges are the optimized partitioning of decoder operations, efficient tracking of dependencies and resource allocation/synchronization for multiple threads. In this paper, we propose a decoder architecture that replaces the conventional monolithic design with a pipelined structure. Bit stream decoding and image processing are separated from each other by means of a Meta Format Stream. The Meta Format is forward-oriented, self-contained and multistandard capable, so that processing of Meta Streams is independent of the originating bit stream. Our approach does not require special coding settings and is applicable to accelerated decoding of any standards-compliant bit stream. An H.264 multiprocessing proposal is presented as a case study for the potential of our decoder architecture. The case study combines coarse-grained frame-level parallel decoding of the bit stream with fine-grained macroblock-level parallelism in the image processing stage. The proposed H.264 decoder achieved speedup factors of up to 7.6 on an 8-core machine with 2-way SMT. We report actual decoding speeds of up to 150 frames per second at 2160p resolution.

  • HLS-based fast design space exploration of ad hoc hardware accelerators: A key tool for MPSoC synthesis on FPGA

    Publication Year: 2012 , Page(s): 1 - 8
    Cited by:  Papers (1)

    Over the last decade, many academic and industrial system synthesis and codesign tools have been proposed to designers. However, most of these tools are based on IP libraries that are either incomplete or simply not adapted to the targets and constraints. This means that something important is missing when it comes to real implementations. We address this question in this paper and propose a flexible, fast and practical solution. We use high level synthesis (HLS) to obtain fast estimations of hardware accelerators that can then be embedded within the loop of a larger design space exploration flow. Once some solutions are selected, they can be directly reused to synthesise and produce real IPs. In this paper we present the approach and the tool as a key component of our heterogeneous multiprocessor synthesis framework.

  • On the scalability of image and signal processing parallel applications on emerging cc-NUMA many-cores

    Publication Year: 2012 , Page(s): 1 - 8

    Nowadays, single-chip cache-coherent multi-cores of up to 100 cores are a reality, and many-cores of hundreds of cores are planned in the near future. This technological shift undertaken by the high-end computer industry is converging with the design motivation of other domains like the embedded and HPC industries. In this paper, we propose to investigate the scalability of the same four unmodified, shared-memory, image and signal processing oriented parallel applications on two targets: (i) embedded - TSAR, a single-chip, 256-core, Cycle-Accurate-Bit-Accurate simulated, cc-NUMA many-core; and (ii) high-end - an AMD Opteron Interlagos, 64-core, cc-NUMA many-core. Besides our scalability results on both cc-NUMA targets, our contributions include two operating system mechanisms: (i) a distributed, client/server based scheduler design allowing the kernel to offer scalable inter-thread synchronization mechanisms; and (ii) a kernel-level memory affinity technique named Auto-Next-Touch allowing the kernel to transparently and automatically migrate physical pages in order to enforce the locality of threads' memory accesses. Although these two mechanisms are implemented and evaluated in ALMOS (Advanced Locality Management Operating System) running on the TSAR target, they remain applicable to other shared-memory operating systems.

  • Programmable routers for efficient mapping of applications onto NoC-based MPSoCs

    Publication Year: 2012 , Page(s): 1 - 8
    Cited by:  Papers (1)

    We extend the state-of-the-art DSPIN network-on-chip architecture by defining programmable NoC routers that can establish effective static scheduling and routing of data packets as demanded by the application. Router programs are the result of a general compilation process which targets the NoC and the computing cores together. The objective is to reduce NoC contention, improving speed and timing predictability. We consider the range of applications of such an approach and provide results on two of them (a simple embedded controller and an FFT).

  • Session 5 - Special session: Arithmetic for image and signal processing (chair: Daniel Menard, University of Rennes 1)

    Publication Year: 2012 , Page(s): 1
    Freely Available from IEEE