ASIC, 2009. ASICON '09. IEEE 8th International Conference on

Date 20-23 Oct. 2009

  • A LUT-based VRC model for random logic function evolution

    Publication Year: 2009 , Page(s): 1 - 5

    Recently, the VRC (virtual reconfigurable circuit) has become a mainstream solution for EHW (evolvable hardware) research. In this paper, a LUT-based VRC model is proposed, which can be applied to random logic function evolution. Different kinds of LUTs with appropriate interconnections were studied on an FPGA-based platform. Experiments were also performed on this platform to compare against current VRC models such as VRC1 [SinMan] and VRC2 [Sekanina]. The results show that a 3-input LUT with a direct interconnection achieves about an 8% improvement in fitness value after 20,000 generations and a clear improvement in logic resource utilization rate.

  • Low power design of VLSI circuits and systems

    Publication Year: 2009 , Page(s): 17 - 20
    Cited by:  Papers (1)

    Power consumption is the bottleneck of system performance and is listed as one of the top three challenges in ITRS 2008. Low power design can be exploited at various levels, e.g., system level, architecture level, circuit level, and device level. This paper first gives a brief overview of low power optimization techniques at the system and architecture levels, then focuses on circuit-level methods, specifically state-of-the-art low power design techniques for clocking systems.

  • A multi-channel, area-efficient, audio sampling rate interpolator

    Publication Year: 2009 , Page(s): 21 - 24
    Cited by:  Papers (1)

    The area and power consumption of a sampling rate converter are governed largely by the associated digital interpolation filters. This paper presents a novel multi-channel, area-efficient audio sampling rate interpolator with conversion ratios of 1:2 and 1:4. Several architectural and implementation features reduce the complexity of the filter and allow its realization in a die area of 0.032 mm² in 0.18 µm technology, while a timing multiplexer scheme reduces the clock frequency to a minimum. Experiments show that the proposed methods not only save hardware resources but also reduce power consumption, making the design well suited to consumer electronics.

  • Low-power MCML circuit with sleep-transistor

    Publication Year: 2009 , Page(s): 25 - 28
    Cited by:  Papers (1)

    This paper proposes a low-power MOS current mode logic (MCML) circuit with a sleep transistor to reduce the leakage current. The sleep transistor is implemented as a high-threshold-voltage transistor to minimize the leakage current. A 16×16-bit parallel multiplier is designed with the proposed technique. Compared with the previous MOS current mode logic circuit, the circuit achieves a reduction of the power consumption in sleep mode by 1/258. The circuit is designed in a Samsung 0.35 µm CMOS process. Its validity and effectiveness are verified through HSPICE simulation.

  • Architecture design of variable lengths instructions expansion for VLIW

    Publication Year: 2009 , Page(s): 29 - 32

    In current instruction set architecture (ISA) design, fixed-length instructions are beneficial for improving the efficiency of instruction dispatch. But in embedded computers, where memory is limited, variable-length instructions have a much lower memory cost. In this VLIW (very long instruction word) architecture, a two-stage pipeline is used to expand and dispatch the variable-length instructions. When the CPU receives a packet of instructions, a fixed number of instructions are expanded in the first pipeline stage. In the second pipeline stage, the CPU dispatches the instructions that execute in the same pipeline cycle.

  • A systolic architecture with linear space complexity for longest common subsequence problem

    Publication Year: 2009 , Page(s): 33 - 36

    The longest common subsequence (LCS) problem is to find an LCS of two strings and the length of the LCS (LLCS). Many previous works focused on reducing the processing time. However, most require too much memory in total, making them unsuitable for hardware implementation. In this paper, we propose a hardware-implementable algorithm and its systolic architecture. The algorithm achieves linear space complexity, and the systolic architecture is feasible for hardware implementation. For two given strings of lengths m and n, the algorithm has lower time complexity when the LLCS approaches the minimum of m and n. Furthermore, a scalable architecture is proposed to deal with LCS problems on two huge strings whose lengths are far greater than m and n. Our scalable systolic architecture with linear space complexity for the LCS problem is therefore suitable for hardware implementation, and the synthesis results show that our architecture is more efficient.
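
For readers unfamiliar with what "linear space complexity" means for the LLCS, the sketch below shows the textbook two-row dynamic program in software. It is only an illustration of the space bound, not the systolic algorithm proposed in the paper; the function name `lcs_length` is a hypothetical choice made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (LLCS) of a and b.

    Textbook two-row dynamic program: only the previous and current rows
    of the DP table are kept, so extra space is O(min(m, n)).
    This is NOT the paper's systolic algorithm, only an illustration.
    """
    # Put the shorter string along the row so the row buffers stay small.
    if len(b) > len(a):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                # Matching characters extend the LCS of both prefixes.
                curr[j] = prev[j - 1] + 1
            else:
                # Otherwise drop one character from either string.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]
```

Recovering the subsequence itself in linear space needs an additional divide-and-conquer step (for example Hirschberg's method), which is omitted here.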

  • SFG realization of wavelet filter using switched-current circuits

    Publication Year: 2009 , Page(s): 37 - 40

    A signal flow graph method for the analogue implementation of the wavelet transform using switched-current circuits is proposed, in which the wavelet transform is synthesized by a bank of switched-current bandpass filters whose impulse responses are the mother wavelet and its dilations. To facilitate the implementation of arbitrary wavelet functions, the proposed approach employs the signal flow graph methodology to design the wavelet filter. The first derivative of the Gaussian wavelet is used as an example to illustrate the design details. Simulation results show the feasibility of the proposed method.

  • Design of a high reliable L1 data cache with redundant cache

    Publication Year: 2009 , Page(s): 41 - 45

    Modern high-performance processors use cache memory systems to tolerate the increasing latency of main memory. As IC technology advances, the complicated cache memory systems in processors become very vulnerable to soft errors in harsh environments. To deal with multiple soft errors with little impact on hardware overhead and performance, this paper proposes a new cache memory system in which redundant cache blocks are integrated into a set-associative L1 data cache. Each redundant cache block stores a replica of the "dirty" data in the corresponding L1 data cache block. To detect multiple soft errors in L1 data cache blocks with little hardware overhead, a bit-interleaved group parity code is adopted. Moreover, to increase the mapping rate between L1 data cache blocks and the redundant cache blocks, an early write-back protocol is introduced, in which all dirty cache blocks are written back to the L2 cache at intervals of a fixed number of cycles. The proposed cache system can provide stronger soft error protection than conventional error correction codes. Experimental results show that the proposed cache system can provide replicas for almost 100% of the dirty cache blocks in the L1 data cache on average.
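
The idea behind interleaved group parity can be sketched in a few lines. The example below assumes one common layout, in which bit i of a word belongs to parity group i mod G, so that adjacent bits always fall into different groups; the abstract does not state the paper's exact code layout, and the names `group_parity` and `error_detected` are hypothetical.

```python
def group_parity(word: int, width: int, groups: int) -> list[int]:
    """Return one even-parity bit per interleaved group.

    Assumed layout (not taken from the paper): bit i of the word belongs
    to group i % groups, so neighbouring bits never share a parity bit.
    """
    parity = [0] * groups
    for i in range(width):
        parity[i % groups] ^= (word >> i) & 1
    return parity


def error_detected(word: int, stored_parity: list[int], width: int) -> bool:
    """Compare recomputed group parity against the stored parity bits."""
    return group_parity(word, width, len(stored_parity)) != stored_parity


# A burst that flips two adjacent bits escapes a single parity bit,
# but is caught once the bits are interleaved across four groups.
original = 0b10110010
corrupted = original ^ 0b00001100  # flip bits 2 and 3
stored = group_parity(original, width=8, groups=4)
burst_caught = error_detected(corrupted, stored, width=8)
```

With this layout, any burst of up to G adjacent bit flips changes at least one group parity, which is why interleaving helps against multi-bit soft errors. Detection alone does not correct the data; in the scheme described above, recovery would rely on the replica held in the redundant cache block.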

  • Switching activity calculation of VLSI adders

    Publication Year: 2009 , Page(s): 46 - 49
    Cited by:  Papers (3)

    Using exact switching activity rates at all internal nodes when calculating the energy of digital circuits is believed to give better accuracy than using the average switching activity. We compare the two approaches for the Kogge-Stone adder implemented with the Weinberger and Ling addition recurrences. The difference between the two is less than 4%. We further examine the accuracy of the energy/delay estimation technique when using exact and average switching activities at the 65 nm, 45 nm, 32 nm and 22 nm technology nodes. Even then, the worst-case error in estimating energy is under 15%, at the 22 nm technology node for a 64-bit Kogge-Stone adder. The error in delay estimation is less than 6% for all the nodes. Our finding is that using the average switching activity does not yield large errors while greatly simplifying the estimation process.

  • A precise SystemC-AMS model for Charge Pump Phase Lock Loop with multiphase outputs

    Publication Year: 2009 , Page(s): 50 - 53
    Cited by:  Papers (3)

    SystemC-AMS, an extension of SystemC, provides the essential capability to describe a mixed-signal heterogeneous system, so that a virtual-prototype model can be generated to help analyze a whole mixed-signal system and further guide the circuit design. This paper presents an example of SystemC-AMS modeling for a 10-phase 500 MHz charge pump phase lock loop, including digital models such as the phase/frequency detector and clock N-divider, and analog models such as the charge pump, low-pass filter and voltage-controlled oscillator. To demonstrate the model's accuracy, SPICE simulation results from the corresponding CMOS circuits, which share the structure of these models, are used for comparison, and the PLL SystemC-AMS model is validated.

  • A 2.84W 16port Switch ASIC for high performance computing systems

    Publication Year: 2009 , Page(s): 54 - 57

    This paper introduces the structure, design trade-offs, and power-optimized physical implementation of a 2.84 W switch ASIC targeted at large-scale parallel computing systems. The chip not only supports multi-layer, multi-function packet switching with high throughput and low latency, but also provides advanced global barrier acceleration among its 16 full-duplex ports. At 156.25 MHz, the chip has 83.2 ns zero-load latency, 80 Gbps port-switching capacity and 240 Gbps internal packet-switching capacity, with a per-port data throughput of 2 × 2.5 Gbps. The ASIC has been taped out in a 0.18 µm, 6-metal CMOS technology; it has about 20 million transistors, a 12.39 mm × 12.39 mm die, and a 1053-pin flip-chip package. The first-pass silicon of this switch ASIC has passed DFT, functional and system-level testing.

  • Design of multi-valued double-edge-triggered JK flip-flop based on neuron MOS transistor

    Publication Year: 2009 , Page(s): 58 - 61

    The neuron MOS transistor (neuMOS) is a new device with multiple input gates and one floating gate. It computes a weighted sum of the input gate signals and then applies a threshold operation to the result, thereby simulating the function of biological neurons. The neuron MOS transistor's multiple input gates and floating-gate capacitive coupling effect can be used to solve the multi-valued output problem. By studying the design principles of multi-valued logic circuits and the redundancy suppression method, this paper presents a design scheme for a multi-valued double-edge-triggered JK flip-flop. Compared with the conventional multi-valued JK flip-flop, this circuit reduces redundant clock transitions, consumes less power, and operates faster. Furthermore, the proposed scheme can be applied to the design of higher-radix multi-valued circuits. Finally, the designed circuit is verified by PSPICE simulation.

  • Trends of terascale computing Chips in the next ten years

    Publication Year: 2009 , Page(s): 62 - 66

    Moore's law steadily continues though facing a number of challenges. This paper identifies ongoing and desirable trends to exploit the technology capacity and further Moore's law for terascale on-chip computing architectures in the next ten years. Four foreseeable trends are: from single core to many cores, from bus-based to network-based interconnect, from centralized memory to distributed memory, and from 2D integration to 3D integration. We motivate these trends and show that the number of design choices for computing chips is increasing rapidly, leading to an exploding design space with uncountable opportunities for the innovative architect. Moreover, we envision that the multi-core Network-on-Chip will become an infrastructure backbone and accumulate many other infrastructural functions such as memory, power and resource management, testing and diagnostic services.

  • A high power efficiency reconfigurable processor for multimedia processing

    Publication Year: 2009 , Page(s): 67 - 70

    Nowadays mobile multimedia raises the demand for higher performance and greater flexibility under strict energy constraints. The reconfigurable multimedia array processor (ReMAP) provides an architecture with highly programmable coarse-grained computational resources and flexible interconnect. To reduce the power consumption of memory access, we present an approach to computation control for the reconfigurable processor. By configuring the operations and fixing the data path of the computational resources at initialization, data streams through the processor to carry out the algorithm without frequent accesses to context memory, which can achieve nearly the same energy consumption as ASICs.

  • DReNoC: A dynamically reconfigurable computing system based on network-on-chip

    Publication Year: 2009 , Page(s): 71 - 74

    A dynamically reconfigurable computing system based on network-on-chip (DReNoC) is proposed, which consists of computing nodes and communication nodes. Each computing node is a complete coarse-grained dynamically reconfigurable SoC named DReSoC, and the DReSoCs communicate with each other through on-chip network routers. The proposed DReNoC has been implemented on the Altera Stratix II EP2S180 DSP development board with 48063 combinational ALUTs and 26211 logic registers. Experimental results for sequential 8×8 matrix multiplications show that, compared with a single-core system-on-chip (SoC) based on the standard Nios II processor, the speed-up ratio can reach 124.91.

  • An efficient parallel instruction execution method for VLIW DSP

    Publication Year: 2009 , Page(s): 75 - 78

    LILY is a high-performance VLIW DSP processor for multimedia applications, developed by Tsinghua University. The processor classifies instructions and determines whether they should be issued in parallel according to their order. With this form of parallelism, the LILY processor saves one bit of operation code at the cost of inserting very few no-operation (NOP) instructions. In addition, a corresponding assembler must be designed to accommodate this new parallelism, which helps LILY implement the method efficiently. The evaluation results show that the processor is well suited to high-performance applications, with high code density and small program code size.

  • Dynamic context management for coarse-grained reconfigurable array DSP architecture

    Publication Year: 2009 , Page(s): 79 - 82

    This paper proposes a novel dynamic context management scheme for a coarse-grained reconfigurable array DSP architecture, which effectively reduces power consumption and speeds up the reconfiguration process. The technique permits background loading of configuration data without interrupting regular execution, overlapping computation with reconfiguration. Stored configurations can also be switched, dramatically reducing reconfiguration overhead if the next configuration is present in one of the alternate contexts. The proposed technique has been verified in ReMAP (Reconfigurable Multi-media Array Processors) with the Discrete Cosine Transform (DCT) of H.264, and its performance exceeds that of other DSP and multimedia extension architectures by 1.2x to 6.2x. ReMAP was fabricated in SMIC's 0.18 µm CMOS process, mainly for multimedia applications.

  • A reconfigurable architecture specific for the butterfly computing

    Publication Year: 2009 , Page(s): 83 - 86

    Morphosys is a reconfigurable single instruction multiple data (SIMD) architecture mainly composed of a host core processor, a reconfigurable cell array, a frame buffer, a context memory and a direct memory access (DMA) module. As in other common SIMD-based coarse-grained reconfigurable architectures, each context configuration and operation applies to a whole row or column, which may be inefficient in some applications such as butterfly computing. In this paper, an improved reconfigurable architecture is proposed specifically for butterfly computing applications. The main work includes optimization of the interconnection network design, redefinition of the context memory architecture with quadrant binding, the addition of a DMA channel, and corresponding modifications to the reconfigurable cell. With these improvements, the new architecture can implement typical butterfly computations with a cycle count about 5.53% to 18.9% lower than that of Morphosys.

  • Optimal-Partition Based code compression for embedded processor

    Publication Year: 2009 , Page(s): 87 - 90
    Cited by:  Papers (1)

    Memory is one of the most restricted resources in embedded systems. Code compression techniques address this issue by reducing the code size of programs. Huffman coding is the most commonly used coding method. However, when symbols are generated from instructions, an experience-based partitioning is usually used, which may cause information redundancy. This paper presents an optimal-partition based code compression (OPCC) method. A Markov tree model is used to extract the correlation between bits in an instruction. A clustering algorithm is proposed to cluster bits with higher correlation into symbols. Experimental results show that this method can improve the average compression ratio by 4.1%. The decoder is validated on an Altera Cyclone II FPGA.

  • T2-TAM: Reusing infrastructure resource to provide parallel testing for NoC based Chip

    Publication Year: 2009 , Page(s): 91 - 96
    Cited by:  Papers (1)

    Reusing the network-on-chip (NoC) as a test access mechanism (TAM) has been adopted to transfer test data to embedded cores. However, observation shows that, compared to NoC-reuse TAMs, some bus-based TAMs achieve better test times owing to their fine-grained scheduling unit. This paper proposes a new TAM named Test Tree (T2). The T2-TAM can be built by reusing the hardware resources of routers instead of reusing the packet-based NoC. By implementing the DFT design on routers, the T2-TAM can achieve wire utilization and adopt fine-grained basic scheduling. In addition, to address the problem of testing a large number of homogeneous cores, the T2-TAM supports multicasting stimuli to homogeneous cores to save test time. Experimental results show that test cycles can be reduced by up to 38% compared with work that reuses the NoC as the TAM, with only 0.3% DFT overhead.

  • Transaction level model of NoC based on SystemC

    Publication Year: 2009 , Page(s): 97 - 100

    This paper presents a transaction-level on-chip communication network model, including routers and links, which can be easily employed in a system-level system-on-chip simulation framework for early functional verification and architecture analysis. The model provides the NoC's latency and throughput information during simulation and is developed in SystemC to achieve high simulation speed.

  • An energy-aware heuristic constructive mapping algorithm for Network on Chip

    Publication Year: 2009 , Page(s): 101 - 104
    Cited by:  Papers (3)

    Network on chip (NoC) is a promising interconnection solution for ever-increasing system complexity and the design productivity gap. Mapping the IP cores onto a given platform is an important phase of NoC design that can greatly affect the performance and energy consumption of the chip. In this paper, we analyze existing mapping algorithms and categorize them into three classes according to how they obtain a near-optimal mapping. We present a fast hybrid heuristic constructive algorithm, CMAP, to map cores onto NoC architectures with the goal of minimizing the total communication energy consumption. The algorithm is applied to two real applications and a series of task graphs generated by the TGFF package. The accuracy, efficiency and scalability of the proposed algorithm are confirmed by comparing its results with those of other mapping algorithms.

  • Speedup analysis of data-parallel applications on Multi-core NoCs

    Publication Year: 2009 , Page(s): 105 - 108
    Cited by:  Papers (1)

    As more computing cores are integrated onto a single chip, the effect of network communication latency is becoming more and more significant on multi-core networks-on-chip (NoCs). For data-parallel applications, we study a model of parallel speedup that includes network communication latency in Amdahl's law. The speedup analysis considers the effects of network topology, network size, traffic model and computation/communication ratio. We also study the speedup efficiency. On our multi-core NoC platform, a real data-parallel application, matrix multiplication, is used to validate the analysis. Our theoretical analysis and the application results show that the speedup improvement is nonlinear and that the speedup efficiency decreases as the system size is scaled up. Such analysis can guide architects and programmers in improving parallel processing efficiency by reducing network latency through optimized network design and by increasing the proportion of computation in the program.
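
The qualitative claim (nonlinear speedup, falling efficiency) can be reproduced with a toy extension of Amdahl's law. The abstract does not give the paper's formula, so the model below is an assumption made purely for illustration: communication cost is a fixed fraction of the single-core run time per additional core, and the names `speedup`, `efficiency` and `comm_per_core` are hypothetical.

```python
def speedup(n_cores: int, parallel_fraction: float, comm_per_core: float) -> float:
    """Amdahl's law with an added communication term (illustrative only).

    Normalised run time on n cores is modelled as
        (1 - p) + p / n + c * (n - 1)
    where p is the parallel fraction and c is an ASSUMED communication
    cost per additional core, as a fraction of the single-core run time.
    The paper's own model depends on topology and traffic and may differ.
    """
    serial = 1.0 - parallel_fraction
    parallel = parallel_fraction / n_cores
    communication = comm_per_core * (n_cores - 1)
    return 1.0 / (serial + parallel + communication)


def efficiency(n_cores: int, parallel_fraction: float, comm_per_core: float) -> float:
    """Speedup per core; 1.0 would mean perfect scaling."""
    return speedup(n_cores, parallel_fraction, comm_per_core) / n_cores


# Sweep the system size for a 90%-parallel workload with 1% cost per extra core.
sweep = [(n, speedup(n, 0.9, 0.01), efficiency(n, 0.9, 0.01)) for n in (1, 4, 16, 64)]
```

Under these assumed parameters the speedup rises, flattens and then falls as cores are added, while efficiency drops monotonically, which matches the trend described in the abstract without reproducing its numbers.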

  • Uniform routing architecture for FPGA with embedded IP cores

    Publication Year: 2009 , Page(s): 109 - 112
    Cited by:  Papers (1)

    A uniform routing architecture is presented, which offers CLBs, IOBs and IP cores identical routing resources. In addition, some methods are adopted to optimize routing performance. With all-unidirectional segmented lines and long lines with inserted tap buffers, this architecture is up to 9.8% faster than long lines without inserted buffers and on average 14.9% faster than bidirectional lines. Simulation results support the approach.

  • Mixed optimization method in design of FC-2

    Publication Year: 2009 , Page(s): 113 - 116

    The mixed optimization method proposed in this paper combines the analysis of multi-level protocols with the extraction of single-level protocol flow charts to design accelerating hardware that improves the performance of FC-2. We implement the accelerating hardware in a standard 0.18 µm CMOS technology. Compared with the frame-based design method, the proposed method improves performance by 4.5 times in normal communication, and by even more when error sequences are encountered, at 2.8 times the area of the frame-based design.
