
13th Asia-Pacific Computer Systems Architecture Conference (ACSAC 2008)

Date: 4-6 Aug. 2008


Displaying Results 1 - 25 of 59
  • Concurrency engineering

    Page(s): 1 - 8

    This is a discussion paper on a very important topic that is about to become mainstream. It deals with the issues of software engineering in concurrent systems. It introduces the topic, illustrates the arguments for a change of perspective, and underlines these arguments with two examples: an asynchronous stream-based programming model and an asynchronous thread-based virtual machine model. The two support concurrency at very different levels of abstraction, yet both capture similar support for concurrency engineering.

  • An ALU cluster with floating point unit for media streaming architecture with homogeneous processor cores

    Page(s): 1 - 7

    Recent research shows that the stream processing model is suitable for portable media applications. However, previous implementations of stream processors suffer from high power consumption and chip-area cost; such designs target supercomputer architectures and scientific computation rather than real-time media applications. This paper proposes an arithmetic logic unit (ALU) cluster with an Advanced Microcontroller Bus Architecture (AMBA) platform interface, used as a reconfigurable hardware accelerator for portable media applications. The proposed design is implemented and fabricated in TSMC 0.15 um technology with backend magnetic RAM (MRAM) process integration. The floating point unit (FPU) improves performance by an average factor of 3.2 while adding only 10.8% area overhead. Measurements also reveal twice the power efficiency of previous designs based on traditional architectures. The outstanding area-performance trade-off of the FPU and homogeneous cores, together with the power efficiency and design methodologies of this work, contributes a turnkey solution for modern portable multimedia devices.

  • A Hybrid protocol for Cluster-based wireless sensor networks

    Page(s): 1 - 5

    Most wireless sensor network (WSN) research considers how to save the energy of the sensor node. However, in some WSN applications, such as monitoring an earthquake or a forest wildfire, delivering emergency data packets to the sink node in time is much more important than saving power. In this paper, we propose a hybrid cluster-based (HC) WSN model. The HC WSN model provides a feasible cluster-based WSN architecture that saves sensor-node energy under normal conditions; simulation analysis shows that it also delivers emergency data packets to the sink node efficiently during an emergency.

  • Hardware transactional memory system for parallel programming

    Page(s): 1 - 7

    Hardware transactional memory (HTM) has been an attractive research topic in recent years. It has great potential to simplify parallel programming on the soon-to-be-ubiquitous multi-core systems. In this paper, an HTM design is proposed and its overall performance is evaluated. The design distinguishes itself from others by its best-effort philosophy: the hardware makes a best effort to complete each transaction, and software handles those transactions that the hardware cannot complete. The design seeks a balance between application performance and hardware implementation complexity, and tries to answer the question of what should be done by hardware and what by software. The overall performance of benchmarks is evaluated by simulation.

  • RAID10L: A high performance RAID10 storage architecture based on logging technique

    Page(s): 1 - 8

    RAID10 storage systems suffer relatively poor write performance because every write request must be served by both disks in a mirror set. To address this problem, this paper proposes a novel RAID10 storage architecture, called RAID10L, which extends the data mirroring redundancy of RAID10 with a dedicated log disk. The goal of RAID10L is to significantly improve the write performance of RAID10 at a small cost in reliability. In RAID10L both read and write requests are processed under a balancing scheme. For every write request, RAID10L keeps two copies of the data: one in its normal place on a data disk chosen by a write-balancing scheme, and the other appended sequentially to the log disk. The update to the other data disk in the mirror set is delayed until the next quiet period between bursts of client activity. Reliability analysis shows that the reliability of RAID10L, in terms of MTTDL (mean time to data loss), is somewhat worse than RAID10 but much better than RAID5. On the other hand, our prototype implementation of RAID10L driven by the Iometer benchmark shows that RAID10L outperforms RAID10 by up to 47.1% and RAID0 by 27.3% in terms of average response time. Driven by real-life traces, RAID10L improves average response time over RAID10 by up to 30.7%, with an average of 27.7%.

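    The write path described above can be sketched as follows. This is a minimal illustrative model under our reading of the abstract, not the authors' implementation; the class and method names are hypothetical. Each write lands on one disk of a mirror pair plus a sequential log append, and the second mirror copy is installed later, during idle time.

```python
# Sketch of a RAID10L-style write path (hypothetical API, not the paper's code):
# one mirror copy now, one sequential log append now, the second mirror copy later.

class Raid10L:
    def __init__(self, num_pairs):
        # Each mirror pair is modeled as two dicts mapping block -> data.
        self.pairs = [({}, {}) for _ in range(num_pairs)]
        self.log = []          # append-only log disk: (pair, primary, block, data)
        self.reads = [0] * (2 * num_pairs)   # per-disk load, for balancing

    def write(self, block, data):
        pair = block % len(self.pairs)
        # Write balance: pick the less-loaded disk of the pair as the primary copy.
        primary = 0 if self.reads[2 * pair] <= self.reads[2 * pair + 1] else 1
        self.pairs[pair][primary][block] = data
        # Second copy goes to the dedicated log disk as a sequential append.
        self.log.append((pair, primary, block, data))

    def read(self, block):
        pair = block % len(self.pairs)
        for disk in (0, 1):
            if block in self.pairs[pair][disk]:
                self.reads[2 * pair + disk] += 1
                return self.pairs[pair][disk][block]
        return None

    def sync(self):
        # Quiet-period task: install the delayed mirror copies, then trim the log.
        for pair, primary, block, data in self.log:
            self.pairs[pair][1 - primary][block] = data
        self.log.clear()
```

    After `sync()` both disks of the mirror set hold the block, restoring full RAID10 redundancy; until then the log copy stands in for the delayed mirror copy.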
  • Implementation of a precision time protocol over low rate wireless personal area networks

    Page(s): 1 - 8

    Time synchronization is essential for a number of network applications. As the era of ubiquitous computing is ushered in, high-precision time synchronization of nodes in wireless networks is required, and it enables a variety of extended applications. This paper presents the design and implementation of the precision time protocol over low-rate wireless personal area networks (LR-WPANs). To achieve high precision in LR-WPANs, we analyze the sources of latency and jitter in wireless environments and aim to minimize them. In addition, this paper presents experiments and a performance evaluation of the precision time protocol in LR-WPANs. The result is that nodes in a network maintain their clocks to within a 50-nanosecond offset from the reference clock.

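    For reference, the offset the protocol corrects is computed from the standard IEEE 1588 timestamp exchange; this is the textbook protocol math, not code from the paper:

```python
# IEEE 1588 offset/delay computation from one Sync / Delay_Req exchange:
# t1 = master sends Sync, t2 = slave receives it,
# t3 = slave sends Delay_Req, t4 = master receives it.
# Assuming a symmetric path, the slave's clock offset and one-way delay are:

def ptp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) - (t4 - t3)) / 2   # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2    # mean one-way path delay
    return offset, delay
```

    The jitter analysis in the paper matters because any asymmetry or variability in the two path delays shows up directly as error in `offset`.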
  • Resource sharing control in Simultaneous MultiThreading microarchitectures

    Page(s): 1 - 8

    Simultaneous multithreading (SMT) achieves improved system resource utilization, and accordingly higher instruction throughput, because it exploits thread-level parallelism (TLP) in addition to conventional instruction-level parallelism (ILP). The key to high-performance SMT is to optimize the distribution of shared system resources among the threads. However, the existing dynamic sharing mechanism has no control over the resource distribution, which can let one thread grab too many resources and clog the pipeline, and existing fetch policies address the resource distribution problem only indirectly. In this work, we strive to quantitatively determine the balance between controlled resource allocation and dynamic sharing of different system resources, along with their impact on the performance of SMT processors. We find that controlling the resource sharing of either the instruction fetch queue (IFQ) or the reorder buffer (ROB) is not sufficient if implemented alone. However, controlling the resource sharing of both the IFQ and the ROB yields an average performance gain of 38% compared with the dynamic sharing case; the average L1 D-cache miss rate is reduced by 33%, and the average time an instruction resides in the pipeline by 34%. This demonstrates the power of the resource sharing control mechanism we propose.

  • Performance models for Cluster-enabled OpenMP implementations

    Page(s): 1 - 8

    A key issue for cluster-enabled OpenMP implementations based on software distributed shared memory (sDSM) systems is maintaining the consistency of the shared memory space. This forms the major source of overhead for these systems, and is driven by the detection and servicing of page faults. This paper investigates how application performance can be modelled from the number of page faults. Two simple models are proposed: one based on the number of page faults along the critical path of the computation, and one based on the aggregate number of page faults. Two different sDSM systems are considered. The models are evaluated using the OpenMP NAS parallel benchmarks on an 8-node AMD-based Gigabit Ethernet cluster. Both models gave estimates accurate to within 10% in most cases, with the critical-path model showing slightly better accuracy; accuracy is lost if the underlying page faults cannot be overlapped, or if the application makes extensive use of the OpenMP flush directive.

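    In essence, both models are linear predictors over page-fault counts. The sketch below is our illustrative reading of that idea, with made-up parameter names, not the paper's exact formulation:

```python
# Two page-fault-based runtime models (illustrative): a fault-free base time
# plus a fixed service cost per page fault, counted two different ways.

def critical_path_model(t_base, faults_per_thread, t_fault):
    # Faults on the critical path: the slowest thread bounds the run.
    return t_base + max(faults_per_thread) * t_fault

def aggregate_model(t_base, faults_per_thread, t_fault, n_threads):
    # Aggregate fault count, assumed to spread evenly over the threads.
    return t_base + sum(faults_per_thread) * t_fault / n_threads
```

    The abstract's caveat maps directly onto these formulas: if faults overlap with computation, the effective `t_fault` shrinks, and both models overestimate.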
  • Designs of the basic block reassembling Instruction Stream Buffer for X86 ISA

    Page(s): 1 - 8

    The potential performance of superscalar processors can be exploited only when the processor is fed with sufficient instruction bandwidth. The front-end units, the Instruction Stream Buffer (ISB) and the fetcher, are the key elements for achieving this goal. Current ISBs cannot support instruction streaming beyond a basic block, and in X86 processors the split-line instruction problem worsens the situation. In this paper we propose a basic-block-reassembling ISB. Cooperating with the proposed Line Weighted Branch Target Buffer (LWBTB), the ISB can predict branch information in advance and reassemble cache lines, so the front end can fetch more valid instructions per cycle from a line reassembled to contain the instructions of the next basic block. Simulation results show that with cache lines larger than 64 bytes, two basic blocks frequently fit in the reassembled instruction line, and fetch efficiency is about 90% when the fetch capacity is under 6.

  • The potential of fine-grained value prediction in enhancing the performance of modern parallel machines

    Page(s): 1 - 8

    The newly emerging many-core-on-a-chip designs have renewed an intense interest in parallel processing. By applying Amdahl's formulation to the programs in the PARSEC and SPLASH-2 benchmark suites, we find that most applications may not have enough parallelism for modern parallel machines. However, value prediction techniques may allow the "parallelization" of the sequential portion by predicting values before they are produced. We extend Amdahl's formulation to model the data redundancy inherent in each benchmark. Our analysis shows that, compared to exploiting only the intrinsic parallelism, performance may improve by a factor of 180.6% for the PARSEC suite and 232.6% for the SPLASH-2 suite. This demonstrates the immense potential of fine-grained value prediction in enhancing the performance of modern parallel machines.

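    Amdahl's formulation, and the kind of extension the abstract describes, can be written down directly. The second function is our illustrative reading of the idea (value prediction converting part of the serial fraction into parallel work), not the authors' exact model:

```python
# Amdahl's law, plus a hypothetical extension: if value prediction lets a
# fraction q of the serial portion run in parallel, the effective parallel
# fraction of the program grows accordingly.

def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of cores.
    return 1.0 / ((1.0 - p) + p / n)

def speedup_with_prediction(p, q, n):
    # q: fraction of the serial part made parallelizable by value prediction.
    p_eff = p + q * (1.0 - p)
    return amdahl_speedup(p_eff, n)
```

    For example, with p = 0.9 on 16 cores, plain Amdahl caps the speedup at 6.4; predicting away half of the remaining serial work raises the ceiling noticeably, which is the effect the paper quantifies per benchmark.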
  • UML-based hardware/software co-design platform for dynamically partially reconfigurable network security systems

    Page(s): 1 - 8

    We propose a UML-based hardware/software co-design platform for partially reconfigurable systems, targeting mainly network security systems. Computation-heavy applications are implemented as partially reconfigurable hardware tasks to enhance system performance and flexibility, meaning that a network security embedded system can dynamically reconfigure one part of the system at run time according to different security needs while the other parts keep functioning. We further propose a partially reconfigurable hardware template with which users only need to integrate their hardware applications, without going through the full partial reconfiguration flow. The template has an average overhead of only 0.62% of the total resources of a Xilinx Virtex-II XC2V3000 FPGA. Furthermore, our platform includes a UML-based system model that can directly interact with the system hardware architecture. Whereas synthesis-based estimation methods show inaccuracies ranging from -23% to +234% for the execution time of dynamically partially reconfigurable hardware tasks, with our platform users can directly measure execution times and use them to validate system correctness and performance at a high-level phase, which significantly reduces the number of iterations in system development.

  • FPGA-based Equivalent Simulation Technology (FEST) for clustered stream architecture

    Page(s): 1 - 8

    Stream architecture research is often hindered by slow software simulations. Simulators based on FPGAs are much faster; however, larger-scale stream architecture simulation needs more FPGA resources, which may require more FPGA chips or larger-capacity chips, increasing both design complexity and the cost of research. This paper proposes FPGA-based equivalent simulation technology (FEST) and constructs an equivalent model, called the FEST model, based on it. FEST supports cluster-scalable simulation of clustered stream architectures by replacing some components with simpler structures of equivalent function. A simulator based on the FEST model (1) needs fewer FPGA resources than the original system with little influence on simulation speed, (2) is accurate to cycle-level resolution, (3) can run unmodified applications, and (4) can reproduce the simulation results of the original system, including resource consumption and timing analysis.

  • Hamiltonian-cycle-based multicasting on wormhole-routed torus networks

    Page(s): 1 - 8

    In this paper, we propose an efficient multipath multicast routing algorithm for wormhole-routed 2D torus networks. We first introduce a Hamiltonian cycle model that exploits the features of torus networks and, based on this model, find a Hamiltonian cycle in the network. Then an efficient multipath multicast routing algorithm based on the Hamiltonian cycle model (multipath-HCM) is presented. The algorithm utilizes communication channels more uniformly in order to reduce the path length of routing messages, making multicasting more efficient. Simulation results show that the multicast latency of the proposed multipath-HCM routing algorithm is superior to that of fixed-path and dual-path routing algorithms.

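    For reference, a Hamiltonian cycle in an m x n torus can be built with a standard boustrophedon construction; this is one well-known way to obtain such a cycle, not necessarily the paper's construction. The snake covers columns 1..n-1 row by row, and the return leg runs up column 0, using a torus wraparound edge when the row count is odd:

```python
# Build a Hamiltonian cycle on an m-row by n-column 2D torus (n >= 2):
# snake through columns 1..n-1 row by row, then return along column 0.

def torus_hamiltonian_cycle(m, n):
    cycle = [(0, 0)]
    for r in range(m):
        # Even rows go left-to-right, odd rows right-to-left, over cols 1..n-1.
        cols = range(1, n) if r % 2 == 0 else range(n - 1, 0, -1)
        cycle.extend((r, c) for c in cols)
    # Return leg: step into column 0 of the last row (a wraparound edge when
    # m is odd), then go straight up column 0 back to the start (0, 0).
    for r in range(m - 1, 0, -1):
        cycle.append((r, 0))
    return cycle
```

    Every node appears exactly once, and each consecutive pair (including the closing edge back to the start) is a torus link, which is what Hamiltonian-cycle-based multicast routing relies on.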
  • A novel spatio-temporal adaptive bus encoding for reducing crosstalk interferences with trade-offs between performance and reliability

    Page(s): 1 - 8

    With advanced process technologies, the decreasing distance between wires has led to significant bus interference that introduces crosstalk delay and noise. We first propose two encoding schemes, DUCE and GASIE, that reduce crosstalk delay and noise on the bus lines. The DUCE scheme is a temporal encoding, so it needs no additional bits and can be used in existing systems without modifying the hardware architecture. To improve performance, we propose a spatial encoding scheme called GASIE, which uses shielding-line protection and additional bits for transmitting control signals. Compared to existing spatial encoding methods, GASIE needs no profiling information yet achieves better results. Finally, we combine DUCE and GASIE into a novel spatio-temporal adaptive encoding (STAE) that trades off performance against reliability. Experimental results for various applications show significant reductions in the number of patterns most likely to produce crosstalk delay and errors. The DUCE scheme completely eliminates two-adjacent transitions and aggressors, while the GASIE scheme achieves up to a 59.9% average reduction of two-aggressor patterns, and the STAE scheme provides a strongly error-tolerant environment with a 70% reduction in aggressors and adjacent transitions at the cost of a 10% performance loss.

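    The metric these encodings target can be illustrated with a small counter. This is a generic sketch of the worst-case crosstalk pattern (adjacent wires switching in opposite directions), not an implementation of DUCE or GASIE themselves:

```python
# Count adjacent wire pairs that switch in opposite directions between two
# consecutive bus transfers -- the pattern that causes worst-case crosstalk
# delay, and the quantity encodings like the above try to reduce.

def opposite_transitions(prev, curr, width):
    # prev, curr: bus words as integers; width: number of bus wires.
    count = 0
    for i in range(width - 1):
        d_i = ((curr >> i) & 1) - ((prev >> i) & 1)
        d_j = ((curr >> (i + 1)) & 1) - ((prev >> (i + 1)) & 1)
        if d_i * d_j == -1:   # one wire rises while its neighbour falls
            count += 1
    return count
```

    An encoder's job, in these terms, is to map the data so that the counted patterns become rare (or impossible) on the physical wires while remaining decodable at the receiver.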
  • Early load: Hiding load latency in deep pipeline processor

    Page(s): 1 - 8

    Load instructions usually have long execution latency in a deep processor pipeline and have a significant impact on overall performance, so hiding load latency is a serious problem in processor design. The latency of a memory load can be separated into two parts: cache-miss latency and load-to-use latency. Previous work on hiding load latency in deep pipelines has some limitations. In this paper, we propose a hardware-based method, called early load, to hide the load-to-use latency with little hardware overhead. The early load scheme allows a load instruction to load data from the cache system before it enters the execution stage, while a detection method ensures the correctness of the early operation before that stage is reached. Our experimental results show that the approach achieves an 11.64% performance improvement on the Dhrystone benchmark and 4.97% on average for the MiBench benchmark suite.

  • Semantic Data De-duplication for archival storage systems

    Page(s): 1 - 9

    In archival storage systems there is a huge amount of duplicate or redundant data, which occupies extra equipment and consumes extra power, lowering resource utilization (such as network bandwidth and storage) and imposing an extra management burden as the scale increases. Data de-duplication, whose goal is to minimize duplicate data at the inter-file level, has therefore received broad attention in both academia and industry in recent years. In this paper, semantic data de-duplication (SDD) is proposed, which uses the semantic information in the I/O path of archival files (such as file type, file format, application hints and filesystem metadata) to direct the division of a file into semantic chunks (SCs). While the main goal of SDD is to maximally reduce inter-file duplication, directly storing variable-sized SCs on disk would produce many fragments and a high percentage of random disk accesses, which is very inefficient. An efficient data storage scheme is therefore also designed and implemented: SCs are packaged into fixed-size objects, which are the actual storage units on the storage devices, so as to speed up I/O as well as ease data management. Preliminary experiments demonstrate that SDD further reduces storage space compared with current methods (from 20% to nearly 50%, depending on the dataset) and largely improves write performance (about 50%-70% on average).

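    Chunk-level de-duplication itself is simple to sketch. The novelty of SDD lies in choosing chunk boundaries semantically, which the illustration below stubs out with an arbitrary fixed-size splitter; the class and names are hypothetical:

```python
# Minimal chunk-hash de-duplication store: identical chunks are detected by
# fingerprint and stored once; each file keeps only a recipe of fingerprints.

import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (stored once)
        self.files = {}    # filename -> ordered list of fingerprints

    def put(self, name, data, chunk_size=4):
        # Stand-in for semantic chunking: a plain fixed-size split.
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)   # duplicates cost nothing extra
            recipe.append(fp)
        self.files[name] = recipe

    def get(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])
```

    Semantic boundaries matter because a splitter aligned with the file's internal structure keeps equal content producing equal chunks across files, which is what makes the fingerprint lookup find the duplicates.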
  • Computation rotating for data reuse

    Page(s): 1 - 7

    Loop tiling is an effective loop transformation technique that tiles the iteration space of loop nests to improve data locality; appropriate data layout and transfer strategies are also important to assist it. This paper describes an approach to enhance data reuse and reduce off-chip memory accesses after loop tiling. Data tiles produced by loop tiling may have overlapping elements, which leads to larger data transfer costs but also provides an opportunity to exploit data reuse between tiles. Using our approach, we are able to reduce these unnecessary data transfers and improve performance compared to traditional pure loop tiling.

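    Loop tiling in its simplest form looks like the following generic example (a textbook tiled matrix product, not the paper's transformation): the i/j/k iteration space is walked tile by tile so that each data tile stays in fast memory while it is reused.

```python
# Tiled matrix multiplication: the three loops are split into tile loops
# (ii, kk, jj) and intra-tile loops, so each (tile x tile) block of a, b
# and c is reused while it is still resident in fast memory.

def tiled_matmul(a, b, tile=2):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Accumulate one block-product into the c tile.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

    The overlap problem the abstract mentions arises when neighbouring tiles need some of the same input elements; a pure tiling scheme fetches those elements once per tile, whereas an inter-tile reuse scheme transfers them only once.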
  • LogSPoTM: a scalable thread level speculation model based on transactional memory

    Page(s): 1 - 8

    Thread-level speculation (TLS) and transactional memory (TM) have both been proposed to address the problem of productivity in the multi-core era, and both require similar underlying support. In this paper, we propose a low-design-complexity approach to effective unified support for both TLS and TM by extending a scalable TM model, LogTM, to support TLS. A distributed hardware arbitration mechanism is also proposed to improve scalability. The method takes advantage of hardware resources introduced for TM, resulting in a simplified hardware design, and it provides the rich semantics of both TLS and TM to programmers. Five representative benchmarks are used to evaluate the performance characteristics of our TLS model under different memory access patterns, and the influence of design choices such as the interconnection is also studied. The evaluation shows that, despite its simplicity, the new system performs well on most of the benchmarks, with average region speedups of around 3.5 at four threads.

  • A gather/scatter hardware support for efficient Fast Fourier Transform

    Page(s): 1 - 8

    The increase in operating frequency of microprocessors has begun to meet more obstacles, and the performance of single-thread applications no longer benefits from a faster processor. As a result, performance increases have to come from additional hardware support that makes use of the large number of transistors available. This paper presents a novel hardware support called distTree to speed up processor performance. The distTree hardware automates gather and scatter operations for applications with complex but predictable memory access patterns, such as the fast Fourier transform (FFT). With this hardware support integrated into a modern microprocessor such as the Alpha, FFT performance can increase by over 100% compared with the FFTW library, a state-of-the-art implementation. The distTree hardware lets the processor spend the majority of its cycles executing the computation operations of an algorithm by reducing both arithmetic and address computation overhead; the performance of many single-thread applications can therefore be significantly increased.

  • Fast accurate rendering

    Page(s): 1 - 7

    High-quality (photo-realistic) rendering is computationally intense: farms of hundreds of servers take months to render movies with considerable special effects. However, considerable inherent parallelism means that rendering time may be reduced by implementing key routines in hardware. Profiling of Pixie, an open-source renderer, showed that ~95% of CPU cycles were used to calculate ray-triangle intersections. Implementing this routine on an FPGA showed speedups of 100, provided data could be fed to the ray-triangle pipeline fast enough. Available busses have insufficient bandwidth, so we developed an architecture with most of the rendering pipeline on the FPGA surface. A key component of this architecture is a cache for object data, which allows the system to render scenes of very high complexity (> 10^6 basic elements) using a usual memory hierarchy: bulk memory plus a paging disc. The object cache retains commonly used objects, reducing the load on the system (e.g. PCI) bus.

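    The hot spot profiled above is the classic ray-triangle intersection test. A standard software formulation of it is the Möller-Trumbore algorithm, shown below for reference; this is the textbook computation such a hardware pipeline evaluates, not the paper's FPGA design:

```python
# Moller-Trumbore ray-triangle intersection. Returns the distance t along
# the ray to the hit point, or None if the ray misses the triangle.

def ray_triangle(orig, direc, v0, v1, v2, eps=1e-9):
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    cross = lambda a, b: (a[1] * b[2] - a[2] * b[1],
                          a[2] * b[0] - a[0] * b[2],
                          a[0] * b[1] - a[1] * b[0])
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))

    e1, e2 = sub(v1, v0), sub(v2, v0)        # triangle edge vectors
    p = cross(direc, e2)
    det = dot(e1, p)
    if abs(det) < eps:                        # ray parallel to triangle plane
        return None
    t_vec = sub(orig, v0)
    u = dot(t_vec, p) / det                   # first barycentric coordinate
    if u < 0 or u > 1:
        return None
    q = cross(t_vec, e1)
    v = dot(direc, q) / det                   # second barycentric coordinate
    if v < 0 or u + v > 1:
        return None
    t = dot(e2, q) / det
    return t if t > eps else None
```

    The routine is a short, fixed sequence of multiply-adds and one division, which is exactly why it pipelines so well in hardware: throughput is limited by how fast triangle and ray data can be streamed in, the bandwidth problem the architecture above addresses.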
  • Mechanism for return stack and branch history corrections under misprediction in deep pipeline design

    Page(s): 1 - 8

    The return stack may be popped due to branch misprediction, corrupting its contents; meanwhile, erroneous branch history is recorded and used for upcoming branch predictions. These errors are more likely in deep pipelines, and their handling seriously affects performance. We study these issues and propose solutions with two virtues: low hardware overhead, and branch prediction accuracy comparable to that of a shallow pipeline design. To deal with return stack corruption, every push and pop after a mispredicted branch should be counted and recorded; these simple rules become complicated when multiple unresolved branches exist, any one of which may be erroneous, as is common in deeper pipelines. To deal with branch history contamination, extra history bits plus a branch confirmation pointer are needed. Experimental results show that our design is effective, with about 4%-9% performance improvement on MiBench.

  • Diva: A dataflow programming model and its runtime support in Java virtual machine

    Page(s): 1 - 8

    Microprocessors have turned to multicore designs: multiple processor cores, along with several levels of on-chip caches and interconnection networks, integrated on a single chip. This brings the challenge of programming these processors effectively and efficiently, which is known as the "wall". This paper proposes a systematic approach to attack this problem. We describe an extension of the Java programming language with the dataflow paradigm and transactional memory. Our approach alleviates the difficulties of parallel programming by providing a higher level of abstraction and relieving programmers of low-level threading and locking details. We also describe the design of a runtime system to support and optimize the extension. We have implemented a prototype based on the Apache Harmony DRL Virtual Machine. Preliminary experimental results on a 16-core SMP machine show that our approach achieves reasonable scalability and can adapt to the variance of available hardware resources.

  • A partial memory protection scheme for higher effective yield of embedded memory for video data

    Page(s): 1 - 6

    In the emerging SoC era, on-chip embedded memory will occupy most of the silicon real estate. As technology proceeds into the very deep submicron regime, the yield of SoCs will drop sharply, mainly because of on-chip memory failures; embedded memory is therefore becoming crucial to achieving higher chip yield. In this paper, we propose an error-resilient video data memory system architecture. The proposed scheme employs partial memory protection rather than traditional whole-memory protection. Our approach is based on the fact that video data memory need not be error-free: multimedia data has built-in redundancies by its own nature and tolerates partial data loss without serious quality degradation. With our approach we achieve 100% data memory yield while incurring a small power overhead, and we demonstrate its efficiency with an H.264 application at memory bit-error rates of up to 2.0%.

  • Parallelization of spectral clustering algorithm on multi-core processors and GPGPU

    Page(s): 1 - 8

    Spectral clustering is a widely used algorithm in the fields of information retrieval, data mining, machine learning and many others. It can cluster a large number of data points into several categories without requiring any additional information about the dataset or the categories, so that people can find information by category easily. In this paper, we parallelize the algorithm proposed by Andrew Y. Ng, Michael I. Jordan and Yair Weiss. We provide two implementations: one parallelized with OpenMP, the other programmed in NVIDIA CUDA (compute unified device architecture), the environment NVIDIA provides for programming its CUDA-enabled GPGPUs (general-purpose graphics processing units). In our experiments we achieve about a three-times speedup with OpenMP and around a ten-times speedup with CUDA.

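    For orientation, the Ng-Jordan-Weiss pipeline for k = 2 clusters is shown below as a tiny sequential baseline (our illustrative reduction of the algorithm, not the authors' code): Gaussian affinities, the normalized matrix D^-1/2 A D^-1/2, its second eigenvector obtained by power iteration, and a split on that eigenvector's sign (the one-dimensional degenerate case of the final k-means step). The affinity construction and the matrix-vector products are exactly the loops the paper parallelizes.

```python
# Spectral bipartition in the Ng-Jordan-Weiss style, pure Python, k = 2.

import math

def spectral_bipartition(points, sigma=1.0, iters=1000):
    n = len(points)
    # Gaussian affinity matrix with zero diagonal.
    a = [[0.0 if i == j else
          math.exp(-sum((x - y) ** 2 for x, y in zip(points[i], points[j]))
                   / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    d = [sum(row) for row in a]
    # Normalized matrix L = D^-1/2 A D^-1/2. Its top eigenvector is known to
    # be D^1/2 * 1, so deflate it and power-iterate for the second one.
    l = [[a[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    v1 = [math.sqrt(di) for di in d]
    norm = math.sqrt(sum(x * x for x in v1))
    v1 = [x / norm for x in v1]
    v = [(-1.0) ** i for i in range(n)]        # arbitrary start vector
    for _ in range(iters):
        c = sum(vi * wi for vi, wi in zip(v, v1))
        v = [vi - c * wi for vi, wi in zip(v, v1)]    # remove v1 component
        v = [sum(l[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    # For two well-separated clusters the sign of the second eigenvector
    # separates them, so k-means on one dimension reduces to a sign split.
    return [1 if x > 0 else 0 for x in v]
```

    Both hot loops (the n x n affinity computation and the repeated matrix-vector product) are embarrassingly parallel over rows, which is why OpenMP and CUDA map onto them so directly.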
  • MediaMem: A dynamically adjustable memory subsystem for high-bandwidth required multimedia SoC systems

    Page(s): 1 - 8

    With the continuous growth of multimedia functionality in modern portable consumer electronics, computer systems have to integrate multiple media processors on a single chip/system to provide better service. However, insufficient memory subsystem bandwidth makes the performance of the multimedia modules unsatisfactory. In this paper, we propose an innovative memory subsystem architecture aimed at extracting more potential memory access bandwidth to fulfill the requirements of multiple multimedia processors dynamically. The proposed architecture, called MediaMem, offers sufficient bandwidth to all attached multimedia processors through two novel scheduling mechanisms that dynamically adjust access grants, buffer sizes, and transfer sequences according to real-time conditions. Additionally, the memory interconnection is modified to avoid bus contention. The MediaMem architecture has been implemented in SystemC HDL; whole-system functional verification and performance evaluation have been carried out with CoWare ConvergenSC, and the experimental results are discussed.
