
Proceedings of the International Conference on Parallel Processing, 2002

Date: 21-21 Aug. 2002


Results 1 - 25 of 69
  • Proceedings International Conference on Parallel Processing

  • Author index

    Page(s): 631 - 632
  • Minimal sensor integrity in sensor grids

    Page(s): 567 - 571

    Given the increasing importance of optimal sensor deployment for battlefield strategists, the converse problem of reacting to a particular deployment by an enemy is equally significant, yet has not been addressed in a quantifiable manner in the literature. We address this issue by modeling a two-stage game in which the opponent deploys sensors to cover a sensor field and we attempt to maximally reduce his coverage at minimal cost. In this context, we introduce the concept of minimal sensor integrity, which measures the vulnerability of any sensor deployment. We find the best response by quantifying the merits of each response. While the problem of optimally deploying sensors subject to coverage constraints is NP-complete, in this paper we show that the best response (i.e., the maximum vulnerability) can be computed in polynomial time for sensors with arbitrary coverage capabilities deployed over points in any dimensional space. In the special case when sensor coverages form an interval graph (as in a linear grid), we describe a better O(min(M², NM)) dynamic programming algorithm.

  • Multithreaded isosurface rendering on SMPs using span-space buckets

    Page(s): 572 - 580

    We present in-core and out-of-core parallel techniques for isosurface rendering based on the notion of span-space buckets. Our in-core technique makes conservative use of RAM and is amenable to parallelization. The out-of-core variant keeps the amount of data read during the search to a minimum, visiting only the cells that intersect the isosurface. The out-of-core technique additionally minimizes disk I/O time through in-order seeking, interleaving data records on the disk, and overlapping computational and I/O threads. The overall isosurface rendering time achieved using our out-of-core span-space buckets is comparable to that of well-optimized in-core techniques that have enough RAM at their disposal to avoid thrashing. When RAM is limited, our out-of-core span-space buckets maintain their performance level while in-core algorithms either start to thrash or must sacrifice performance for a smaller memory footprint.
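
A toy version of a span-space bucket structure can make the search step concrete. This is our own minimal sketch (the names `build_buckets` and `query` and the uniform bucket width are assumptions, not the authors' code): each cell is summarized by the (min, max) of its scalar values, hashed into a coarse 2-D grid over span space, and an isovalue query visits only buckets that can hold cells straddling the isovalue.

```python
from collections import defaultdict

def build_buckets(cells, width):
    """cells: iterable of (cell_id, vmin, vmax); width: bucket edge length."""
    buckets = defaultdict(list)
    for cid, vmin, vmax in cells:
        # Bucket index is the quantized (min, max) pair of the cell.
        buckets[(int(vmin // width), int(vmax // width))].append((cid, vmin, vmax))
    return buckets

def query(buckets, width, iso):
    """Return ids of cells whose value range straddles the isovalue."""
    bi = int(iso // width)
    hits = []
    for (lo, hi), items in buckets.items():
        # Only buckets with lo <= bi <= hi can contain straddling cells.
        if lo <= bi <= hi:
            hits.extend(cid for cid, vmin, vmax in items if vmin <= iso <= vmax)
    return hits
```

In an in-core setting the candidate buckets can be scanned in parallel; out of core, each bucket maps naturally to a contiguous run of records on disk.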

  • Performance comparison of location areas and reporting centers under aggregate movement behavior mobility models

    Page(s): 445 - 452

    Location management deals with how to track mobile users within the cellular network. It consists of two basic operations: location update and paging. The total location management cost is the sum of the location update cost and the paging cost. Location areas and reporting centers are two popular location management schemes. The motivation for this study is the observation that the location update cost difference between the reporting centers scheme and the location areas scheme is small, whereas the paging cost in the reporting centers scheme is larger than that in the location areas scheme. The paper compares the performance of the location areas scheme and the reporting centers scheme under aggregate movement behavior mobility models via simulation. Simulation results show that the location areas scheme performs about the same as the reporting centers scheme in two extreme cases, that is, when a few cells or almost all cells are selected as reporting cells. However, the location areas scheme outperforms the reporting centers scheme at the 100% confidence level for all call-to-mobility ratios when the reporting cells divide the whole service area into several regions.

  • Self-adapting backfilling scheduling for parallel systems

    Page(s): 583 - 592

    We focus on non-FCFS job scheduling policies for parallel systems that allow jobs to backfill, i.e., to move ahead in the queue provided that they do not delay certain previously submitted jobs. Consistent with commercial schedulers that maintain multiple queues to which jobs are assigned according to their user-estimated duration, we propose a self-adapting backfilling policy that maintains multiple job queues to separate short from long jobs. The proposed policy adjusts its configuration parameters by continuously monitoring the system and quickly reacting to sudden fluctuations in the workload arrival pattern and/or severe changes in resource demands. Detailed performance comparisons via simulation, using actual supercomputing traces from the parallel workload archive, indicate that the proposed policy consistently outperforms traditional backfilling.
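
For contrast with the self-adapting multi-queue policy, the basic backfilling test itself is small enough to sketch. The function below is a generic EASY-style condition with invented parameter names, not the paper's algorithm: a waiting job may jump ahead only if it cannot delay the reservation held by the job at the head of the queue.

```python
def can_backfill(free_nodes, now, head_start_time, head_nodes_free_then,
                 job_nodes, job_est_runtime):
    """Generic EASY-backfill admission test (illustrative sketch only)."""
    if job_nodes > free_nodes:
        return False                              # not enough idle nodes right now
    if now + job_est_runtime <= head_start_time:
        return True                               # finishes before the head job starts
    # Otherwise it must leave the head job enough nodes at its reserved start.
    return job_nodes <= head_nodes_free_then
```

A multi-queue variant like the one proposed would additionally route each job to a short or long queue based on its user-estimated duration.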

  • Power aware scheduling for AND/OR graphs in multiprocessor real-time systems

    Page(s): 593 - 601

    Power-aware computing has become popular recently, and many techniques have been proposed to manage the energy consumption of traditional real-time applications. We have previously proposed (2001) two greedy slack-sharing scheduling algorithms for such applications on multiprocessor systems. In this paper, we are concerned mainly with real-time applications whose different execution paths consist of different numbers of tasks. The AND/OR graph model is used to represent the application's data dependence and control flow. The contribution of this paper is twofold. First, we extend our greedy slack-sharing algorithm for traditional applications to deal with applications represented by AND/OR graphs. Then, using statistical information about the applications, we propose a few variations of speculative scheduling algorithms that aim to save energy by reducing the number of speed changes (and thus the overhead) while ensuring that the applications meet their timing constraints. The performance of the algorithms is analyzed with respect to energy savings. The results obtained show that the greedy scheme is better than some speculative schemes and that the greedy scheme is good enough when a reasonable minimal speed exists in the system.
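
The slack-sharing step can be illustrated with the standard dynamic-voltage-scaling calculation (a generic sketch under assumed speed bounds, not the authors' exact algorithm): when a predecessor finishes early, the next task runs at the slowest speed that still meets its deadline.

```python
def next_speed(remaining_wcet, time_to_deadline, s_min=0.3, s_max=1.0):
    """Slowest normalized speed that still fits the work before the deadline,
    clamped to the platform's [s_min, s_max] range (bounds are assumptions)."""
    s = remaining_wcet / time_to_deadline
    return min(s_max, max(s_min, s))
```

A speculative variant would deliberately avoid changing this speed between tasks when the expected energy gain is smaller than the speed-change overhead.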

  • An online heuristic for data placement in computer systems with active disks

    Page(s): 219 - 226

    In this paper, an online heuristic is proposed and evaluated for managing dynamic memory in a computer system with active disks by physically colocating, in disk memory or main memory, the data pages being accessed by a computation slice. This enables a runtime system that can offload the corresponding computation slice to the appropriate processing unit at the disk or the host. A modified version of SEQUITUR, an online compression algorithm, is used to identify affinity among sets of pages in a virtual-memory page reference stream, and a page allocation and replacement policy is presented. Using a suite of data-access kernels as benchmarks, the sets of pages identified are shown to closely match the sets of pages referenced by computation slices. The paging policy is evaluated with page traces of microbenchmarks and real applications. In memory-constrained environments with additional memory at the disk, most of the benchmarks see improved performance due to fewer page faults. The paging heuristic can colocate 50% of the affinity sets on average and can offload up to 100% of the computation to the disk.

  • Computational geometry on the OTIS-Mesh optoelectronic computer

    Page(s): 501 - 507

    We develop efficient algorithms for problems in computational geometry (convex hull, smallest enclosing box, ECDF, two-set dominance, maximal points, all-nearest neighbor, and closest pair) on the OTIS-Mesh optoelectronic computer. We also demonstrate algorithms for computing convex hull and prefix sum with condition on a multi-dimensional mesh, which are used to compute convex hull and ECDF, respectively. We show that all these problems can be solved in O(√N) time, even with N² inputs.

  • Hardware schemes for early register release

    Page(s): 5 - 13

    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, a register is conservatively released only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early-release hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.

  • Selective preemption strategies for parallel job scheduling

    Page(s): 602 - 610

    Although theoretical results have been established regarding the utility of preemptive scheduling in reducing average job turn-around time, job suspension/restart is not much used in practice at supercomputer centers for parallel job scheduling. A number of questions remain unanswered regarding the practical utility of preemptive scheduling. We explore this issue through a simulation-based study using job logs from a supercomputer center. We develop a tunable selective-suspension strategy and demonstrate its effectiveness. We also present new insights into the effect of preemptive scheduling on different job classes and address the impact of suspensions on worst-case slowdown.

  • Honey, I shrunk the Beowulf!

    Page(s): 141 - 148

    In this paper, we present a novel twist on the Beowulf cluster - the Bladed Beowulf. Designed by RLX Technologies and integrated and configured at Los Alamos National Laboratory, our Bladed Beowulf consists of compute nodes made from commodity off-the-shelf parts mounted on motherboard blades measuring 14.7" × 4.7" × 0.58". Each motherboard blade (node) contains a 633 MHz Transmeta TM5600™ CPU, 256 MB memory, a 10 GB hard disk, and three 100-Mb/s Fast Ethernet network interfaces. Using a chassis provided by RLX, twenty-four such nodes mount side-by-side in a vertical orientation to fit in a rack-mountable 3U space, i.e., 19" in width and 5.25" in height. A Bladed Beowulf can reduce the total cost of ownership (TCO) of a traditional Beowulf by a factor of three while providing Beowulf-like performance. Accordingly, rather than use the traditional definition of price-performance ratio, where price is the cost of acquisition, we introduce a new metric called ToPPeR: total price-performance ratio, where total price encompasses TCO. We also propose two related (but more concrete) metrics: performance-space ratio and performance-power ratio.
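
With made-up example numbers, the three proposed metrics reduce to simple ratios; the sketch below is ours, not code from the paper:

```python
def topper(tco_dollars, perf_gflops):
    """ToPPeR: total price-performance ratio, with TCO as the price."""
    return tco_dollars / perf_gflops      # dollars per Gflop/s

def perf_space(perf_gflops, rack_units):
    return perf_gflops / rack_units       # Gflop/s per rack unit (U)

def perf_power(perf_gflops, watts):
    return perf_gflops / watts            # Gflop/s per watt
```

The units (Gflop/s, rack units, watts) are illustrative; any consistent performance, space, and power measures work.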

  • An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems

    Page(s): 360 - 368

    In this paper, we investigate an efficient off-line scheduling algorithm in which real-time tasks with precedence constraints are executed in a heterogeneous environment. It provides more features and capabilities than existing algorithms that schedule only independent tasks in real-time homogeneous systems. In addition, the proposed algorithm takes the heterogeneities of computation, communication, and reliability into account, thereby improving reliability. To provide fault tolerance, the algorithm employs a primary-backup copy scheme that enables the system to tolerate permanent failures in any single processor. In this scheme, a backup copy is allowed to overlap with other backup copies on the same processor, as long as their corresponding primary copies are allocated to different processors. Tasks are judiciously allocated to processors so as to reduce the schedule length as well as the reliability cost, defined to be the product of processor failure rate and task execution time. In addition, the time for detecting and handling a permanent fault is incorporated into the scheduling scheme, thus making the algorithm more practical. To quantify the combined performance of fault tolerance and schedulability, the performability measure is introduced. Compared with the existing scheduling algorithms in the literature, our scheduling algorithm achieves an average of 16.4% improvement in reliability and an average of 49.3% improvement in performability.
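
The reliability-cost definition in the abstract translates directly into code; this sketch is our reading of it (assuming independent Poisson failures), not the paper's implementation:

```python
import math

def reliability_cost(failure_rate, exec_time):
    """Cost of one task-processor assignment: failure rate x execution time."""
    return failure_rate * exec_time

def schedule_reliability(assignments):
    """assignments: iterable of (failure_rate, exec_time) pairs.
    Reliability is exp(-total cost) under independent Poisson failures."""
    return math.exp(-sum(reliability_cost(r, t) for r, t in assignments))
```

Minimizing the summed reliability cost thus maximizes the schedule's reliability.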

  • The tracefile testbed - a community repository for identifying and retrieving HPC performance data

    Page(s): 177 - 184

    High-performance computing (HPC) programmers utilize tracefiles, which record program behavior in great detail, as the basis for many performance analysis activities. The lack of generally accessible tracefiles has forced programmers to develop their own testbeds in order to study the basic performance characteristics of the platforms they use. Since tracefiles serve as input to performance analysis and performance prediction tools, tool developers have also been hindered by the lack of a testbed for verifying and fine-tuning tool functionality. We created a community repository that meets the needs of both application and tool developers. In this paper, we describe how the tracefile testbed was designed to facilitate flexible searching and retrieval of tracefiles based on a variety of characteristics. Its Web-based interface provides a convenient mechanism for browsing, downloading, and uploading collections of tracefiles and tracefile segments, as well as viewing statistical summaries of performance characteristics.

  • Worst case analysis of a greedy multicast algorithm in k-ary n-cubes

    Page(s): 511 - 518

    In this paper, we consider the problem of multicasting a message in k-ary n-cubes under the store-and-forward model. The objective is to minimize the size of the resultant multicast tree while keeping the distance to each destination over the tree the same as the distance in the original graph. We first propose an algorithm that grows a multicast tree in a greedy manner, in the sense that for each intermediate vertex of the tree, the outgoing edges of the vertex are selected in non-increasing order of the number of destinations that can use the edge on a shortest path to the destination. We then evaluate the goodness of the algorithm in terms of the worst-case ratio of the size of the generated tree to the size of an optimal tree. It is proved that for any k ≥ 5 and n ≥ 6, the performance ratio of the greedy algorithm is c·kn - o(n) for some constant 1/12 ≤ c ≤ 1/2.
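
The greedy selection rule can be phrased graph-agnostically; the sketch below is our illustration, using plain BFS distances on a connected graph rather than the closed-form cube distances. It ranks a vertex's outgoing edges by how many destinations keep a shortest path through them.

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph given as {v: [neighbors]}."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def rank_edges(adj, u, dests):
    """Order u's neighbors by the number of destinations reachable through
    them on a shortest path (the greedy rule), most useful first.
    Assumes the graph is connected."""
    dist_u = bfs_dist(adj, u)
    dist_from = {v: bfs_dist(adj, v) for v in adj[u]}
    def usefulness(v):
        # Edge (u, v) is on a shortest path to d iff dist(v, d) == dist(u, d) - 1.
        return sum(1 for d in dests
                   if dist_from[v].get(d, float("inf")) == dist_u[d] - 1)
    return sorted(adj[u], key=usefulness, reverse=True)
```

In a k-ary n-cube the distances would be computed arithmetically instead of by BFS, but the ranking rule is the same.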

  • Design and evaluation of scalable switching fabrics for high-performance routers

    Page(s): 167 - 174

    This work considers switching fabrics with distributed packet routing to achieve high scalability and low cost. The considered switching fabrics are based on a multistage structure with different recirculation designs, where adjacent stages are interconnected in the indirect n-cube connection style. According to extensive simulation, they all compare favorably with an earlier multistage-based counterpart in terms of the performance measures of interest and hardware complexity. When queues are incorporated in the output ports of switching elements (SEs), the total number of stages required in our proposed fabrics to reach a given performance level can be reduced substantially. The performance of those fabrics with output queues is evaluated under different "speedups" of the queues, where the speedup is the ratio of the operating clock rate at the SE core to that over external links. Our simulation reveals that a small speedup of 2 is adequate for buffered switching fabrics comprising 4×8 SEs to deliver better performance than their unbuffered counterparts with 50% more stages of SEs, when the fabric size is 256. The buffered switching fabrics under our consideration are scalable and low-cost, making them ideally suited for constructing high-performance routers with large numbers of line cards.

  • A lower-bound algorithm for minimizing network communication in real-time systems

    Page(s): 343 - 351

    In this paper, we propose a pseudo-polynomial-time lower-bound algorithm for the problem of assigning and scheduling real-time tasks in a distributed system such that network communication is minimized. The key feature of our algorithm is translating the task assignment problem into the so-called k-cut problem on a graph, which is known to be solvable in polynomial time for fixed k. Experiments show that the lower bound computed by our algorithm is in fact optimal in up to 89% of the cases and increases the speed of an overall optimization algorithm by a factor of two on average.

  • A secure protocol for computing dot-products in clustered and distributed environments

    Page(s): 379 - 384

    Dot-products form the basis of various applications ranging from scientific computations to commercial applications in data mining and transaction processing. Typical scientific computations utilizing sparse iterative solvers use repeated matrix-vector products. These can be viewed as dot-products of sparse vectors. In database applications, dot-products take the form of counting operations. With the widespread use of clustered and distributed platforms, these operations are increasingly being performed across networked hosts. Traditional APIs for messaging are susceptible to sniffing, and the data being transferred between hosts is often enough to compromise the entire computation. Due to the large computational requirements of the underlying applications, it is highly desirable that secure protocols add minimal overhead to the original algorithm. Finally, by its very nature, a dot-product leaks limited amounts of information: one of the parties can recover an entry of the other party's vector simply by probing it with a vector that has a 1 in a particular location and zeros elsewhere. We present an extremely efficient and sufficiently secure protocol for computing the dot-product of two vectors using linear algebraic techniques. Using analytical as well as experimental results, we demonstrate superior performance in terms of computational overhead, numerical stability, and security. We show that the overhead of a two-party dot-product computation using MPI as the messaging API across two high-end workstations connected via Gigabit Ethernet approaches a multiple of 4.69 over an unsecured dot-product. We also show that the average relative error in dot-products across a large number of random (normalized) vectors was roughly 4.5 × 10⁻⁹.
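
The leak mentioned in the abstract is easy to make concrete; this toy code is ours (it demonstrates the probing attack, not the paper's protocol):

```python
def dot(x, y):
    """Plain dot-product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def probe(secret, i):
    """Recover secret[i] by probing with the i-th unit vector."""
    e = [0.0] * len(secret)
    e[i] = 1.0
    return dot(secret, e)
```

Any protocol that returns exact dot-products therefore discloses one coordinate per unit-vector query; "sufficiently secure" here means bounding what leaks beyond this inherent minimum.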

  • Randomized broadcast channel access algorithms for ad hoc networks

    Page(s): 151 - 158

    The problem of broadcast channel access in single-hop and multihop ad hoc networks is considered. Two novel randomized and distributed channel access algorithms are developed and analyzed for single-hop and multihop networks, respectively. These algorithms are designed based on maximizing the worst-case channel efficiency, by optimizing some key parameters, including the backoff probability distribution and slot length. The proposed algorithm for single-hop networks offers significantly higher throughput than the CSMA methods when the traffic load is heavy, while still achieving good performance when the load is light or medium. The proposed algorithm for multihop networks can flexibly adapt to the traffic load, and offers much better throughput performance than the existing broadcast scheduling algorithms in light or medium load.
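
The flavor of slotted randomized access can be shown in a few lines. This is a generic illustration only: the paper's contribution is optimizing the backoff distribution and slot length, which this uniform-choice sketch does not do.

```python
import random

def successful_senders(n_contenders, n_slots, rng=random):
    """Each contender picks a slot uniformly at random; a slot chosen by
    exactly one contender is a successful (collision-free) transmission."""
    picks = [rng.randrange(n_slots) for _ in range(n_contenders)]
    return [i for i, p in enumerate(picks) if picks.count(p) == 1]
```

Skewing the slot distribution and tuning the slot length, as the paper does, trades collision probability against idle time under a worst-case load.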

  • Efficient global object space support for distributed JVM on cluster

    Page(s): 371 - 378

    We present the design of a global object space in a distributed Java Virtual Machine that supports parallel execution of a multi-threaded Java program on a cluster of computers. The global object space virtualizes a single Java object heap across machine boundaries to facilitate transparent object accesses. Based on object connectivity information available at runtime, objects reachable from threads at different nodes, called distributed-shared objects, are detected. With the detection of distributed-shared objects, we can alleviate the overhead of maintaining memory consistency within the global object space. Several runtime optimization methods have been incorporated in the global object space design, including an object home migration method that reallocates the home of a distributed-shared object, synchronized method migration that allows the remote execution of a synchronized method at the home node of its synchronized object, and object pushing that uses object connectivity information to improve access locality.

  • EMPOWER: a scalable framework for network emulation

    Page(s): 185 - 192

    The development and implementation of new network protocols and applications need accurate, scalable, reconfigurable, and inexpensive tools for debugging, testing, performance tuning, and evaluation. Network emulation provides a fully controllable laboratory network environment in which protocols and applications can be evaluated against predefined network conditions and traffic dynamics. In this paper, we present a new network emulation framework, EMPOWER. EMPOWER generates a network model from information about an emulated network and then maps the model to an emulation configuration in the EMPOWER laboratory network environment. It is highly scalable, not only because the number of emulator nodes may be increased without significantly increasing the emulation time or worrying about parallel simulation, but also because the network mapping scheme allows flexible port aggregation and derivation. By dynamically configuring a virtual device, effects such as link bandwidth, packet delay, packet loss rate, and out-of-order delivery can be emulated.

  • Introducing SCSI-to-IP cache for storage area networks

    Page(s): 203 - 210

    Data storage plays an essential role in today's fast-growing data-intensive network services. iSCSI is one of the most recent standards that allow SCSI protocols to be carried over IP networks. However, the disparities between SCSI and IP prevent fast and efficient deployment of SANs (storage area networks) over IP. This paper introduces STICS (SCSI-To-IP cache storage), a novel storage architecture that couples reliable, high-speed data caching with low-overhead conversion between the SCSI and IP protocols. Through an efficient caching algorithm and localization of certain unnecessary protocol overheads, STICS significantly improves performance over the current iSCSI system. Furthermore, STICS can be used as a basic plug-and-play building block for data storage over IP. We have implemented a software STICS prototype on the Linux operating system. Numerical results using the popular PostMark benchmark program and EMC's trace have shown dramatic performance gains over the current iSCSI implementation.

  • A best-effort communication protocol for real-time broadcast networks

    Page(s): 519 - 526

    In this paper, we present a best-effort communication protocol, called ABA, that seeks to maximize the aggregate application benefit and deadline-satisfied ratio of asynchronous real-time distributed systems that use CSMA/DDCR broadcast networks. ABA considers an application model where end-to-end timeliness requirements of trans-node application tasks are expressed using Jensen's benefit functions. Furthermore, the protocol assumes that the application is designed using CSMA/DDCR feasibility conditions driven by the best possible design-time estimate of upper bounds on message arrival densities. When such design-time postulations are violated at run-time, ABA transmits only those messages that will increase the application's aggregate benefit, buffering the others until the workloads again respect their design-time postulated values. To study the performance of ABA, we use a previously studied algorithm called RBA* as a baseline. Our experimental results indicate that ABA yields higher aggregate benefit and a higher deadline-satisfied ratio than RBA* when message arrival densities increase at faster rates than, or at the same rates as, process execution latencies due to the dynamics of the workload.

  • Analysis of memory hierarchy performance of block data layout

    Page(s): 35 - 44

    Recently, several experimental studies have been conducted on block data layout as a data transformation technique used in conjunction with tiling to improve cache performance. We provide a theoretical analysis for the TLB and cache performance of block data layout. For standard matrix access patterns, we derive an asymptotic lower bound on the number of TLB misses for any data layout and show that block data layout achieves this bound. We show that block data layout improves TLB misses by a factor of O(B) compared with conventional data layouts, where B is the block size of block data layout. This reduction contributes to the improvement in memory hierarchy performance. Using our TLB and cache analysis, we also discuss the impact of block size on the overall memory hierarchy performance. These results are validated through simulations and experiments on state-of-the-art platforms.
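
For readers unfamiliar with block data layout, the addressing it implies is compact enough to state exactly. This sketch is our simplification, assuming square B×B tiles stored row-major with B dividing N:

```python
def block_offset(i, j, n, b):
    """Linear offset of element (i, j) of an n x n array stored as b x b
    row-major tiles, elements row-major within each tile (b must divide n)."""
    tiles_per_row = n // b
    tile = (i // b) * tiles_per_row + (j // b)   # which tile holds (i, j)
    within = (i % b) * b + (j % b)               # position inside that tile
    return tile * b * b + within
```

Because a whole tile occupies b² consecutive elements, a tiled computation touches far fewer pages per tile than under a conventional row-major layout, which is the intuition behind the O(B) TLB improvement analyzed above.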
