
International Parallel and Distributed Processing Symposium, 2003. Proceedings.

Date: 22-26 April 2003


Displaying Results 1 - 25 of 447
  • New dynamic heuristics in the client-agent-server model

    MCT is a widely used heuristic for scheduling tasks onto Grid platforms. However, when dealing with many tasks, MCT tends to dramatically delay the completion times of already-mapped tasks when scheduling a new task. In this paper we propose heuristics based on two features: a historical trace manager that simulates the environment, and a perturbation measure that defines the impact a newly allocated task has on already-mapped tasks. Our simulations and experiments in a real environment show that the proposed heuristics outperform MCT.

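    The MCT policy this paper improves on is simple enough to sketch. A minimal illustration, assuming homogeneous machines and known task run times (real MCT uses a per-machine expected-time-to-compute matrix); the function name and task values here are hypothetical:

    ```python
    def mct_schedule(tasks, n_machines):
        """Assign each task (given as an estimated run time) to the
        machine whose current ready time yields the earliest completion."""
        ready = [0.0] * n_machines          # time each machine becomes free
        mapping = []                        # (task_index, machine, completion)
        for i, run_time in enumerate(tasks):
            # completion time on each machine if the task were mapped there
            completions = [ready[m] + run_time for m in range(n_machines)]
            best = min(range(n_machines), key=lambda m: completions[m])
            ready[best] = completions[best]
            mapping.append((i, best, ready[best]))
        return mapping, ready

    mapping, ready = mct_schedule([3.0, 2.0, 4.0, 1.0], n_machines=2)
    ```

    The greedy per-task choice is what lets a stream of new arrivals keep pushing back the finish times of earlier tasks, which is the delay effect the proposed heuristics try to bound.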
  • Parallel vision processing and dedicated parallel architectures

    This paper presents an overview of the different processing paradigms inherent to image and vision processing, and of the hardware support for them. It identifies some still-open questions in parallel image/vision processing. The paper is organized into four parts. Part one discusses three main problems inherent to image and vision processing: processing-flux parallelism, data parallelism, and application constraints. Part two gives an overview of hardware support for flux (or control) parallelism in vision processing. Part three discusses the data-parallelism concept and the architectural support for it in existing vision and image architectures. The presentation concludes with a discussion of future requirements for processors. Our conviction is that future microprocessors, thanks to technology advances, will progressively integrate the structures of today's dedicated parallel computers.

  • Improved methods for divisible load distribution on k-dimensional meshes using pipelined communications

    We give closed-form solutions for the parallel time and speedup of the classic method for processing divisible loads on linear arrays as functions of N, the network size. We propose two methods which employ pipelined communications to distribute divisible loads on linear arrays. We derive closed-form solutions for the parallel time and speedup of both methods and show that the asymptotic speedup of both methods is β+1, where β is the ratio of the time for computing a unit load to the time for communicating a unit load. Such performance is even better than that of the known methods on k-dimensional meshes with k>1. The two new algorithms which use pipelined communications are generalized to distribute divisible loads on k-dimensional meshes, and we show that the asymptotic speedup of both algorithms is kβ+1, where k≥1. We also prove that on k-dimensional meshes with k≥1, as the network size becomes large, an asymptotic speedup of 2kβ+1 can be achieved for processing divisible loads by using interior initial processors.

  • Cache pollution in Web proxy servers

    Caching has been used for decades as an effective performance-enhancing technique in computer systems. The Least Recently Used (LRU) cache replacement algorithm is a simple and widely used scheme. Proxy caching is a common approach to reducing network traffic and delay in many World Wide Web (WWW) applications. However, some characteristics of WWW workloads make LRU less attractive for proxy caching. In recent years, several more efficient replacement algorithms have been suggested, but these advanced algorithms require considerable knowledge about the workloads and are generally difficult to implement. The main attraction of LRU is its simplicity. In this paper we present two modified LRU algorithms and compare their performance with LRU. Our results indicate that the performance of the LRU algorithm can be improved substantially with very simple modifications.

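    The abstract does not spell out the two modifications, but the LRU baseline they start from can be sketched in a few lines (the class name and access trace are illustrative):

    ```python
    from collections import OrderedDict

    class LRUCache:
        """Plain LRU replacement: on a hit the entry moves to the
        most-recently-used end; on a miss the LRU end is evicted."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.store = OrderedDict()
            self.hits = self.misses = 0

        def access(self, key):
            if key in self.store:
                self.store.move_to_end(key)     # refresh recency
                self.hits += 1
                return True
            self.misses += 1
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict least recently used
            self.store[key] = True
            return False

    cache = LRUCache(2)
    trace = ["a", "b", "a", "c", "b"]   # "b" is evicted when "c" arrives
    results = [cache.access(k) for k in trace]
    ```

    The one-touch-per-object pattern typical of Web proxy workloads is exactly what makes this pure recency ordering a poor fit, motivating the modified variants the paper evaluates.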
  • Dynamic mapping in a heterogeneous environment with tasks having priorities and multiple deadlines

    In a distributed heterogeneous computing system, the resources have different capabilities and tasks have different requirements. To maximize the performance of the system, it is essential to assign resources to tasks (match) and order the execution of tasks on each resource (schedule) in a manner that exploits the heterogeneity of the resources and tasks. The mapping (defined as matching and scheduling) of tasks onto machines with varied computational capabilities has been shown, in general, to be an NP-complete problem; therefore, heuristic techniques are required to find a near-optimal solution to this mapping problem. Dynamic mapping is performed when the arrival of tasks is not known a priori. In the heterogeneous environment considered in this study, tasks arrive randomly, tasks are independent (i.e., there is no communication among tasks), and tasks have priorities and multiple deadlines. This research proposes, evaluates, and compares eight dynamic heuristics. The performance of the best heuristic is 83% of an upper bound.

  • Parallel and distributed computing for an adaptive visual object retrieval system

    Computer vision and image processing have always been active research domains that require an enormous computational effort. In this paper we present the architecture of a modular object retrieval system. It is based on a dataflow concept which allows flexible adaptation to different tasks. This concept facilitates parallel processing as well as distributed computing. We also present a dynamic load-balancing service for heterogeneous environments that has been integrated to improve system performance. First experiments show that the developed balancer performs better than standard balancing techniques in this environment.

  • On optimal hierarchical configuration of distributed systems on mesh and hypercube

    This paper studies the hierarchical configuration of distributed systems for achieving optimized system performance. A distributed system consists of a collection of local processes which are distributed over a network of processors and cooperate to perform some function. A hierarchical approach groups and organizes the distributed processes into a logical hierarchy of multiple levels, so as to coordinate the local computation/control activities and improve overall system performance. It has been proposed as an effective way to solve various problems in distributed computing, such as distributed monitoring, resource scheduling, and network routing.

  • Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation

    Clustering, or unsupervised classification, has many uses in fields that depend on grouping results from large amounts of data, an example being N-body cosmological simulation in astrophysics. In this paper, we study a particular clustering algorithm used in astrophysics, called HOP, and present a parallel implementation to speed up its current sequential implementation. Our approach first builds, in parallel, the spatial-domain hierarchical data structure: a three-dimensional KD tree. Using a KD tree, the core of the HOP algorithm, which searches for the highest-density neighbor, can be performed using only subsets of the particles, and hence the communication cost is reduced. We evaluate our implementation using data sets from a production cosmological application. The experimental results demonstrate up to 24× speedup using 64 processors on three parallel processing machines.

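    The "hop to the highest-density neighbor" step at the core of HOP can be sketched brute-force; the paper's KD tree exists precisely to avoid the O(n²) neighbor search used below. Function name, density estimate (a simple neighbor count within a radius), and the sample points are all illustrative:

    ```python
    def hop_groups(points, radius):
        """Brute-force sketch of HOP's core step: density = neighbor count
        within `radius`; each point hops to its densest neighbor (self
        included); chains ending at the same local maximum form one group."""
        def dist2(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        n = len(points)
        neighbors = [[j for j in range(n)
                      if dist2(points[i], points[j]) <= radius ** 2]
                     for i in range(n)]
        density = [len(nb) for nb in neighbors]
        # hop target: the neighbor with the highest density (ties -> lowest index)
        target = [max(nb, key=lambda j: (density[j], -j)) for nb in neighbors]

        def root(i):
            while target[i] != i:
                i = target[i]
            return i

        return [root(i) for i in range(n)]

    # two well-separated 1-D clumps should form two groups
    labels = hop_groups([(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)], radius=0.5)
    ```

    Because each hop only consults a point's local neighborhood, the search parallelizes over subsets of particles, which is what keeps the communication cost low in the paper's implementation.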
  • Simulation of meshes with separable buses by meshes with multiple partitioned buses

    This paper studies the simulation of meshes with separable buses (MSB) by meshes with multiple partitioned buses (MMPB). The MSB and the MMPB are mesh-connected computers enhanced by the addition of broadcasting buses along every row and column. The broadcasting buses of the MSB, called separable buses, can be dynamically sectioned into smaller bus segments by program control, while those of the MMPB, called partitioned buses, are statically partitioned in advance. In the MSB model, each row/column has only one separable bus, while in the MMPB model, each row/column has L partitioned buses (L ≥ 2). We consider the simulation and the scaling-simulation of the MSB by the MMPB, and show that the MMPB of size n × n can simulate the MSB of size n × n in O(n^(1/(2L))) steps, and that the MMPB of size m × m can simulate the MSB of size n × n in O((n/m)(n/m + m^(1/(2L)))) steps (m < n). The latter result implies that the MMPB of size m × m can simulate the MSB of size n × n time-optimally when m ≤ n^α holds for α = 1/(1 + 1/(2L)).

  • Implementing a scalable ASC processor

    Previous papers (Walker et al. (2001); Wu et al. (2002)) have described our implementation of a small prototype processor and control unit for associative computing, called the ASC processor. That initial prototype was implemented on an Altera education board using an Altera FLEX 10K FPGA, and was limited to an unrealistic 4 processing elements (PEs). This paper describes a more complete implementation: a scalable ASC processor that can scale up to 52 PEs on an Altera APEX 20KE board, or further on larger FPGAs. This paper also proposes extensions to support multiple control units and control parallelism.

  • Optimal algorithms for scheduling divisible workloads on heterogeneous systems

    In this paper, we discuss several algorithms for scheduling divisible loads on heterogeneous systems. Our main contributions are (i) new optimality results for single-round algorithms and (ii) the design of an asymptotically optimal multi-round algorithm. This multi-round algorithm automatically performs resource selection, a difficult task that was previously left to the user. Because it is periodic, it is simpler to implement and more robust to changes in the speeds of the processors and/or communication links. On the theoretical side, to the best of our knowledge, this is the first published result assessing the absolute performance of a multi-round algorithm. On the practical side, extensive simulations reveal that our multi-round algorithm outperforms existing solutions on a large variety of platforms, especially when the communication-to-computation ratio is not very high (the difficult case).

  • Use of the parallel port to measure MPI intertask communication costs in COTS PC clusters

    Performance analysis of system time parameters is important for the development of parallel and distributed programs because it provides a means of estimating program execution times and aids in scheduling tasks on processors. Measuring time intervals between events occurring in different nodes of COTS clusters of workstations is not a trivial task due to the absence of a unified clock view. We propose a different approach to measuring system time parameters and program performance in clusters, with the aid of the parallel port present in every machine of a COTS cluster. Some experimental values of communication delays using the MPI library in a Linux PC cluster are presented, and the efficiency and precision of the proposed mechanism are analyzed.

  • Using incorrect speculation to prefetch data in a concurrent multithreaded processor

    Concurrent multithreaded architectures exploit both instruction-level and thread-level parallelism through a combination of branch prediction and thread-level control speculation. The resulting speculative issuing of load instructions in these architectures can significantly impact the performance of the memory hierarchy as the system exploits higher degrees of parallelism. In this study, we investigate the effects of executing the mispredicted load instructions on the cache performance of a scalable multithreaded architecture. We show that the execution of loads from the wrongly-predicted branch path within a thread, or from a wrongly forked thread, can result in an indirect prefetching effect for later correctly-executed paths. By continuing to execute the mispredicted load instructions even after the instruction- or thread-level control speculation is known to be incorrect, the cache misses for the correctly predicted paths and threads can be reduced, typically by 42-73%. We introduce the small, fully-associative Wrong Execution Cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture up to 18.5% on the benchmark programs tested, with an average improvement of 9.7%, due to the reductions in the number of cache misses.

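    The WEC idea can be illustrated with a toy two-structure cache model: wrong-path loads fill a small fully-associative side cache rather than the main cache, so they can prefetch without polluting it, and a later correct-path load can hit in either structure. This is a deliberately simplified sketch, not the paper's simulated microarchitecture; all names and sizes are made up:

    ```python
    class WrongExecutionCache:
        """Toy model of the WEC concept: wrong-path loads fill a small
        fully-associative side cache instead of the main (direct-mapped)
        cache, giving a prefetch effect without main-cache pollution."""
        def __init__(self, main_lines, wec_lines):
            self.main = {}            # direct-mapped: index -> tag
            self.main_lines = main_lines
            self.wec = []             # fully associative, FIFO eviction
            self.wec_lines = wec_lines
            self.hits = self.misses = 0

        def load(self, addr, wrong_path=False):
            index, tag = addr % self.main_lines, addr // self.main_lines
            if self.main.get(index) == tag or addr in self.wec:
                self.hits += 1
                return True
            self.misses += 1
            if wrong_path:
                self.wec.append(addr)            # fill the side cache only
                if len(self.wec) > self.wec_lines:
                    self.wec.pop(0)
            else:
                self.main[index] = tag           # normal fill
            return False

    c = WrongExecutionCache(main_lines=4, wec_lines=2)
    c.load(20, wrong_path=True)    # speculative miss fills the WEC
    hit = c.load(20)               # correct path now hits: prefetch effect
    ```

    Keeping the wrong-path fills out of the main array is what turns mispredicted loads from a pollution source into a cheap prefetcher, which is the effect the paper quantifies.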
  • The feelfem system: a repository system for the finite element method

    We have developed a finite element method (FEM) software repository tool named feelfem that serves as a code generator. One important feature of feelfem is that it is designed to generate various program models of FEM analysis, including users' own newly developed numerical schemes. Another feature is that interfaces to newly developed parallel programming paradigms and parallel solvers can easily be added to it. Software reuse is an important target of the feelfem system. To achieve flexibility and expandability for the system, we adopt an object-oriented technique and implementation-oriented pseudo-code representation of numerical algorithms. In its latest released version, feelfem has strong interaction with the personal pre/post processor GiD. By using a combination of feelfem and GiD, users can generate prototype parallel FEM applications with newly developed solvers very easily and quickly.

  • Are we really ready for the breakthrough? [morphware]

    The paper outlines the key issues and fundamentals of morphware as a discipline of its own and as part of modern computing sciences. It discusses what is needed for the breakthrough.

  • Simulation of dynamic data replication strategies in Data Grids

    Data Grids provide geographically distributed resources for large-scale data-intensive applications that generate large data sets. However, ensuring efficient access to such huge and widely distributed data is hindered by the high latencies of the Internet. We address these challenges by employing intelligent replication and caching of objects at strategic locations. In our approach, replication decisions are based on a cost-estimation model and driven by the estimation of the data access gains and the replica's creation and maintenance costs. These costs are in turn based on factors such as runtime accumulated read/write statistics, network latency, bandwidth, and replica size. To support large numbers of users who continuously change their data and processing needs, we introduce scalable replica distribution topologies that adapt replica placement to meet these needs. In this paper we present the design of our dynamic memory middleware and replication algorithm. To evaluate the performance of our approach, we developed a Data Grid simulator, called the GridNet. Simulation results demonstrate that replication improves the data access time in Data Grids, and that the gain increases with the size of the datasets involved.

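    The decision rule implied by this cost model — replicate when the estimated access gain exceeds the replica's creation and maintenance costs — can be sketched as follows. This is a hypothetical instance of such a model, not the paper's actual formulation; the function name, cost terms, and weights are all illustrative:

    ```python
    def should_replicate(reads, writes, latency_s, bandwidth_MBps,
                         replica_MB, maint_cost_per_write_s):
        """Hypothetical cost-model decision: replicate when the estimated
        access-time gain outweighs creation plus maintenance costs."""
        # gain: each remote read avoided saves the network latency plus
        # the transfer time of the object
        per_read_gain = latency_s + replica_MB / bandwidth_MBps
        gain = reads * per_read_gain
        # cost: one-time transfer to create the replica, plus keeping it
        # consistent on every write
        cost = replica_MB / bandwidth_MBps + writes * maint_cost_per_write_s
        return gain > cost

    decision = should_replicate(reads=100, writes=5, latency_s=0.05,
                                bandwidth_MBps=10.0, replica_MB=50.0,
                                maint_cost_per_write_s=2.0)
    ```

    Feeding the rule with runtime-accumulated read/write statistics, as the abstract describes, lets the decision adapt as access patterns shift: read-heavy objects get replicated, write-heavy ones do not.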
  • Novel algorithms for open-loop and closed-loop scheduling of real-time tasks in multiprocessor systems based on execution time estimation

    Most dynamic real-time scheduling algorithms are open-loop in nature, meaning that they do not dynamically adjust their behavior based on run-time performance. When accurate workload models are not available, such scheduling can result in a highly underutilized system due to an extremely pessimistic estimation of the workload. In recent years, "closed-loop" scheduling has been gaining importance due to its applicability to many real-world problems wherein feedback information can be exploited efficiently to adjust system parameters, thereby improving performance. In this paper, we first propose an open-loop dynamic scheduling algorithm that employs overlap in order to provide flexibility in task execution times. Secondly, we propose a novel closed-loop approach for dynamically estimating the execution time of tasks based on both the deadline miss ratio and the task rejection ratio. This approach is highly preferable for firm real-time systems since it provides a firm performance guarantee. We evaluate the performance of the open-loop and closed-loop approaches by simulation and modeling. Our studies show that closed-loop scheduling offers significantly better performance (a 20% gain) over open-loop scheduling under all the relevant conditions we simulated.

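    The closed-loop idea — steering the execution-time estimate with the two observed ratios — can be sketched as a simple proportional feedback step. The update rule, gains, and targets below are made-up illustration values, not the paper's controller:

    ```python
    def update_estimate(estimate, miss_ratio, reject_ratio,
                        k_miss=0.5, k_reject=0.5, target_reject=0.1):
        """Hypothetical proportional feedback step: deadline misses mean
        the execution-time estimate was too low (scale it up); excess
        rejections mean it was too high (scale it down)."""
        error = k_miss * miss_ratio - k_reject * (reject_ratio - target_reject)
        return estimate * (1.0 + error)

    e_up = update_estimate(10.0, miss_ratio=0.2, reject_ratio=0.1)    # grows
    e_down = update_estimate(10.0, miss_ratio=0.0, reject_ratio=0.4)  # shrinks
    ```

    Using both signals matters: reacting to misses alone would inflate the estimate until the admission test rejects too many tasks, while the rejection-ratio term pushes back toward higher utilization.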
  • CREC: a novel reconfigurable computing design methodology

    The main research done in the field of reconfigurable computing has been oriented towards applications involving low-granularity operations and high intrinsic parallelism. CREC is an original, low-cost general-purpose reconfigurable computer whose architecture is generated through a hardware/software codesign process. The main idea of the CREC system is to generate the hardware architecture best suited to the execution of each software application. The CREC parallel compiler parses the source code and generates the hardware architecture, based on multiple execution units. The hardware architecture is described in automatically generated VHDL code. Finally, CREC is implemented in an FPGA device. The great flexibility offered by the general-purpose CREC system makes it interesting for a wide class of applications that mainly involve high intrinsic parallelism, but also for other kinds of computations.

  • Using Java for plasma PIC simulations

    Plasma particle-in-cell (PIC) simulations model the interactions of charged particles with the surrounding fields. This application has been recognized as one of the grand-challenge problems facing the high-performance computing community due to its huge computational requirements. Recently, with the explosive development of the Internet, Java has been receiving increasing attention and is regarded as a potential candidate for high-performance computing. In this paper, we present our approach to developing 2- and 3-dimensional parallel PIC simulations in Java. We also report the execution times of both versions from performance experiments on a symmetric multiprocessor (Sun E6500) and a Linux cluster of Pentium III machines. These results are also compared with benchmark measurements of the corresponding Fortran version of the same algorithm.

  • Mesh partitioning: a multilevel ant-colony-optimization algorithm

    Mesh partitioning is an important problem that has extensive applications in many areas. Multilevel algorithms are a successful class of optimization techniques which address the mesh partitioning problem. In this paper we present an enhancement of the technique that uses a nature-inspired metaheuristic to achieve higher-quality partitions. We apply and study multilevel ant-colony optimization (MACO), a relatively new metaheuristic search technique for solving optimization problems. The MACO algorithm performed very well and is superior to the classical k-METIS and Chaco algorithms. Furthermore, it is even comparable to the combined evolutionary/multilevel scheme used in the JOSTLE evolutionary algorithm. Our MACO algorithm also returned some solutions that are better than the currently available solutions in the graph partitioning archive.

  • Robust scheduling in team-robotics

    In most cooperating teams of robots, each robot has about the same set of sensors. Distributed sensor fusion is a technique that enables a team to take advantage of this redundancy to get a more complete view of the world with a better quality of the provided information. This paper sketches a fusion algorithm for laser-scanner data and derives the requirements that the execution of this algorithm places on the underlying system infrastructure, especially CPU scheduling. It shows that a scheduling algorithm is needed that fulfills timing guarantees without using worst-case execution times (WCET). The time-aware fault-tolerant (TAFT) scheduler provides this feature: each execution entity is divided into a MainPart, with possibly unknown timing behavior, and an ExceptionPart, with known execution time. The integrated scheduling of both parts is done by a combination of two earliest-deadline scheduling strategies: one focuses on enhancing CPU utilization and the other on guaranteeing timely execution. The paper discusses the proposed scheduling strategy, briefly describes its implementation in a real-time OS, and presents results showing that the achieved real-time behavior yields an increased acceptance rate, a higher throughput, and graceful degradation in transient overload situations compared to standard schedulers.

  • A computational strategy for the solution of large linear inverse problems in geophysics

    This paper discusses the use of singular values and singular vectors in the solution of large inverse problems that arise in the study of physical models for the internal structure of the Earth. In this study, the Earth is discretized into layers and the layers into cells, and travel times of sound waves generated by earthquakes are used to construct the corresponding physical models. The underlying numerical models lead to sparse matrices with dimensions up to 1.3×10^6-by-3×10^5. Singular values and singular vectors of these matrices are then computed and used in the solution of the associated inverse problems and also to estimate uncertainties. The paper outlines the formulation adopted to model the Earth and the strategy employed to compute singular values and singular vectors, shows results for two models that have been studied, comments on the main computational issues related to the solution of these problems on high-performance parallel computers, and discusses future improvements of the adopted computational strategy.

  • Algorithmic concept recognition support for skeleton based parallel programming

    Parallel skeletons have been proposed as a possible programming model for parallel architectures. One problem with this approach is choosing the skeleton best suited to the characteristics of the algorithm/program to be developed or parallelized, and of the target architecture, in terms of the performance of the parallel implementation. Another problem arising with the parallelization of legacy codes is minimizing the effort needed for program comprehension, and thus achieving the minimum restructuring of the sequential code when producing the parallel version. In this paper we propose automated program comprehension at the algorithmic level as a driving feature in the task of selecting the proper parallel skeleton, best suited to the characteristics of the algorithm/program and of the target architecture. Algorithmic concept recognition can automate or support the generation of parallel code through instantiation of the selected parallel skeleton(s) with template-based transformations of recognized code segments.

  • Hard real-time programming is different

    Page(s): 117 - 118