IEEE First International Conference on Algorithms and Architectures for Parallel Processing (ICA³PP 95 / ICAPP 95), 1995

Date 19-21 April 1995

Displaying Results 1 - 25 of 72
  • Benchmarking parallel simulation algorithms

    Publication Year: 1995 , Page(s): 611 - 620 vol.2
    Cited by:  Papers (3)

    Parallel simulation has been an active research area for more than a decade. The parallel simulation community needs a common benchmark suite for evaluating the performance of parallel simulation environments. Performance evaluation of a parallel simulation environment is harder than evaluating a parallel processing system, since the underlying system is composed not only of the architecture and operating system but also of the simulation kernel. Simulation kernel designers therefore face a twofold task: (i) to evaluate how efficiently their simulation kernel runs on certain architectures; and (ii) to evaluate how simulation problems scale using this kernel. In this paper we advocate an incremental benchmarking methodology that focuses on evaluating a parallel simulation system based on Time Warp. We start from a reduced set of ping models that can effectively estimate the various overheads, contention and latencies of Time Warp running on a multiprocessor. The benchmark suite has been used to locate several sources of overhead in an existing Time Warp implementation. Using this benchmark suite, we also compare the performance of the improved version of the Time Warp implementation with the original one.

  • Synchronisation in safety-critical distributed control systems

    Publication Year: 1995 , Page(s): 891 - 899 vol.2
    Cited by:  Papers (4)  |  Patents (12)

    Distributed computer systems for real-time control require a global timebase with high precision. A small time skew between the local clocks in the system is required to obtain good control performance through well-synchronised task execution, and it also provides a basis for efficient communication. In distributed safety-critical applications, clocks have traditionally been synchronised with fault-tolerant clock synchronisation algorithms. With these methods, a limited number of erroneous clock readings are allowed in each adjustment; on the other hand, readings from all clocks in the system are required before an adjustment can be made. In this paper an alternative approach, the Daisy Chain method, is proposed and compared with present solutions. Daisy Chain synchronisation does not allow erroneous clock readings, but methods of avoiding them are described. Due to its simplicity, the method can be implemented with little hardware. Low-precision frequency sources are sufficient, and recovery after arbitrary failures is fast because no special start-up phase is required. The paper also discusses the effects of quantisation uncertainty and transmission delay, and outlines the implementation of a global time base in an embedded distributed real-time architecture.

  • Implementation of real-time parallel processing in a motion control system

    Publication Year: 1995 , Page(s): 900 - 904 vol.2

    The paper proposes an architecture and implementation of a motion control system operating in a master-slave multiprocessor mode. Several major problems that must be considered in multiprocessor system design are studied, including the multiprocessor system architecture, the interconnection network, hardware circuit design and software design.

  • PVM implementation of quadtree building algorithms on SIMD hypercube system

    Publication Year: 1995 , Page(s): 855 - 858 vol.2

    Hierarchical data structures are commonly used to represent data in applications such as computer graphics, digital image processing and computer vision, and techniques are being developed to represent such data efficiently. Transforming bilevel images into linear quadtrees is one way of representing this high-volume data. In this paper, a preliminary investigation of transforming binary images into linear quadtrees using the Parallel Virtual Machine (PVM) system software is presented, together with the results obtained. Single Instruction Multiple Data (SIMD) hypercube algorithms implemented using PVM were tested under the DOS operating system on IBM-compatible PCs. The quadtree algorithm generates locational codes in pre-order and generally runs in O(log n) time; this paper tests the feasibility of achieving this time on an SIMD machine.
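What pre-order locational codes of a linear quadtree look like can be shown with a minimal sequential sketch. It assumes a square binary image whose side is a power of two and a hypothetical quadrant numbering (0 = NW, 1 = NE, 2 = SW, 3 = SE); it does not reproduce the paper's SIMD hypercube or PVM implementation.

```python
from typing import List, Tuple

# Hypothetical quadrant digits: 0 = NW, 1 = NE, 2 = SW, 3 = SE.
QUADRANT_OFFSETS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def linear_quadtree(img: List[List[int]]) -> List[Tuple[str, int]]:
    """Return (locational_code, colour) for every uniform leaf block, in pre-order."""
    n = len(img)
    if n == 0 or n & (n - 1):
        raise ValueError("image side must be a power of two")
    leaves: List[Tuple[str, int]] = []

    def visit(row: int, col: int, size: int, code: str) -> None:
        first = img[row][col]
        uniform = all(img[r][c] == first
                      for r in range(row, row + size)
                      for c in range(col, col + size))
        if uniform:
            # A uniform block becomes one leaf; its code is the path of quadrant digits.
            leaves.append((code, first))
            return
        half = size // 2
        # Recursing into children in digit order emits the leaves in pre-order.
        for digit, (dr, dc) in enumerate(QUADRANT_OFFSETS):
            visit(row + dr * half, col + dc * half, half, code + str(digit))

    visit(0, 0, n, "")
    return leaves
```

For the 4×4 image [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]] this returns [("0", 1), ("1", 0), ("2", 0), ("30", 0), ("31", 1), ("32", 1), ("33", 1)]. Some linear-quadtree variants store only the black leaves, which would keep just the codes with colour 1.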

  • Memory design for row/column/diagonal access

    Publication Year: 1995

    Summary form only given. Vectorizing involves parallel access to data elements in a random access memory (RAM). However, a single memory module of conventional design can access no more than one word during each cycle of the memory clock. One common solution is to partition the memory into multiple modules, or memory banks, with address interleaving, which imposes a number of disadvantages and restrictions on vectorizing. A different approach is to design memory modules with built-in access to commonly used array partitions. In this paper, a new memory organization is proposed in which words can be formed row-wise, column-wise or diagonally under the control of an external input. Behavioral and structural representations of this design have been defined.

  • An optimal lower bound on the maximum speedup in multiprocessors with clusters

    Publication Year: 1995 , Page(s): 640 - 649 vol.2
    Cited by:  Papers (1)

    We consider an ideal multiprocessor system with q processors and a centralized scheduler without overhead that selects processes from one common pool, permitting dynamic relocation of processes. A parallel program P consisting of n processes is executed on this system and terminates when all processes are completed. Due to synchronizations, processes may be blocked while waiting for events in other processes. The parallel program is executed using some schedule of processes to processors, resulting in a speedup σ. We then consider an ideal multiprocessor with k clusters containing u processors each, in which processes may not be relocated between clusters. Finding a schedule that results in maximum speedup is NP-hard. Here, we present a formula for the optimal lower bound on the maximum speedup for program P, as a function of q, n, σ, k and u. We also present a formula for the optimal lower bound when the number of processes (n) is unknown. Using these results, we are able to decide whether a certain schedule is close to optimal or whether it is worthwhile to look for other schedules. This is demonstrated by evaluating the speedup of a specific schedule of a particular program.

  • An optimal parallel algorithm for the Euclidean distance maps of binary images

    Publication Year: 1995

    The Euclidean distance map (EDM) of a black and white n×n binary image is the n×n map in which each element holds the Euclidean distance between the corresponding pixel and the nearest black pixel. The EDM plays an important role in machine vision, pattern recognition and robotics. Many algorithms have been proposed for computing the EDM. In recent years, O(n²)-time sequential algorithms have been presented for computing the EDM. Hirata and Kato (1994) showed that their algorithm can be parallelized to run in O(n²/p) time using p processors (1 ≤ p ≤ n) on the EREW PRAM. We present a parallel algorithm for computing the EDM. The algorithm runs in O(log n) time using n²/log n processors on the EREW PRAM, and in O(log n/log log n) time using n² log log n/log n processors on the common CRCW PRAM. The algorithm is optimal in the sense that the product of the time and the number of processors equals the lower bound on the sequential time for computing the EDM.
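The EDM definition above can be pinned down with a brute-force reference implementation, which is handy for checking faster algorithms on small images. This sketch takes O(n⁴) time for an n×n image and is not the paper's optimal EREW/CRCW PRAM algorithm; treating the value 1 as a black pixel is an assumption.

```python
import math
from typing import List


def euclidean_distance_map(img: List[List[int]]) -> List[List[float]]:
    """Brute-force EDM: distance from every pixel to the nearest black pixel (value 1)."""
    blacks = [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v == 1]
    if not blacks:
        raise ValueError("image has no black pixels")
    # For each pixel, take the minimum Euclidean distance over all black pixels.
    return [[min(math.hypot(r - br, c - bc) for br, bc in blacks)
             for c in range(len(img[0]))]
            for r in range(len(img))]
```

For the 3×3 image with a single black pixel in the centre, the map is 0 at the centre, 1 at the four edge-adjacent pixels and √2 at the corners.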

  • A performance comparison of buffering schemes for multistage switches

    Publication Year: 1995 , Page(s): 809 - 818 vol.2

    Multistage Interconnection Networks (MINs) are used to connect processors and memories in large-scale multiprocessor systems. MINs have also been proposed as switching fabrics in ATM networks for future Broadband ISDN networks. A MIN consists of several stages of small crossbar switching elements (SEs). Buffers are used in the SEs to increase the throughput of the MIN and to prevent internal loss of packets. Different buffering schemes for the SEs are discussed in this paper. The objective is to study the performance of MINs with different buffering schemes in the presence of uniform and hot spot traffic patterns. The results of the study will help network designers choose appropriate buffering strategies for MINs. To compare the buffering strategies, throughput and packet delay are used as the performance measures.

  • Analysis of shared buffer multistage networks with hot spot

    Publication Year: 1995 , Page(s): 799 - 808 vol.2
    Cited by:  Papers (1)

    Multistage interconnection networks based on shared buffering are known to have better performance and buffer utilization than input- or output-buffered switches. Shared buffer switches do not suffer from head-of-line blocking, which is a common problem in simple input buffering. Shared buffer switches have previously been studied under uniform and unbalanced traffic patterns. However, due to the complexity of the model, the performance of such a network in the presence of a single hot spot has not been fully explored. A hot spot arises when one of the outputs of the network becomes very popular. We develop a model for a multistage interconnection network constructed from shared buffer switching elements and operating under a hot spot traffic pattern. The model is validated by comparison with simulation results and is used to study the network performance in terms of throughput, packet delay, packet loss probability and optimal buffer utilization. Numerical results show that, in the presence of hot spot traffic, shared buffer switches degrade more significantly than switches with dedicated input and/or output buffers.
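Head-of-line blocking, which the abstract contrasts with shared buffering, can be illustrated with a toy slotted simulation of FIFO input queues. The lowest-input-wins arbitration rule and the function name are assumptions of this sketch, not part of the paper's model.

```python
from collections import deque
from typing import List


def drain_fifo_inputs(queues: List[List[int]]) -> int:
    """Count the time slots needed to empty FIFO input queues.

    Each queue entry is the destination output of one cell. In every slot each
    output accepts at most one head-of-line cell; ties go to the lowest-numbered
    input (an arbitration rule assumed for this sketch). Only head cells may
    contend, so a cell behind a blocked head waits even if its output is idle.
    """
    qs = [deque(q) for q in queues]
    slots = 0
    while any(qs):
        slots += 1
        granted = set()  # outputs already claimed in this slot
        for q in qs:
            if q and q[0] not in granted:
                granted.add(q.popleft())
    return slots
```

drain_fifo_inputs([[0], [0, 1]]) takes 3 slots: in slot 1, input 1's cell for output 1 waits behind its head cell, which lost arbitration for output 0, even though output 1 is idle. In a shared-buffer element, arriving cells sit in common memory rather than behind one another in a per-input FIFO, so this particular blocking does not arise.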

  • Handling data skew in parallel hash join computation using two-phase scheduling

    Publication Year: 1995 , Page(s): 527 - 536 vol.2
    Cited by:  Papers (1)  |  Patents (1)

    A large number of parallel join algorithms have been proposed to maintain load balancing in the presence of data skew. However, one important type of data skew, join product skew (JPS), has been little studied. In this paper, a dynamic parallel join algorithm that employs a two-phase scheduling procedure is designed to handle the JPS problem. Two sets of scheduling heuristics are evaluated under various parameters. It is shown that many of the existing algorithms can be regarded as special cases of our algorithm, whose cost depends on the nature of the data skew. While it can cope with JPS, which other algorithms cannot handle, it can be as efficient as most existing algorithms when JPS does not exist.
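Join product skew can be made concrete by computing, for each join key, the size of the join output |R_k|·|S_k| and then placing that work on processors. The largest-first greedy assignment below is a generic stand-in chosen for illustration; it is not the paper's two-phase scheduling procedure or either of its heuristic sets, and the function names are hypothetical.

```python
import heapq
from collections import Counter
from collections.abc import Hashable, Sequence
from typing import Dict, List


def join_product_sizes(r_keys: Sequence[Hashable], s_keys: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Per-key join output size |R_k| * |S_k| for keys present in both relations."""
    r_count, s_count = Counter(r_keys), Counter(s_keys)
    return {k: r_count[k] * s_count[k] for k in r_count if k in s_count}


def assign_largest_first(work: Dict[Hashable, int], processors: int) -> List[int]:
    """Greedily give the largest remaining key to the least-loaded processor.

    Returns the resulting load (join output tuples) of each processor.
    """
    heap = [(0, p) for p in range(processors)]  # (current load, processor id)
    loads = [0] * processors
    for _key, cost in sorted(work.items(), key=lambda kv: kv[1], reverse=True):
        load, p = heapq.heappop(heap)
        loads[p] = load + cost
        heapq.heappush(heap, (loads[p], p))
    return loads
```

With R and S both holding keys [1, 1, 1, 2, 3, 4], key 1 accounts for 6 of the 12 input tuples but 9 of the 12 output tuples, and assigning whole keys to two processors gives loads of 9 and 3. A scheduler that can only place whole keys cannot do better than 9 here, which is the situation a JPS-aware algorithm has to address.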

  • Towards a choice of region algorithms for the evaluation of parallel vision architectures

    Publication Year: 1995

    The research presented here focuses on the general problem of finding tools and methods for evaluating parallel machines. “Benchmarks” are the most popular tools for comparing machine speeds; unfortunately, they give no information on which hardware structures are most convenient for implementing a given vision problem. This paper tries to overcome this problem and proposes a characterization of a tool for the evaluation of parallel architectures (a generalization of the benchmark concept), focusing on the area of computer vision.

  • Design and performance measurements of an execution model for the parallel processing of Prolog programs

    Publication Year: 1995 , Page(s): 650 - 658 vol.2

    This paper presents a hierarchical parallel execution model for Prolog programs. The execution model uses Or-parallelism and And-parallelism as coarse-grain parallelism, and parallel unification as fine-grain parallelism. At the coarse-grain level we propose an extended And-Or tree, which can exploit a high degree of parallelism from Prolog programs. Parallelism is exploited using the binding-arrays method for Or-parallelism and the restricted And-parallelism (RAP) method for And-parallelism. At the fine-grain level, parallel unification is performed. In general, parallel unification consists of parallel argument matching and consistency checking; however, since the RAP method does not need consistency checking, consistency checking is also removed at the fine-grain level. Measurements of the degree of parallelism of this model are also presented.

  • Parallel program debugging using scalable visualization

    Publication Year: 1995 , Page(s): 699 - 708 vol.2
    Cited by:  Patents (3)

    The paper describes methods and tools for debugging parallel programs by visualizing and animating their execution behavior. Based on an evaluation and classification of existing visualization environments, the visualization and animation tool VISTOP (VISualization TOol for Parallel Systems) was developed as part of the integrated tool environment TOPSYS (TOols for Parallel SYStems) for programming distributed memory multiprocessors. VISTOP supports interactive online visualization of message-passing programs through various views, in particular a process-graph-based concurrency view for detecting synchronization and communication bugs.

  • Implementing photon event recognition algorithms on a 3D-flow system

    Publication Year: 1995 , Page(s): 761 - 769 vol.2
    Cited by:  Papers (1)

    This report describes an implementation, on the 3D-flow system developed at the Superconducting Super Collider Laboratory, of the algorithms and equipment used to recognize valid photon events through a morphological analysis of the signals of an intensified CCD in photon-counting mode. The analysis consists of calculating the coordinates of a matrix corresponding to the exact position of each incident photon on the channel plate. Several off-line calculations, with efficiency studies aimed at finding the best algorithm for event reconstruction, have been performed. This off-line algorithm can be executed in real time at the CCD input rate (up to 2000 frames/s). The communication-intensive nature of the algorithm and of the topology of this application, together with the particular architecture of the 3D-flow system, leads to a very efficient implementation. The existing hardware simulator allows the entire system to be studied before actual construction.

  • An improvement to dynamic stream handling in dataflow computers

    Publication Year: 1995

    This paper presents a new method of implementing dynamic streams of streams using token relabelling, which reduces the complexity and drawbacks of the previously proposed method due to Gaudiot. Consider a sequence of tokens Vi[ui] that appear in sequence on the stream-carrying arc. Two tokens Va[ux] and Vb[uy] are considered to belong to the same stream if they have the same context, i.e., [ux] = [uy]. Elements within a stream are ordered according to the time sequence in which they appear on the arc. Let the highest-level stream have the context [u0], that of the surrounding block; thus the highest-level stream is the sequence of values Vi[u0]. Each element of this stream has as its value a unique context, namely that of the stream it represents. So the token Vi[u0] identifies as a stream the sequence of tokens whose context is [Vi].
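The token-relabelling scheme described above can be sketched in a few lines: tokens arriving on one arc are grouped by context in arrival order, and each value of the outer stream is used as the context of an inner stream. Representing a token as a (value, context) pair and the context names "u0", "a" and "b" are assumptions for illustration only; this is not the paper's dataflow hardware mechanism.

```python
from collections import defaultdict
from collections.abc import Hashable
from typing import Dict, List, Tuple

Token = Tuple[Hashable, Hashable]  # (value, context), a hypothetical encoding


def group_streams(tokens: List[Token]) -> Dict[Hashable, List[Hashable]]:
    """Group the tokens seen on one arc into streams keyed by context, keeping arrival order."""
    streams: Dict[Hashable, List[Hashable]] = defaultdict(list)
    for value, context in tokens:
        streams[context].append(value)
    return dict(streams)


def stream_of_streams(tokens: List[Token], outer: Hashable) -> List[List[Hashable]]:
    """Expand the outer stream: each of its values is the context naming an inner stream."""
    streams = group_streams(tokens)
    return [streams.get(inner, []) for inner in streams.get(outer, [])]
```

For tokens [("a", "u0"), (10, "a"), ("b", "u0"), (20, "b"), (11, "a"), (21, "b")], stream_of_streams(tokens, "u0") returns [[10, 11], [20, 21]]: the inner streams interleave on the arc, yet their contexts keep them apart.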

  • Object-oriented expert systems parallel knowledge processing in a CORBA-based architecture

    Publication Year: 1995

    First-generation knowledge-based systems (KBS) are characterized by the separation of domain-specific knowledge from general problem-solving strategies. Such systems lack the following important abilities: knowledge acquisition according to different paradigms of representation and processing, definition of deep knowledge caused by physiological processes of reasoning, and management of different abstraction levels. Next-generation KBS provide a basis for managing these problems. Essential characteristics of these new systems are modularization of knowledge, distribution of knowledge across different hardware and software resources, and use of object-oriented technology for integrating symbolic and subsymbolic knowledge on different levels of abstraction.

  • A software instrumentation technique for performance tuning of message-passing programs

    Publication Year: 1995 , Page(s): 595 - 598 vol.2

    A major problem with collecting trace data for performance monitoring is its intrusiveness to the program being monitored: it sometimes distorts the run-time behaviour of the program so that the collected data no longer reflect the original program. We propose a new technique, called the postponing technique, that maintains the original program behaviour in order to collect accurate performance data. It preserves event order by equalizing the instrumentation delay for each pair of communication events. The technique does not take longer to execute than the conventional approach and is able to estimate the original event ordering. Our technique was implemented on a Connection Machine CM-5. We find that it yields more accurate event-ordering information than the conventional technique.

  • Performance modelling for a distributed ISDN protocol test system

    Publication Year: 1995 , Page(s): 819 - 828 vol.2

    Conformance testing of communication protocols has recently become a major issue in the context of OSI-based protocol standardization. The aim of conformance testing is to ensure that a protocol fulfils an OSI specification. A performance study is presented for a distributed protocol test system that has been installed for conformance testing of the ISDN D-channel signalling protocol. Using a general approach to performance measurement and evaluation in distributed systems, a queueing model is developed and evaluated based on runtimes obtained from measurements of the test system. It is demonstrated that significant performance improvements can be achieved once the process scheduling strategy at the ISDN protocol testers is properly adjusted.

  • Parallel algorithms and architectures for CPUs and dedicated processors: development and trends

    Publication Year: 1995 , Page(s): 939 - 948 vol.2

    Parallel algorithms are usually understood as those related to problems run on supercomputers characterized by a large number of processors interacting via a communication network. Parallel algorithms and architectures are also relevant at lower levels, in particular at the CPU level, where elementary arithmetic (and higher-order) operations are executed. It might be surprising that even today these algorithms and architectures are still under active investigation, owing to their importance in all computing mechanisms. A brief historical survey is also given to highlight development trends. At a higher level, the implementation of specific parallel algorithms (in particular for signal and image processing) is becoming necessary in an increasing number of applications. Several cases are considered (various transforms, a calorimeter for high-energy particle physics in the future supercollider at CERN, and a frequency band compressor for TV signals (MPEG)), pointing out that their efficient design is often still a bottleneck requiring considerable research effort. Issues such as CAD systems and related languages, defect and fault tolerance, testability and hardware/software co-design are finally discussed.

  • SIMD hypercube algorithm for complete Euclidean distance transform

    Publication Year: 1995 , Page(s): 874 - 877 vol.2
    Cited by:  Papers (8)

    The Euclidean distance transform (EDT) converts a binary image into one in which each pixel has a value equal to its Euclidean distance to the nearest foreground pixel. A parallel EDT algorithm for an SIMD hypercube computer is presented here. For an n×n image, the algorithm has a time complexity of O(n) on an n²-node machine. With modifications that minimize dependency among partitions, the algorithm can be adapted to compute large EDT problems on smaller hypercubes. On a hypercube of t² nodes, the time complexity of the modified algorithm is O(n²/t log n/t).

  • A multicast switching network for B-ISDN

    Publication Year: 1995 , Page(s): 916 - 919 vol.2
    Cited by:  Papers (1)

    To support broadband integrated services digital networks (B-ISDN), switching networks must be able to provide both multipoint connections (multicasting) and point-to-point connections (unicasting). This paper proposes a multicast switching network based on a recently proposed routing network consisting of two banyan networks with links at every stage that allow cells to be transferred to and from each banyan plane, thereby offering multiple paths between each input-output pair. The proposed multicast network employs the copy and routing networks in a parallel configuration. This allows unicast cells to proceed to the routing network without additional delay and keeps the copy network free of unicast traffic, so that a larger number of multicast requests are replicated successfully. Simulations showed that the proposed multicast network offers lower cell loss rates than other networks.

  • Orchid: the design of a parallel and portable software platform for local area networks

    Publication Year: 1995
    Cited by:  Papers (1)

    Orchid is a portable software platform that aims to decouple parallel software development from the underlying system. Because of its layered structure, Orchid can easily be ported to different architectures by reconstructing only its lowest level. It also provides advanced facilities not supported by most operating systems and software platforms.

  • A comparison between the powers of the PARBS and the RMBM

    Publication Year: 1995 , Page(s): 506 - 510 vol.2

    The Processor Array with Reconfigurable Bus System (PARBS) and the Reconfigurable Multiple Bus Machine (RMBM) are models of parallel computation based on reconfigurable buses and processor arrays. The PARBS is a processor array consisting of processors arranged in a two-dimensional grid with a reconfigurable bus system. The RMBM is also made up of processors and a reconfigurable bus system, but its processors are arranged in a row, and the number of processors and the number of buses are independent of each other. In this paper, we show that the computational power of the PARBS is equal to that of the RMBM, provided that both models are polynomially bounded, because each model can be simulated in constant time by the other.

  • Optimization in a hierarchical distributed performance monitoring system

    Publication Year: 1995 , Page(s): 537 - 544 vol.2
    Cited by:  Papers (8)

    Monitoring program execution in a distributed system can generate large quantities of data, and the collection and processing of the monitoring data is one of the primary factors contributing to the complexity of distributed monitoring. To reduce this complexity, a hierarchical distributed performance monitoring system has been developed. In this paper we describe an optimization method that improves the efficiency of the monitoring system. By considering the topology used by the application program and the distribution of monitoring records, an optimized grouping can be determined that improves the performance of the monitoring system. The experiments presented in this paper demonstrate such an improvement in performance.

  • Parallel image matching on a distributed system

    Publication Year: 1995 , Page(s): 870 - 873 vol.2

    Image matching based on image feature pixels involves heavily iterated computation and repeated memory access. In our previous work, the detection of interesting points was reported as an efficient pre-processing step for extracting binary images for further matching in terms of a distance measure. This paper presents our extension to a parallel implementation of the matching scheme for object recognition on a low-cost heterogeneous PVM (Parallel Virtual Machine) network. While most of the sequential execution time is spent on image feature extraction, distance transform and matching measurement, our investigation shows that a distributed-memory multicomputer can best meet the high computational and memory access demands of image processing. The performance is evaluated in terms of execution time. We conclude that parallel image processing can be implemented on a general distributed system to achieve speedup without specific hardware requirements.
