2010 International Conference on High Performance Computing (HiPC)

Date: 19-22 Dec. 2010

Displaying Results 1 - 25 of 42
  • [Front cover]

    Page(s): c1
  • Table of contents

    Page(s): 1 - 4
  • Diagnosing the root-causes of failures from cluster log files

    Page(s): 1 - 10

    System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Due to interactions among the system hardware and software components, the event logs of large cluster systems consist of streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of a given failure. Furthermore, the process of troubleshooting the causes of failures is largely manual and ad hoc. In this paper, we present a systematic methodology for reconstructing event order and establishing correlations among events which indicate the root causes of a given failure from very large syslogs. We developed a diagnostics tool, FDiag, that extracts log entries as structured message templates and uses statistical correlation analysis to establish probable cause-and-effect relationships for the fault being analyzed. We applied FDiag to analyze failures due to breakdowns in interactions between the Lustre file system and its clients on the Ranger supercomputer at the Texas Advanced Computing Center (TACC). The results are positive: FDiag is able to identify the dates and time periods that contain the significant events which eventually led to compute node soft lockups.
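
    The abstract describes two steps: turning raw syslog lines into structured message templates, and correlating template occurrences in time. A minimal sketch of that idea follows; it is not the FDiag implementation, and the masking regexes and the 60-second window are assumptions made for illustration.

    ```python
    import re
    from collections import Counter

    def to_template(line: str) -> str:
        """Reduce a raw syslog line to a message template by masking variable fields."""
        line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)  # addresses, handles
        line = re.sub(r"\b\d+\b", "<NUM>", line)             # counters, PIDs, timestamps
        return line.strip()

    def correlated_templates(events, window=60):
        """Count how often pairs of templates co-occur within `window` seconds.

        `events` is an iterable of (unix_time, raw_message) tuples; template pairs
        with high co-occurrence counts are candidate cause/effect relationships.
        """
        events = sorted((t, to_template(m)) for t, m in events)
        pair_counts = Counter()
        for i, (t_i, tmpl_i) in enumerate(events):
            seen = {tmpl_i}
            for t_j, tmpl_j in events[i + 1:]:
                if t_j - t_i > window:
                    break
                if tmpl_j not in seen:
                    pair_counts[(tmpl_i, tmpl_j)] += 1
                    seen.add(tmpl_j)
        return pair_counts.most_common(10)
    ```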

  • Balanced stream assignment for service facility

    Page(s): 1 - 10

    Shared data centers and clouds are gaining popularity because of their ability to reduce costs by increasing the utilization of server farms. In a shared server environment, a careful assignment of workload streams (all work-requests from a customer may constitute a stream) to servers is necessary to ensure good “end user” performance. In this work, we investigate the assignment of streams to servers so as to minimize an objective function while ensuring that load is balanced across all the servers. The objective functions we optimize include the overall expected waiting time, the overall probability of the wait exceeding a given value, and weighted versions of these measures. We obtain the optimal algorithm for a farm with 2 servers when sharing of streams among servers is allowed. Based on the insights obtained, we design an efficient algorithm for the multiserver case. By rounding off this solution, we obtain a solution for the case where sharing of streams is not allowed. Our trace-driven evaluation study shows that our algorithms significantly outperform baseline methods. Our work enables high performance for web hosting services as well as emerging Application as a Service (AaaS) clouds. We also show that solutions in areas such as task-level scheduling and file assignment fall within our framework.
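
    To give a feel for the objective being optimized, the sketch below greedily places each stream on the server where it adds the least M/M/1 expected waiting time, W = rho / (mu - lambda). This is only a baseline of the kind the paper compares against, not the optimal two-server or multiserver algorithms it derives.

    ```python
    def assign_streams(arrival_rates, service_rates):
        """Greedy baseline: streams are Poisson arrival rates, servers have service
        rates; each stream goes where the M/M/1 waiting time grows least."""
        def wait(lam, mu):
            return float("inf") if lam >= mu else (lam / mu) / (mu - lam)

        loads = [0.0] * len(service_rates)
        assignment = {}
        for stream, lam in sorted(enumerate(arrival_rates), key=lambda x: -x[1]):
            best = min(range(len(service_rates)),
                       key=lambda k: wait(loads[k] + lam, service_rates[k]))
            loads[best] += lam
            assignment[stream] = best
        return assignment
    ```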

  • Reducing data center power with server consolidation: Approximation and evaluation

    Page(s): 1 - 10

    With the growing costs of powering data centers, power management is gaining importance. Server consolidation in data centers, enabled by virtualization technologies, is becoming a popular option for organizations to reduce costs and improve manageability. While consolidation offers these benefits, it is important to ensure proper resource provisioning so that performance is not compromised. In addition to reducing the number of servers, recent hardware offers other knobs - such as frequency/voltage scaling - for finer-grained power control. In this paper, we look at exploiting server consolidation and frequency/voltage control to reduce power consumption while meeting certain provisioning guarantees. We formulate the problem as a variant of variable-sized bin packing. We show that the problem is NP-hard, and present an approximation algorithm for it. The algorithm takes O(n² log n) time for n workloads, and has a provable approximation ratio. Experimental evaluation shows that in practice our algorithm obtains solutions very close (< 6.5% difference) to optimal.
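
    The bin-packing view can be made concrete with a first-fit-decreasing style greedy: workloads have CPU demands, each server has a capacity (at its chosen frequency) and a power cost, and the goal is to power on as little capacity as possible. This is a generic baseline for illustration, not the paper's O(n² log n) approximation algorithm.

    ```python
    def consolidate(demands, server_caps, server_power):
        """First-fit-decreasing baseline for server consolidation.

        demands      : CPU demand of each workload
        server_caps  : capacity of each available server (same units as demands)
        server_power : power drawn by each server when switched on
        Returns (placement, total_power) with placement[w] = server index.
        """
        residual = {}                       # powered-on server -> remaining capacity
        placement = {}
        for w, d in sorted(enumerate(demands), key=lambda x: -x[1]):
            target = next((s for s, r in residual.items() if r >= d), None)
            if target is None:              # power on the cheapest server that fits
                candidates = [s for s, c in enumerate(server_caps)
                              if s not in residual and c >= d]
                if not candidates:
                    raise ValueError("workload does not fit on any remaining server")
                target = min(candidates, key=lambda s: server_power[s])
                residual[target] = server_caps[target]
            residual[target] -= d
            placement[w] = target
        return placement, sum(server_power[s] for s in residual)
    ```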

  • Impact of colinearity of sensors selected for location estimation

    Page(s): 1 - 10

    We consider estimating the location of a target moving in a 2D plane by combining distance measurements from multiple sensors. Given that available energy in sensors is at a premium, it is desirable to conserve energy by selecting a smaller number of sensors to measure distance and communicate with the central tracker. We propose heuristics on the basis of which a handful of sensors may be selected. In this paper, we present one such heuristic, based on the colinearity of sensors, discuss it at length, and provide a formal and theoretical basis for arriving at it as well as an evaluation of its utility.
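
    The intuition is that nearly collinear range sensors yield poorly conditioned position estimates. One simple way to score a candidate triple, assumed here purely for illustration and not taken from the paper, is the normalized area of the triangle the sensors form:

    ```python
    import math
    from itertools import combinations

    def collinearity_score(p1, p2, p3):
        """~0 for nearly collinear sensors, up to 1 for a well-spread triple:
        twice the triangle area divided by the product of the two longest sides."""
        (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
        twice_area = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
        d = sorted(math.dist(a, b) for a, b in ((p1, p2), (p2, p3), (p1, p3)))
        return twice_area / (d[-1] * d[-2]) if d[-1] * d[-2] > 0 else 0.0

    def pick_three_sensors(sensor_positions):
        """Select the least collinear triple of sensors for the next measurement."""
        return max(combinations(sensor_positions, 3),
                   key=lambda t: collinearity_score(*t))
    ```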

  • Low-overhead diskless checkpoint for hybrid computing systems

    Page(s): 1 - 10

    As the size of new supercomputers scales to tens of thousands of sockets, the mean time between failures (MTBF) is decreasing to just several hours, and long executions need some kind of fault tolerance method to survive failures. Checkpoint/Restart is a popular technique used for this purpose, but writing the state of a big scientific application to remote storage will become prohibitively expensive in the near future. Diskless checkpoint was proposed as a solution to avoid the I/O bottleneck of disk-based checkpoint. However, its complex, time-consuming encoding techniques hinder its scalability. At the same time, heterogeneous computing is becoming more and more popular in high performance computing (HPC), with new clusters combining CPUs and graphics processing units (GPUs). However, hybrid applications cannot always use all the resources available on the nodes, leaving some resources such as GPUs or CPU cores idle. In this work, we propose a hybrid diskless checkpoint (HDC) technique for GPU-accelerated clusters that can checkpoint CPU/GPU applications, does not require spare nodes, and can tolerate up to 50% of process failures with a low, sometimes negligible, checkpoint overhead.

  • GRS — GPU radix sort for multifield records

    Page(s): 1 - 10

    We develop a radix sort algorithm, GRS, suitable for sorting multifield records on a graphics processing unit (GPU). We assume the ByField layout for the records to be sorted. GRS is benchmarked against the radix sort algorithm, SDK, in NVIDIA's CUDA SDK 3.0 as well as the radix sort algorithm, SRTS, of Merrill and Grimshaw. Although SRTS is faster than both GRS and SDK when sorting numbers as well as records that have a key and one additional 32-bit field, both GRS and SDK outperform SRTS on records with 2 or more fields (in addition to the key). GRS is consistently faster than SDK on numbers as well as records with 1 or more fields. When sorting records with 9 32-bit fields, GRS is up to 74% faster than SRTS and up to 55% faster than SDK. Thus, GRS is the fastest way to radix sort records with more than one 32-bit field on a GPU.
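
    The benefit of the ByField layout is that satellite fields can be permuted once with the final ordering instead of being moved on every digit pass. A small CPU-side NumPy sketch of that idea follows; it is not the CUDA GRS kernels, and a stable argsort stands in for the counting-sort pass a GPU implementation would use.

    ```python
    import numpy as np

    def radix_sort_byfield(keys, fields, bits=8):
        """LSD radix sort of 32-bit keys; `fields` is a list of arrays, one per
        additional record field, stored separately (ByField layout)."""
        keys = np.asarray(keys, dtype=np.uint32)
        perm = np.arange(keys.size)
        mask = (1 << bits) - 1
        for shift in range(0, 32, bits):
            digit = (keys >> shift) & mask        # current digit of every key
            order = np.argsort(digit, kind="stable")
            keys, perm = keys[order], perm[order]
        return keys, [np.asarray(f)[perm] for f in fields]
    ```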

  • Dynamic social grouping based routing in a Mobile Ad-Hoc network

    Page(s): 1 - 8

    Movement patterns in Mobile Ad-Hoc networks are application specific, in the sense that different applications use nodes that travel along different paths. When these nodes are used in experiments involving social patterns, such as wildlife tracking, algorithms which detect and use these patterns can be used to improve routing efficiency. The intent of this paper is to introduce a routing algorithm which forms a series of social groups that accurately indicate a node's regular contact patterns while dynamically shifting to represent changes in the social environment. With the social groups formed, a probabilistic routing schema is used to effectively identify which social groups have consistent contact with the base station, and route accordingly. The algorithm can be implemented dynamically, in the sense that the nodes initially have no awareness of their environment, and it works to reduce overhead and message traffic while maintaining a high delivery ratio.

  • Fair bandwidth allocation in wireless mobile environment using max-flow

    Page(s): 1 - 10

    Wireless clients must associate to a specific Access Point (AP) to communicate over the Internet. Current association methods are based on maximum Received Signal Strength Indicator (RSSI), implying that a client associates to the strongest AP around it. This is a simple scheme that has performed well in purely distributed settings. Modern wireless networks, however, are increasingly being connected by a wired backbone. The backbone allows for out-of-band communication among APs, opening up opportunities for improved protocol design. This paper takes advantage of this opportunity through a coordinated client association scheme in which APs consider a global view of the network and decide on the optimal client-AP association. We show that such an association outperforms RSSI-based schemes in several scenarios, while remaining practical and scalable for wide-scale deployment. Although this is early work in this direction, our basic analytical framework (based on a max-flow formulation) can be extended to sophisticated channel and traffic models. Our future work is focused on designing and evaluating these extensions.
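
    A toy version of the max-flow formulation helps make the idea concrete: a source feeds each client up to its demand, clients connect to the APs they can hear, and each AP drains into a sink up to its capacity. The capacities and the final tie-breaking rule below are assumptions for illustration, not the paper's exact model.

    ```python
    import networkx as nx

    def associate(clients, aps, reachable, client_demand, ap_capacity):
        """Return a client -> AP association derived from a max-flow solution.

        reachable[c]  : APs that client c can hear
        client_demand : bandwidth requested by each client
        ap_capacity   : bandwidth each AP can serve
        """
        G = nx.DiGraph()
        for c in clients:
            G.add_edge("S", ("c", c), capacity=client_demand[c])
            for a in reachable[c]:
                G.add_edge(("c", c), ("a", a), capacity=client_demand[c])
        for a in aps:
            G.add_edge(("a", a), "T", capacity=ap_capacity[a])
        _, flow = nx.maximum_flow(G, "S", "T")
        assoc = {}
        for c in clients:                   # pick the AP carrying most of c's flow
            per_ap = flow.get(("c", c), {})
            if per_ap and max(per_ap.values()) > 0:
                assoc[c] = max(per_ap, key=per_ap.get)[1]
        return assoc
    ```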

  • Optimizing data acquisition by sensor-channel co-allocation in wireless sensor networks

    Page(s): 1 - 10

    Wireless sensor networks (WSNs) must handle multiple sensing tasks for various applications. How to improve the quality of the data acquired in such a resource-constrained environment is a challenging issue. In this paper, we propose a sensor-channel co-allocation model for scheduling sensing tasks. The proposed model considers the capability, coupling and load-balancing constraints of sensing data acquisition, and can guarantee transmission of sensed data in real time while efficiently avoiding data incompleteness. A spatiotemporal metric called sensing-span is proposed to evaluate the tasks' execution cost of achieving the desired data quality. We extend computation task scheduling algorithms to support the sensor-channel co-allocation problem, and a heuristic called Minimum Service Capability Fragment (MSCF) is introduced for task scheduling to minimize the waste of reserved channel capacity. Simulation results show that MSCF improves the performance of data acquisition in WSNs compared with other heuristics when scheduling a large number of concurrent data acquisition tasks.

  • Performance evaluation and optimization of random memory access on multicores with high productivity

    Page(s): 1 - 10

    The slow improvement of memory access latencies in comparison to CPU speeds has resulted in memory accesses dominating code performance. While architectural enhancements have benefited applications with data locality and sequential access, random memory access still remains a cause for concern. Several benchmarks have been proposed to evaluate random memory access performance on multicore architectures. However, the performance evaluation models used by the existing benchmarks do not fully capture the varying types of random access behaviour arising in practical applications. In this paper, we propose a new model for evaluating the performance of random memory access that better captures the random access behaviour demonstrated by applications in practice. We use our model to evaluate the performance of two popular multicore architectures, the Cell and the GPU. We also suggest novel optimizations on these architectures that significantly boost the performance of random accesses in comparison to conventional architectures. Performance improvements on these architectures typically come at the cost of reduced productivity, considering the extra programming effort involved. To address this problem, we propose libraries that incorporate these optimizations and provide innovatively designed programming interfaces that applications can use to achieve good performance without loss of productivity.
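
    For reference, the classic way to expose raw random-access latency is a dependent pointer chase over a random permutation, so that each load address comes from the previous load and the latency cannot be hidden. The snippet below is a generic microbenchmark of that kind (interpreter overhead included), not the evaluation model proposed in the paper.

    ```python
    import time
    import numpy as np

    def pointer_chase(n=1 << 22, steps=1 << 20, seed=0):
        """Average time per dependent random access over an n-element table."""
        rng = np.random.default_rng(seed)
        nxt = rng.permutation(n)              # random cycle(s) through the table
        idx = 0
        t0 = time.perf_counter()
        for _ in range(steps):
            idx = nxt[idx]                    # next address depends on this load
        return (time.perf_counter() - t0) / steps * 1e9   # ns per access

    print(pointer_chase())
    ```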

  • Anomaly detection in large-scale coalition clusters for dependability assurance

    Page(s): 1 - 10

    In large-scale high-performance computing systems, component failures become the norm rather than the exception. Failure occurrence, as well as its impact on system performance and operation costs, is becoming an increasingly important concern to system designers and administrators. When a compute node fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to effectively identify anomalies from the voluminous amount of noisy, high-dimensional data; manual detection is time-consuming, error-prone, and does not scale well. In this paper, we present an autonomic mechanism for anomaly detection in coalition clusters. It is composed of a set of techniques that facilitate automatic analysis of system health data. We apply data transformation to format health data in a uniform manner. Then principal variables are chosen by feature selection, which reduces the data size. Clustering and outlier detection are explored to identify nodes with anomalous behavior. We evaluate our prototype implementation on a production institution-wide computational grid. The results show that our mechanism can effectively detect faulty nodes with high accuracy and low computation overhead.
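
    The pipeline described (uniform formatting, feature selection, clustering, outlier detection) maps naturally onto off-the-shelf tools. The sketch below shows only the shape of such a pipeline with scikit-learn; the specific transforms, cluster count and threshold are assumptions, not the paper's choices.

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def flag_anomalous_nodes(health_matrix, n_components=5, n_clusters=4, z=3.0):
        """health_matrix: rows = compute nodes, columns = numeric health metrics.
        Returns indices of nodes unusually far from their cluster centroid."""
        X = StandardScaler().fit_transform(health_matrix)        # uniform scaling
        Z = PCA(n_components=n_components).fit_transform(X)      # principal variables
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(Z)
        dist = np.linalg.norm(Z - km.cluster_centers_[km.labels_], axis=1)
        return np.flatnonzero(dist > dist.mean() + z * dist.std())
    ```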

  • VCTlite: Towards an efficient implementation of virtual cut-through switching in on-chip networks

    Page(s): 1 - 12

    On-chip networks have rapidly emerged as the best interconnection choice for high-core-count chip multiprocessors (CMPs) because of the good scalability properties they offer. Their fast evolution has been accelerated by the large inheritance from the off-chip network domain. Many of the mechanisms and techniques previously developed in that area have been applied directly to the on-chip domain due to the perfect match between the features provided by those techniques and the requirements of on-chip networks; other mechanisms have been adapted in order to fit the needs of the new environment. In this paper we present a new example of such an adaptation. Although wormhole switching was initially chosen as the switching mechanism that best fits the characteristics of the on-chip domain because of its well-known low input buffer requirements, we show that an efficient implementation of virtual cut-through switching, specially adapted to the particular characteristics of the CMP domain, is feasible as well. Our implementation of virtual cut-through switching, carried out in a 45 nm technology, proves to be faster than a wormhole implementation while requiring no more area and reducing power consumption.

  • Optimizing an MPI weather forecasting model via processor virtualization

    Page(s): 1 - 10

    Weather forecasting models are computationally intensive applications. These models are typically executed in parallel machines and a major obstacle for their scalability is load imbalance. The causes of such imbalance are either static (e.g. topography) or dynamic (e.g. shortwave radiation, moving thunderstorms). Various techniques, often embedded in the application's source code, have been used to address both sources. However, these techniques are inflexible and hard to use in legacy codes. In this paper, we demonstrate the effectiveness of processor virtualization for dynamically balancing the load in BRAMS, a mesoscale weather forecasting model based on MPI parallelization. We use the Charm++ infrastructure, with its over-decomposition and object-migration capabilities, to move subdomains across processors during execution of the model. Processor virtualization enables better overlap between computation and communication and improved cache efficiency. Furthermore, by employing an appropriate load balancer, we achieve better processor utilization while requiring minimal changes to the model's code.

  • Link-heterogeneity vs. node-heterogeneity in clusters

    Page(s): 1 - 8

    Heterogeneity in resources pervades all modern computing platforms. How do the effects of heterogeneity depend on which resources differ among computers in a platform? Some answers are derived within a formal framework, by comparing heterogeneity in computing power (node-heterogeneity) with heterogeneity in communication speed (link-heterogeneity). The former genre of heterogeneity seems much easier to understand than the latter.

  • Automatic dataflow application tuning for heterogeneous systems

    Page(s): 1 - 10

    Due to the increasing prevalence of multicore microprocessors and accelerator technologies in modern supercomputer design, new techniques for designing scientific applications are needed in order to efficiently leverage all of the power inherent in these systems. The dataflow programming paradigm is better suited to application design for distributed and heterogeneous systems than other techniques. Traditionally in dataflow middleware, application data domains are statically partitioned and distributed among the processors using a demand-driven algorithm. Unfortunately, this task scheduling technique can cause severe load imbalances in heterogeneous environments. Furthermore, in the presence of different types of processors, the optimum data size can be different for each processor type. To solve the load imbalance problem and to exploit this variation in optimum data size within a dataflow framework, we present an algorithm which automatically partitions the application workspace. By putting this partitioning into the purview of the dataflow runtime system, we can adaptively change the size of data buffers and correctly balance the load. Experiments with four applications show that our technique allows developers to skip the tedious and error-prone step of manually tuning the data granularity. Our technique is always competitive with the best-known data partitioning for these experiments, and can beat it under certain constraints.

  • A reliable data transport protocol for partitioned actors in Wireless Sensor and Actor Networks

    Page(s): 1 - 8

    In Wireless Sensor and Actor Networks (WSANs), effective Actor-Actor Communication (AAC) is an important requirement for timely responses to events reported by the sensors. However, due to the scattered nature of events, the mobility of actor nodes, and the low density of actor nodes, the network of actor nodes tends to get partitioned frequently. To provide effective AAC in such situations, the energy-constrained sensor nodes located between the partitioned actor nodes need to be utilized. This solution for healing actor network partitions should involve minimal use of the sensor nodes so that the network lifetime is maximized. In this work, we propose an energy-efficient Actor-Actor Reliable Transport Protocol (A2RT) for WSANs whose actor nodes are equipped with directional antennas and dual radio interfaces. Our proposed transport protocol consists of a transport wrapper and a dynamic priority scheduler. Using simulations, we show that our transport wrapper achieves high reliability with minimal retransmissions under both static and dynamic network topology conditions. The results also show that the traffic scheduler of our protocol helps achieve the goal of real-time delivery by maximizing the number of packets that meet the delay constraints.

  • Highly scalable parallel collaborative filtering algorithm

    Page(s): 1 - 10

    Collaborative filtering (CF) based recommender systems have gained wide popularity in Internet companies like Amazon, Netflix, Google News, and others. These systems make automatic predictions about the interests of a user by inferring from information about like-minded users. Real-time CF on highly sparse massive datasets, while achieving a high prediction accuracy, is a computationally challenging problem. In this paper, we present the design of a soft real-time (around 1 min.) parallel CF algorithm based on the Concept Decomposition technique. Our parallel algorithm has been optimized for multicore/many-core architectures while maintaining a prediction accuracy of 0.84 RMSE. Using the Netflix dataset, we demonstrate the performance and scalability of our algorithm (in both batch mode and online mode) on a 32-core Power6 based SMP system. Our parallel algorithm delivers a training time of 64 s on the full Netflix dataset and a prediction time of 4.5 s on 1.4M ratings (3.2 μs per rating prediction). This is 12.6× better than the best known sequential training time and around 33× better than the best known sequential prediction time, along with high accuracy (0.84 RMSE). To the best of our knowledge, this is also the best known parallel performance at such high accuracy.
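
    Concept decomposition approximates a matrix by a least-squares fit onto the normalized centroids ("concept vectors") produced by spherical k-means. The dense, sequential NumPy sketch below shows only the shape of that technique applied to a ratings matrix; the paper's sparse, parallel formulation and its treatment of missing ratings are not reproduced here.

    ```python
    import numpy as np

    def concept_decomposition(R, k=20, iters=20, seed=0):
        """Approximate R (users x items) as C @ Z, where C holds concept vectors
        from spherical k-means over the item columns and Z is a least-squares fit."""
        rng = np.random.default_rng(seed)
        cols = R / (np.linalg.norm(R, axis=0, keepdims=True) + 1e-12)
        C = cols[:, rng.choice(R.shape[1], k, replace=False)].copy()
        for _ in range(iters):                           # spherical k-means
            labels = np.argmax(C.T @ cols, axis=0)       # nearest concept by cosine
            for j in range(k):
                members = cols[:, labels == j]
                if members.size:
                    v = members.sum(axis=1)
                    C[:, j] = v / (np.linalg.norm(v) + 1e-12)
        Z, *_ = np.linalg.lstsq(C, R, rcond=None)        # least-squares coefficients
        return C @ Z                                     # dense matrix of predictions
    ```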

  • EMC2: Extending Magny-Cours coherence for large-scale servers

    Page(s): 1 - 10

    The demand for larger and more powerful high-performance shared-memory servers has grown over the last few years. To meet this need, AMD has recently launched the twelve-core Magny-Cours processors. They include a directory cache (Probe Filter) that increases the scalability of the coherence protocol used by Opterons, which is based on the coherent HyperTransport interconnect (cHT). cHT limits the number of addressable nodes to 8; the recent High Node Count HT specification overcomes this limitation. However, the 3-bit pointer used by the Probe Filter prevents Magny-Cours-based servers from being built beyond 8 nodes. In this paper, we propose and develop external logic to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the Probe Filter. Evaluation results for systems of up to 32 nodes show how the performance offered by our solution scales with the number of nodes, enhancing the Probe Filter's effectiveness by filtering additional messages. In particular, we reduce runtime by 47% in a 32-die system with respect to the 8-die Magny-Cours system.

  • A study of memory-aware scheduling in message driven parallel programs

    Page(s): 1 - 10

    This paper presents a simple but powerful memory-aware scheduling mechanism that adaptively schedules tasks in a message-driven distributed-memory parallel program. The scheduler adapts its behavior whenever memory usage exceeds a threshold by scheduling tasks known to reduce memory usage. The usefulness of the scheduler and its low overhead are demonstrated in the context of an LU matrix factorization program. In the LU program, only a single additional line of code is required to make use of the new general-purpose memory-aware scheduling mechanism. Without memory-aware scheduling, the LU program can only run with small problem sizes, but with the new memory-aware scheduling, the program scales to larger problem sizes.
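
    The policy itself is easy to state: once estimated memory usage crosses a threshold, draw from the pool of tasks known to release memory before anything else. The toy scheduler below captures that policy; the class and method names are invented for illustration and the Charm++ integration is not shown.

    ```python
    import heapq
    from itertools import count

    class MemoryAwareScheduler:
        """Prefer memory-reducing tasks whenever estimated usage exceeds a threshold."""

        def __init__(self, threshold_bytes):
            self.threshold = threshold_bytes
            self.used = 0
            self._tie = count()
            self.normal, self.reducers = [], []          # two priority queues

        def submit(self, task, priority, mem_delta):
            """mem_delta < 0 marks a task known to release memory when it runs."""
            queue = self.reducers if mem_delta < 0 else self.normal
            heapq.heappush(queue, (priority, next(self._tie), mem_delta, task))

        def next_task(self):
            over = self.used > self.threshold
            queue = self.reducers if (over and self.reducers) else (self.normal or self.reducers)
            if not queue:
                return None
            _, _, mem_delta, task = heapq.heappop(queue)
            self.used += mem_delta                       # update estimated usage
            return task
    ```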

  • Power-efficient workload distribution for virtualized server clusters

    Page(s): 1 - 10

    With the growing cost of electricity, the power management of server clusters has become an important problem. Most existing research, however, either does not apply to virtualized environments or does not focus on power-efficient workload distribution. To fill this research gap, we propose a workload distribution algorithm for virtualized server clusters that reduces their power consumption while providing quality of service (QoS). Built upon optimization, queuing theory and control theory techniques, our approach achieves this design goal, providing QoS to a larger number of requests with a smaller amount of power consumption.
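
    As a toy illustration of the queuing-theory angle, the function below keeps just enough identical virtualized servers active for the per-server M/M/1 response time to meet a QoS target when arrivals are split evenly. The model and numbers are assumptions, not the paper's controller.

    ```python
    import math

    def servers_needed(arrival_rate, service_rate, target_response):
        """Smallest number of active M/M/1 servers, with arrivals split evenly,
        whose response time 1 / (mu - lambda/n) stays within the QoS target."""
        n = max(1, math.floor(arrival_rate / service_rate) + 1)   # ensure lambda/n < mu
        while 1.0 / (service_rate - arrival_rate / n) > target_response:
            n += 1
        return n

    # e.g. 900 req/s in total, 100 req/s per VM, 50 ms response-time target
    print(servers_needed(900.0, 100.0, 0.05))   # -> 12 active servers
    ```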

  • An SOA approach to high performance scientific computing: Early experiences

    Page(s): 1 - 10

    Service Oriented Architecture (SOA) has been embraced in enterprise computing for several years. The scientific community has long felt the need for an SOA infrastructure that offers not only the convenience of enterprise SOA but also the expected level of high-performance capability. Our research has produced an SOA middleware (ANU-SOAM) which supports an already popular enterprise SOA middleware API (the Platform Symphony API) with the desired level of performance for scientific computations such as a Conjugate Gradient Solver. We have extended the compute services of ANU-SOAM with a common data service (CDS) shared between the client and the service instances. The aim is to improve application performance by reducing communication, or communication cost, between the client and the service instances with the help of the CDS. This is achieved by enabling tasks to perform a deferred put operation on the common data from their service instances, with the results of the put operation only becoming visible to the next generation of tasks. These updates can be synchronised (committed) at the CDS at the direction of the client. This property enables applications on ANU-SOAM to overcome the latency of poor networks (or the `cloud') between client and service instances. Experimental results on a small Gigabit Ethernet cluster show that, for the Conjugate Gradient Solver, the ANU-SOAM version suffers no appreciable performance loss compared with MPI versions, and that the CDS enhances N-Body Solver performance, with good scalability in both cases.
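
    The deferred-put semantics of the CDS can be pictured with a tiny versioned store: puts from tasks are buffered, and they become visible only after the client commits them for the next generation of tasks. The names below are illustrative; this is not the ANU-SOAM API.

    ```python
    class CommonDataService:
        """Key/value store with deferred puts published by client-driven commits."""

        def __init__(self):
            self._visible = {}      # data the current generation of tasks can read
            self._pending = {}      # deferred puts, invisible until commit()

        def get(self, key):
            return self._visible[key]

        def deferred_put(self, key, value):
            """Called by tasks; the update is buffered, not applied immediately."""
            self._pending[key] = value

        def commit(self):
            """Called by the client between task generations to publish updates."""
            self._visible.update(self._pending)
            self._pending.clear()
    ```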

  • A space-efficient parallel algorithm for computing betweenness centrality in distributed memory

    Page(s): 1 - 10

    Betweenness centrality is a measure based on shortest paths that attempts to quantify the relative importance of nodes in a network. As computation of betweenness centrality becomes increasingly important in areas such as social network analysis, networks of interest are becoming too large to fit in the memory of a single processing unit, making parallel execution a necessity. Parallelization over the vertex set of the standard algorithm, with a final reduction of the centrality for each vertex, is straightforward but requires Ω(|V|²) storage. In this paper we present a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm. Our algorithm requires O(|V| + |E|) storage and enables efficient parallel execution. Our algorithm is especially well suited to distributed memory processing because it can be implemented using coarse-grained parallelism. The presented time bounds for parallel execution of our algorithm on a CRCW PRAM and on distributed memory systems both show good asymptotic performance. Experimental results with a distributed memory computer show the practical applicability of our algorithm.
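
    The best known sequential algorithm referred to here is Brandes' algorithm, which keeps only O(|V| + |E|) working state per source and accumulates per-source dependency contributions, which is what makes a partition-the-sources, reduce-at-the-end parallelization natural. A plain single-threaded version for unweighted graphs, for reference:

    ```python
    from collections import deque

    def brandes_betweenness(adj):
        """Betweenness centrality for an unweighted graph given as adjacency lists.
        In a distributed setting, sources can be split across ranks and the `bc`
        arrays summed at the end."""
        bc = {v: 0.0 for v in adj}
        for s in adj:
            sigma = {v: 0 for v in adj}       # number of shortest s-v paths
            dist = {v: -1 for v in adj}
            preds = {v: [] for v in adj}
            sigma[s], dist[s] = 1, 0
            order, queue = [], deque([s])
            while queue:                      # BFS from s
                v = queue.popleft()
                order.append(v)
                for w in adj[v]:
                    if dist[w] < 0:
                        dist[w] = dist[v] + 1
                        queue.append(w)
                    if dist[w] == dist[v] + 1:
                        sigma[w] += sigma[v]
                        preds[w].append(v)
            delta = {v: 0.0 for v in adj}
            for w in reversed(order):         # back-propagate dependencies
                for v in preds[w]:
                    delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
                if w != s:
                    bc[w] += delta[w]
        return bc
    ```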

  • Okeanos: Reconfigurable fault-tolerant transactional storage supporting object deletions

    Page(s): 1 - 10

    Over the past years, many peer-to-peer (P2P) distributed hash tables (DHTs) have been proposed. Given their excellent scalability properties, they are nowadays core technologies used in industry, e.g. by Amazon and Facebook. However, most DHTs still exhibit major drawbacks limiting their applicability. First, they hardly give consistency guarantees and do not support transactional semantics, which precludes applications that rely on strong consistency (e.g. banking, trading, accounting). Second, a key piece of functionality still missing is the ability to consistently delete data objects physically from master-less data partitions. All existing DHTs either use soft-state objects that expire after a certain amount of time or simulate deletions by marking objects as deleted while keeping them allocated. In this paper, we present Okeanos, the first fault-tolerant transactional master-less key/value store supporting true physical deletions. Okeanos itself is not a DHT, but it can be used as a building block to implement consistent partitions of larger distributed storage systems. Further, the nodes that host an Okeanos store can be exchanged at runtime (reconfiguration) without significant periods of unavailability. We intend to use Okeanos to build a reliable large-scale P2P storage system.
