Proceedings of the 20th International Conference on Data Engineering (ICDE 2004)

Date: 30 March - 2 April 2004

Displaying Results 1 - 25 of 135
  • ItCompress: an iterative semantic compression algorithm

    Page(s): 646 - 657

    Real datasets are often large enough to necessitate data compression. Traditional 'syntactic' data compression methods treat the table as a large byte string and operate at the byte level. The tradeoff in such cases is usually between the ease of retrieval (the ease with which one can retrieve a single tuple or attribute value without decompressing a much larger unit) and the effectiveness of the compression. In this regard, the use of semantic compression has generated considerable interest and motivated certain recent works. We propose a semantic compression algorithm called ItCompress (ITerative Compression), which achieves good compression while permitting access even at the attribute level, without requiring the decompression of a larger unit. ItCompress iteratively improves the compression ratio of the compressed output during each scan of the table. The amount of compression can be tuned based on the number of iterations. Moreover, the initial iterations alone provide significant compression, making it a cost-effective compression technique. Extensive experiments were conducted, and the results indicate the superiority of ItCompress with respect to previously known techniques such as 'SPARTAN' and 'fascicles'.
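    Illustration (not from the paper): a minimal Python sketch of one pass of a representative-row style semantic compressor, in the spirit the abstract describes. The function names, the matching rule, and the encoding format here are assumptions made for illustration only.

    def compress_pass(rows, representatives):
        """Encode each row against its best-matching representative row.

        rows            -- list of tuples (the table)
        representatives -- non-empty list of representative tuples chosen earlier
        Returns a list of (rep_index, {attr_pos: outlier_value}) encodings.
        """
        encoded = []
        for row in rows:
            # pick the representative sharing the most attribute values with this row
            best_i, best_rep = max(
                enumerate(representatives),
                key=lambda ir: sum(a == b for a, b in zip(row, ir[1])),
            )
            # store only the attributes that differ (the outlying values)
            outliers = {i: v for i, (v, r) in enumerate(zip(row, best_rep)) if v != r}
            encoded.append((best_i, outliers))
        return encoded

    def fetch_attribute(encoding, representatives, attr_pos):
        # attribute-level access: a single value is recovered without
        # decompressing any larger unit
        rep_i, outliers = encoding
        return outliers.get(attr_pos, representatives[rep_i][attr_pos])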

  • From sipping on a straw to drinking from a fire hose: data integration in a public genome database

    Page(s): 795 - 798

    Biology is a vast domain. The Mouse Genome Informatics (MGI) system, which focuses on the biology of the laboratory mouse, covers only a small, carefully chosen slice. Nevertheless, we deal with data of immense variety, deep complexity, and exponentially growing volume. Our role as an integration nexus is to add value by combining data sets of diverse types and origins, eliminating redundancy and resolving conflicts. We briefly describe some of the issues we face and the approaches we have adopted to the integration problem.

  • Unordered tree mining with applications to phylogeny

    Page(s): 708 - 719

    Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data, such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. We present a new FSM technique for finding patterns in rooted unordered labeled trees. The patterns of interest are cousin pairs in these trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree T, our algorithm finds all interesting cousin pairs of T in O(|T|²) time, where |T| is the number of nodes in T. Experimental results on synthetic data and phylogenies show the scalability and effectiveness of the proposed technique. To demonstrate the usefulness of our approach, we discuss its applications to locating co-occurring patterns in multiple evolutionary trees, evaluating the consensus of equally parsimonious trees, and finding kernel trees of groups of phylogenies. We also describe extensions of our algorithms for undirected acyclic graphs (or free trees).
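    Illustration (not from the paper): a small Python sketch that enumerates cousin-like pairs in a rooted tree, i.e., pairs of nodes whose lowest common ancestor sits a fixed number of levels above both. The paper's exact pair definition and its O(|T|²) algorithm differ; the names and the distance parameter here are assumptions.

    from itertools import combinations

    def descendants_at_depth(tree, node, depth):
        """tree: dict mapping a node to its list of children; return the nodes `depth` levels below `node`."""
        frontier = [node]
        for _ in range(depth):
            frontier = [c for n in frontier for c in tree.get(n, [])]
        return frontier

    def cousin_pairs(tree, root, dist):
        """Unordered pairs of nodes whose lowest common ancestor lies exactly
        `dist` levels above both of them (dist=1: siblings, dist=2: cousins, ...)."""
        pairs = set()
        stack = [root]
        while stack:
            ancestor = stack.pop()
            children = tree.get(ancestor, [])
            stack.extend(children)
            # group candidate nodes by the child subtree they fall in, so the
            # two members of a pair always come from different subtrees
            groups = [descendants_at_depth(tree, c, dist - 1) for c in children]
            for i, j in combinations(range(len(groups)), 2):
                for u in groups[i]:
                    for v in groups[j]:
                        pairs.add(frozenset((u, v)))
        return pairs

    # tiny example: x and y are siblings under a; z is their cousin at distance 2
    tree = {"r": ["a", "b"], "a": ["x", "y"], "b": ["z"]}
    print(cousin_pairs(tree, "r", 2))  # pairs {x, z} and {y, z}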

  • Efficient incremental validation of XML documents

    Page(s): 671 - 682

    We discuss incremental validation of XML documents with respect to DTDs and XML Schema definitions. We consider insertions and deletions of subtrees, as opposed to leaf nodes only, and we also consider the validation of ID and IDREF attributes. For arbitrary schemas, we give an algorithm with worst-case O(n log n) time and linear space, and show that it is often far superior to revalidation from scratch. We present two classes of schemas, which capture most real-life DTDs, and show that they admit a logarithmic-time incremental validation algorithm that, in many cases, requires only constant auxiliary space. We then discuss an implementation of these algorithms that is independent of, and can be customized for, different storage mechanisms for XML. Finally, we present extensive experimental results showing that our approach is highly efficient and scalable.

  • Approximate selection queries over imprecise data

    Page(s): 140 - 151

    We examine the problem of evaluating selection queries over imprecisely represented objects. Such objects are used either because they are much smaller in size than the precise ones (e.g., compressed versions of time series), or as imprecise replicas of fast-changing objects across the network (e.g., interval approximations for time-varying sensor readings). It may be impossible to determine whether an imprecise object meets the selection predicate, and the objects appearing in the output are also imprecise. Retrieving the precise objects themselves (at additional cost) can be used to increase the quality of the reported answer. We allow queries to specify their own answer quality requirements, and we show how the query evaluation system can do the minimal amount of work needed to meet those requirements. Our work makes two important contributions: first, it considers queries with set-based answers, rather than the approximate aggregate queries over numerical data examined in the literature; second, it aims to minimize the combined cost of both data processing and probe operations in a single framework. Thus, we establish that the answer accuracy/performance tradeoff can be realized in a more general setting than previously seen.
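    Illustration (not from the paper): a hedged Python sketch of three-valued selection over interval-approximated objects with a bounded probe budget. The API, the YES/NO/MAYBE verdicts, and the probe function are assumptions for illustration.

    YES, NO, MAYBE = "yes", "no", "maybe"

    def interval_predicate(lo, hi, threshold):
        if lo > threshold:
            return YES       # every possible precise value satisfies the predicate
        if hi <= threshold:
            return NO        # no possible precise value satisfies it
        return MAYBE         # the interval straddles the threshold

    def approximate_select(objects, threshold, probe, max_probes):
        """objects: dict id -> (lo, hi). probe(id) fetches the precise value at extra cost.
        Probe only as many MAYBE objects as the quality requirement (max_probes) allows."""
        answer, uncertain = set(), []
        for oid, (lo, hi) in objects.items():
            verdict = interval_predicate(lo, hi, threshold)
            if verdict == YES:
                answer.add(oid)
            elif verdict == MAYBE:
                uncertain.append(oid)
        for oid in uncertain[:max_probes]:        # spend the probe budget
            if probe(oid) > threshold:
                answer.add(oid)
        return answer, uncertain[max_probes:]     # remaining uncertainty is reported

    # usage: one probe resolves object "c"; "a" is certain, "b" is certainly out
    objs = {"a": (5.0, 9.0), "b": (0.0, 2.0), "c": (2.5, 3.5)}
    print(approximate_select(objs, 3.0, probe=lambda oid: 3.2, max_probes=1))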

  • GODIVA: lightweight data management for scientific visualization applications

    Page(s): 732 - 743

    Scientific visualization applications are very data-intensive, with high demands for I/O and data management. Developers of many visualization tools hesitate to use traditional DBMSs, due to the lack of support for these DBMSs on parallel platforms and the risk of reducing the portability of their tools and the user data. We propose the GODIVA framework, which provides simple database-like interfaces to help visualization tool developers manage their in-memory data, along with I/O optimizations such as prefetching and caching to improve input performance at run time. We implemented the GODIVA interfaces in a stand-alone, portable user library, which can be used by all types of visualization codes: interactive and batch-mode, sequential and parallel. Performance results from running a visualization tool using the GODIVA library on multiple platforms show that the GODIVA framework is easy to use, alleviates developers' data management burden, and can bring substantial I/O performance improvement.

  • EShopMonitor: a Web content monitoring tool

    Page(s): 817 - 820

    Data presented on commerce sites runs into thousands of pages and is typically delivered from multiple back-end sources. This makes it difficult to identify incorrect, anomalous, or interesting data, such as $9.99 air fares, missing links, drastic changes in prices, or the addition of new products or promotions. We describe a system that monitors Web sites automatically and generates various types of reports so that the content of the site can be monitored and its quality maintained. Our solution consists of a site crawler that crawls dynamic pages, an information miner that learns to extract useful information from the pages based on examples provided by the user, and a reporter that can be configured by the user to answer specific queries. The tool can also be used to identify price trends and new products or promotions at competitor sites. A pilot run of the tool has been successfully completed at the ibm.com site.

  • Web service composition through declarative queries: the case of conjunctive queries with union and negation

    A Web service operation can be seen as a function op: X1, ..., Xn → Y1, ..., Ym having an input message (request) with n arguments (parts) and an output message (response) with m parts. We study the problem of deciding whether a query Q is feasible, i.e., whether there exists a logically equivalent query Q' that can be executed observing the limited access patterns given by the Web service (source) relations. Executability depends on the specific syntactic form of a query, while feasibility is a more "robust" semantic notion involving all equivalent queries (e.g., reorderings, minimized queries, etc.). We show that deciding query feasibility (called "stability") is NP-complete for conjunctive queries (CQ) and for conjunctive queries with union (UCQ).

  • Mining the Web for generating thematic metadata from textual data

    Conventional tools for automatic metadata creation mostly extract named entities or patterns from texts and annotate them with information about persons, locations, dates, and so on. However, this kind of entity type information is often too primitive for more advanced intelligent applications such as concept-based search. Here, we try to generate semantically deep metadata with limited human intervention. The main idea behind our approach is to use Web mining and categorization techniques to create thematic metadata. The proposed approach comprises three computational modules: feature extraction, HCQF (hier-concept query formulation), and text instance categorization. The feature extraction module sends the names of text instances to Web search engines, and the returned highly ranked search-result pages are used to describe them.

  • Superimposed applications using SPARCE

    People often impose new interpretations onto existing information. In the process, they work with information in two layers: a base layer, where the original information resides, and a superimposed layer, where only the new interpretations reside. Abstractions defined in the Superimposed Pluggable Architecture for Contexts and Excerpts (SPARCE) ease communication between the two layers. SPARCE provides three key abstractions for superimposed information management: mark, context, and excerpt. We demonstrate two applications, RIDPad and Schematics Browser, for use in the appeal process of the US Forest Service (USFS).

  • Querying about the past, the present, and the future in spatio-temporal databases

    Page(s): 202 - 213

    Moving objects (e.g., vehicles in road networks) continuously generate large amounts of spatio-temporal information in the form of data streams. Efficient management of such streams is a challenging goal due to the highly dynamic nature of the data and the need for fast, online computations. We present a novel approach for approximate query processing about the past, the present, or the future in spatio-temporal databases. In particular, we first propose an incrementally updateable, multidimensional histogram for present-time queries. Second, we develop a general architecture for maintaining and querying historical data. Third, we implement a stochastic approach for predicting the results of queries that refer to the future. Finally, we experimentally prove the effectiveness and efficiency of our techniques using a realistic simulation.

  • ContextMetrics: semantic and syntactic interoperability in cross-border trading systems

    Page(s): 808 - 811

    We describe a method and system for quantifying the variances in the semantics and syntax of electronic transactions exchanged between business counterparties. ContextMetrics enables (a) dynamic transformations of outbound and inbound transactions needed to effect 'straight-through processing' (STP); (b) unbiased assessments of counterparty systems' capabilities to support STP; and (c) modeling of operational risks and financial exposures stemming from an enterprise's transactional systems.

  • Efficient execution of computation modules in a model with massive data

    Models and simulations for analyzing difficult real-world problems often deal with massive amounts of data. The data problem is compounded when the analysis must be run repeatedly to perform what-if analyses. One such model is the Integrated Consumable Item Support (ICIS) Model developed for the U.S. Defense Logistics Agency (DLA). It models DLA's ability to satisfy future wartime requirements for parts, fuel, and food. ICIS uses a number of computation modules to project demands, model sourcing, and identify potential problem items for various commodities and military services. These modules are written in a variety of computer languages and must work together to generate an ICIS analysis.

  • XJoin index: indexing XML data for efficient handling of branching path expressions

    We consider the problem of indexing XML data for solving branching path expressions with the aim of reducing the number of joins to be executed, and we propose a simple yet efficient join indexing approach to shrink the twig before applying any structural join algorithm. The indexing technique we propose, which we call XJoin Index, precomputes some structural (semi-)join results, thus reducing the number of joins to be computed. Precomputed (semi-)joins support the following operations: (i) attribute selections, possibly involving several attributes; (ii) detection of parent-child relationships; (iii) counting selections, such as "find all books with at least 3 authors". Unlike other approaches based on specialized data structures, XJoin Index is entirely based on B+-trees and can be coupled with any structural join algorithm proposed so far.

  • Routing XML queries

    In file-sharing P2P networks, a fundamental problem is that of identifying databases that are relevant to user queries; this is referred to as the location problem in the P2P literature. We propose a scalable solution to the location problem in a data-sharing P2P network consisting of a network of XML database nodes and XML router nodes, and make the following contributions. We develop the internal organization and routing protocols for the XML router nodes, to enable scalable XPath query and update processing under both the open and the agreement cooperation models between nodes. Since router nodes tend to be memory constrained, we facilitate a space/performance tradeoff by permitting aggregated routing states, and we develop algorithms for generating and using such aggregated information. We experimentally demonstrate the scalability of our approach, and the performance of our query and update protocols, using a detailed simulation model and varying key design parameters.

  • A frequency-based approach for mining coverage statistics in data integration

    Page(s): 387 - 398

    Query optimization in data integration requires source coverage and overlap statistics. Gathering and storing the required statistics presents many challenges, not the least of which is controlling the amount of statistics learned. We introduce StatMiner, a novel statistics mining approach which automatically generates attribute value hierarchies, efficiently discovers frequently accessed query classes based on the learned attribute value hierarchies, and learns statistics only with respect to these classes. We describe the details of our method and present experimental results demonstrating the efficiency and effectiveness of our approach. Our experiments are done in the context of BibFinder, a publicly fielded bibliography mediator.

  • Publish/subscribe in NonStop SQL: transactional streams in a relational context

    Page(s): 821 - 824

    Relational queries on continuous streams of data are the subject of many recent database research projects. In 1998 a small group of people started a similar project with the goal of transforming our product, NonStop SQL/MX, into an active RDBMS. This project tried to integrate the functionality of transactional queuing systems with relational tables and with SQL, using simple extensions to the SQL syntax and guaranteeing clearly defined query and transactional semantics. The result is the first commercially available RDBMS that incorporates streams. All data flowing through the system is contained in relational tables and is protected by ACID transactions. Insert and update operations on any NonStop SQL table can be considered publishing of data and can therefore be transparent to the (legacy) applications performing them. Unlike triggers, the publish operation does not increase the path length of the application, and it allows the subscriber to execute in a separate transaction. Subscribers, using an extended SQL syntax, see a continuous stream of data consisting of all rows originally in the table plus all rows that are inserted or updated thereafter. The system scales by using partitioned tables and therefore partitioned streams.

  • Improving hash join performance through prefetching

    Page(s): 116 - 127

    Hash join algorithms suffer from extensive CPU cache stalls. We show that the standard hash join algorithm for disk-oriented databases (i.e., GRACE) spends over 73% of its user time stalled on CPU cache misses, and we explore the use of prefetching to improve its cache performance. Applying prefetching to hash joins is complicated by the data dependencies, multiple code paths, and inherent randomness of hashing. We present two techniques, group prefetching and software-pipelined prefetching, that overcome these complications. These schemes achieve 2.0-2.9X speedups for the join phase and 1.4-2.6X speedups for the partition phase over GRACE and simple prefetching approaches. Compared with previous cache-aware approaches (i.e., cache partitioning), the schemes are at least 50% faster on large relations and do not require exclusive use of the CPU cache to be effective.
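    Illustration (not from the paper): a structural Python sketch of group prefetching in the probe phase of a hash join. Python has no cache-prefetch instructions, so prefetch() below is only a stand-in for the intrinsic a real implementation would issue (e.g., __builtin_prefetch in C); the point is the restructured, group-at-a-time loop that lets the cache misses of several probe tuples overlap instead of stalling one tuple at a time.

    from collections import namedtuple, defaultdict

    Row = namedtuple("Row", ["key", "payload"])

    def prefetch(obj):
        # placeholder: marks where a hardware prefetch would be issued
        pass

    def build(rows):
        table = defaultdict(list)
        for r in rows:
            table[hash(r.key)].append(r)
        return table

    def probe_group_prefetch(probe_rows, table, group_size=8):
        out = []
        for start in range(0, len(probe_rows), group_size):
            group = probe_rows[start:start + group_size]
            # stage 1: hash all keys in the group and prefetch their buckets
            codes = [hash(r.key) for r in group]
            for c in codes:
                prefetch(table.get(c))
            # stage 2: walk the (now hopefully cached) buckets and emit matches
            for r, c in zip(group, codes):
                out.extend((r, m) for m in table.get(c, []) if m.key == r.key)
        return out

    # usage: 40 probe rows, each matching exactly one build row
    build_rows = [Row(k, "b%d" % k) for k in range(100)]
    probe_rows = [Row(k % 10, "p%d" % k) for k in range(40)]
    print(len(probe_group_prefetch(probe_rows, build(build_rows))))  # 40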

  • A type-safe object-oriented solution for the dynamic construction of queries

    Many object-oriented applications use large numbers of structurally different database queries. With current technology, writing applications that generate queries at runtime is difficult and error-prone. FROQUE, a framework for object-oriented queries, provides a secure and purely object-oriented solution for accessing relational databases. As such, it is easy for object-oriented programmers to use, and, with the help of object-oriented compilers, it guarantees that queries formulated in the object-oriented world at execution time result in correct SQL queries. Thus, FROQUE is an improvement over existing database frameworks such as Apache OJB, the object-relational bridge, which are not strongly typed and can lead to runtime errors.

  • Multi-scale histograms for answering queries over time series data

    Similarity-based time series data retrieval has been used in many real-world applications, such as stock data or weather data analysis. Two types of queries on time series data are generally studied: pattern existence queries and exact match queries. Here, we describe a technique to answer both pattern existence queries and exact match queries. A typical application that needs answers to both queries is an interactive analysis of time series data. We propose a histogram-based representation to approximate time series data.
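    Illustration (not from the paper): a simple Python sketch of a multi-scale histogram summary of a time series, with histograms computed over the full series, its halves, its quarters, and so on, and compared with an L1 distance. The split scheme, bin settings, and distance function are assumptions made for illustration.

    import random

    def histogram(values, bins, lo, hi):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in values:
            idx = max(0, min(int((v - lo) / width), bins - 1))   # clamp the edges
            counts[idx] += 1
        return counts

    def multiscale_histograms(series, levels=3, bins=8, lo=0.0, hi=1.0):
        """Return {level: list of per-segment histograms}; level k splits the series into 2**k segments."""
        n = len(series)
        result = {}
        for level in range(levels):
            parts = 2 ** level
            segments = [series[round(i * n / parts):round((i + 1) * n / parts)]
                        for i in range(parts)]
            result[level] = [histogram(seg, bins, lo, hi) for seg in segments]
        return result

    def histogram_distance(h1, h2):
        """Sum of L1 distances over corresponding segment histograms at every level."""
        return sum(
            sum(abs(a - b) for a, b in zip(seg1, seg2))
            for level in h1
            for seg1, seg2 in zip(h1[level], h2[level])
        )

    # usage: coarse levels cheaply filter for pattern existence,
    # finer levels tighten the comparison for closer matches
    q = multiscale_histograms([random.random() for _ in range(256)])
    c = multiscale_histograms([random.random() for _ in range(256)])
    print(histogram_distance(q, c))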

  • FLYINGDOC: an architecture for distributed, user-friendly, and personalized information systems

    The need for personal information management using distributed, user-friendly, and personalized document management systems is obvious. State-of-the-art document management systems such as digital libraries provide support for the whole document lifecycle. To enhance such document management systems into a personalized, distributed, and user-friendly information system, we present techniques for the simple import of collections, documents, and data, for generic and concrete data modeling, for replication, and for personalization. These techniques were employed in the implementation of a personal conference assistant, which was used for the first time at the VLDB conference 2003 in Berlin, Germany. Our client-server architecture provides an information server with different services and different kinds of clients. These services comprise a distribution and replication service, a collection integration service, a data management unit, and a query processing service.

  • Priority mechanisms for OLTP and transactional Web applications

    Page(s): 535 - 546

    Transactional workloads are a hallmark of modern OLTP and Web applications, ranging from electronic commerce and banking to online shopping. Often, the database at the core of these applications is the performance bottleneck. Given the limited resources available to the database, transaction execution times can vary wildly as transactions compete and wait for critical resources. As the competitor is "only a click away", valuable (high-priority) users must be ensured consistently good performance via QoS and transaction prioritization. This paper analyzes and proposes prioritization for transactional workloads in traditional database systems (DBMS). The work first performs a detailed bottleneck analysis of resource usage by transactional workloads on commercial and noncommercial DBMS (IBM DB2, PostgreSQL, Shore) under a range of configurations. Second, it implements and evaluates the performance of several preemptive and nonpreemptive DBMS prioritization policies in PostgreSQL and Shore. The primary contributions of this work are (i) an understanding of the bottleneck resources in transactional DBMS workloads and (ii) a demonstration that prioritization in traditional DBMS can provide a 2x-5x improvement for high-priority transactions using simple scheduling policies, without expense to low-priority transactions.

  • Hash-merge join: a non-blocking join algorithm for producing fast and early join results

    Page(s): 251 - 262

    We introduce the hash-merge join algorithm (HMJ, for short), a new nonblocking join algorithm that deals with data items arriving from remote sources over unpredictable, slow, or bursty network traffic. The HMJ algorithm is designed with two goals in mind: (1) minimize the time to produce the first few results, and (2) produce join results even if the two sources of the join operator occasionally get blocked. The HMJ algorithm has two phases: the hashing phase and the merging phase. The hashing phase employs an in-memory hash-based join algorithm that produces join results as quickly as data arrives. The merging phase is responsible for producing join results when the two sources are blocked. The two phases are connected via a flushing policy that flushes in-memory parts to disk storage once memory is exhausted. Experimental results show that HMJ combines the advantages of two state-of-the-art nonblocking join algorithms (XJoin and Progressive Merge Join) while avoiding their shortcomings.
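    Illustration (not the authors' implementation): a Python sketch of the in-memory, symmetric-hash style idea behind a hashing phase that emits join results as soon as matching tuples arrive from either source. The merging phase and the flushing policy described in the abstract are omitted, and all names here are assumptions.

    from collections import defaultdict

    class SymmetricHashJoin:
        def __init__(self):
            self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

        def insert(self, side, key, tuple_):
            """Insert an arriving tuple from `side` ('left' or 'right') and yield
            every join result it forms with already-seen tuples of the other side."""
            self.tables[side][key].append(tuple_)
            other = "right" if side == "left" else "left"
            for match in self.tables[other].get(key, []):
                yield (tuple_, match) if side == "left" else (match, tuple_)

    # usage: results appear as soon as both sides have delivered a matching key,
    # even if one source stalls afterwards
    join = SymmetricHashJoin()
    print(list(join.insert("left", 1, ("L", 1))))   # [] -- nothing to match yet
    print(list(join.insert("right", 1, ("R", 1))))  # [(('L', 1), ('R', 1))]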
