
Proceedings of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA 2003)

Date: 26-28 March 2003

  • Proceedings Eighth International Conference on Database Systems for Advanced Applications

  • A survey of new directions in database systems

    Page(s): 3

    Summary form only given, as follows. As database system research evolves, there are several enduring themes. One, of course, is how we deal with the largest possible amounts of data. A less obvious theme is optimization: it is an essential ingredient of all modern forms of database system. Because we deal with large volumes of data, we are often forced to process that data in regular ways. But when operations are uniform, there is an opportunity for the use of very-high-level languages, of which SQL is the primary example. However, to make a very-high-level language effective, we need to optimize it well, that is, produce effective query plans from all sorts of queries. In this talk, we shall review the principal directions in which modern database research is going, and in each case talk a bit about the optimization problems. Stream management systems are one very important new area. Another is peer-to-peer database systems. Integration of heterogeneous information, especially in virtual databases, is also a major challenge. XML, XQUERY, and semistructured data in general form yet another research opportunity.

  • Similarity join for low- and high-dimensional data

    Page(s): 7 - 16

    The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focussed on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins, suggest that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and EGO*-join, and study their performance in comparison to the state-of-the-art EGO-join algorithm and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending upon the dimensionality of the data as well as the critical ε parameter. We also point out the significance of the choice of this parameter for ensuring that the selectivity achieved is reasonable.
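
    The ε-join the abstract refers to can be pictured with a tiny grid-based join: hash one point set into cells of side ε and compare each point only against the neighbouring cells. This is a minimal illustrative sketch in Python, not the paper's Grid-join or EGO*-join; the function name and structure are hypothetical.

    from collections import defaultdict
    from itertools import product
    import math

    def grid_epsilon_join(points_a, points_b, eps):
        """Return pairs (i, j) with dist(points_a[i], points_b[j]) <= eps."""
        dim = len(points_a[0])
        cell = lambda p: tuple(int(math.floor(c / eps)) for c in p)

        # Hash the second data set into grid cells of side eps.
        grid = defaultdict(list)
        for j, q in enumerate(points_b):
            grid[cell(q)].append(j)

        result = []
        offsets = list(product((-1, 0, 1), repeat=dim))
        for i, p in enumerate(points_a):
            base = cell(p)
            for off in offsets:  # probe the 3^d neighbouring cells
                for j in grid.get(tuple(b + o for b, o in zip(base, off)), []):
                    q = points_b[j]
                    if sum((x - y) ** 2 for x, y in zip(p, q)) <= eps * eps:
                        result.append((i, j))
        return result

    Because the cell side equals ε, any qualifying pair must fall into the same or adjacent cells, which is the pruning idea behind grid-based ε-joins; the abstract's point about choosing ε carefully corresponds to the trade-off between cell occupancy and the number of cells probed.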

  • Spatial query processing for high resolutions

    Page(s): 17 - 26

    Modern database applications, including computer-aided design (CAD), medical imaging, and molecular biology, impose new requirements on spatial query processing. Particular problems arise from the need for high resolutions for very large spatial objects, including cars, space stations, planes and industrial plants, and from the design goal of using general-purpose database management systems in order to guarantee industrial strength. In the past two decades, various stand-alone spatial index structures have been proposed, but their integration into fully-fledged database systems is problematic. Most of these approaches are based on the decomposition of spatial objects, leading to replicating index structures. In contrast to common black-and-white decompositions, which suffer from the lack of intermediate solutions, we introduce grey approximations as a new and general concept. We demonstrate the benefits of grey approximations in the context of encoding spatial objects by space-filling curves, resulting in grey interval sequences. Spatial intersection queries are then processed by a filter-and-refine architecture which, as an important design goal, can be expressed purely by means of the SQL:1999 standard. Our new High Resolution Indexing (HRI) method can easily be integrated into general-purpose DBMSs. The experimental evaluation on real-world test data from car and plane design projects shows that our new concept outperforms competitive techniques implementable on top of a standard object-relational DBMS by an order of magnitude with respect to secondary storage space and overall query response time.

  • Effective similarity search on voxelized CAD objects

    Page(s): 27 - 36

    Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging and many others. Especially for CAD applications, suitable similarity models and a clear representation of the results can help to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts. In this paper, we adapt two known similarity models to voxelized 3-D CAD data and introduce a new model based on eigenvectors. The experimental evaluation of our three similarity models is based on two real-world test datasets. Furthermore, we introduce hierarchical clustering as a new and effective way to analyse and compare similarity models. We show that both our similarity models and our evaluation procedure are suitable for industrial use.
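
    To make the eigenvector idea concrete, one plausible (hypothetical, not necessarily the paper's) descriptor is the eigen-decomposition of the covariance of an object's occupied voxel coordinates, compared by Euclidean distance. A minimal sketch, assuming each object is given as an (N, 3) array of voxel coordinates:

    import numpy as np

    def eigen_descriptor(voxels):
        """Sorted, normalised eigenvalues of the voxel-coordinate covariance
        matrix: a rotation-invariant summary of the object's principal extents."""
        coords = np.asarray(voxels, dtype=float)
        cov = np.cov(coords, rowvar=False)            # 3x3 covariance of x, y, z
        eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
        return eigvals / eigvals.sum()                # scale-normalise the descriptor

    def similarity_distance(obj_a, obj_b):
        """Smaller distance = more similar principal shape extents."""
        return float(np.linalg.norm(eigen_descriptor(obj_a) - eigen_descriptor(obj_b)))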

  • Discovering direct and indirect matches for schema elements

    Page(s): 39 - 46

    Automating schema matching is challenging. Previous approaches to automating schema matching focus on computing direct element matches between two schemas. Schemas, however, rarely match directly. Thus, to complete the task of schema matching, we must also compute indirect element matches. In this paper we present a framework for generating direct as well as many indirect element matches between a source schema and a target schema. Recognizing expected data values associated with schema elements and applying schema-structure heuristics are the key ideas for computing indirect matches. Experiments we have conducted over several real-world application domains show encouraging results, yielding over 90% precision and recall for both direct and indirect element matches.

  • Gangam: a transformation modeling framework

    Page(s): 47 - 54

    Integration of multiple heterogeneous data sources continues to be a critical problem for many application domains and a challenge for researchers world-wide. One aspect of integration is the translation of schema and data across data model boundaries. Researchers in the past have looked at both customized algorithmic approaches and generic meta-modeling approaches as viable solutions. We now take the meta-modeling approach the next step forward. In this paper we propose a flexible, extensible and re-usable transformation modeling framework which allows users to (1) model their transformations; (2) choose from a set of possible execution strategies to translate the underlying schema and data; and (3) access and re-use a library of transformation generators. In this paper we present the core of our modeling framework: a set of cross algebra operators that covers the class of linear transformations, and two different techniques for composing these operators into larger transformation expressions. We also present an evaluation strategy to execute the modeled transformation, and thereby transform the input schema and data into the target schema and data, assuming that data model wrappers are provided for each data model. The proposed framework has been implemented, and we give an overview of this prototype system.

  • Securing your data in agent-based P2P systems

    Page(s): 55 - 62

    Peer-to-peer (P2P) technology can be naturally integrated with mobile agent technology in Internet applications, taking advantage of the autonomy, mobility, and efficiency of mobile agents in accessing and processing data. We address the problem of protecting critical information in agent-based P2P Internet applications under two different scenarios. First, we assume the route of a mobile agent in the P2P system is fixed. Under this assumption, we propose the usage of an efficient parallel dispatch model where the agent's route is signcrypted at the first step and dispatched to each new peer to collect information. Then, we assume the route is not specified and we propose the usage of a modified multi-signcryption scheme to guarantee protection. Based on this second approach, a mobile agent determines the next peer to communicate with independently and information is collected dynamically in one round of visiting a group of peers. Security issues under the two proposed models are then discussed.

  • Ascending frequency ordered prefix-tree: efficient mining of frequent patterns

    Page(s): 65 - 72

    Mining frequent patterns is a fundamental and important problem in many data mining applications. Many of the algorithms adopt the pattern growth approach, which has been shown to be significantly superior to the candidate generate-and-test approach. We identify the key factors that influence the performance of the pattern growth approach, and optimize them to further improve performance. Our algorithm uses a simple yet compact data structure, the ascending frequency ordered prefix-tree (AFOPT), to organize the conditional databases, in which we use arrays to store single branches to further save space. We traverse our prefix-tree structure using a top-down strategy. Our experimental results show that the combination of the top-down traversal strategy and the ascending frequency item ordering method achieves significant performance improvement over previous works.
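
    The distinguishing choice in AFOPT is the item ordering: transactions are inserted into the prefix tree with frequent items sorted by ascending global frequency. A minimal Python sketch of that construction step (the paper's full algorithm additionally stores single branches as arrays and mines the tree top-down; class and function names here are hypothetical):

    from collections import Counter

    class Node:
        __slots__ = ("item", "count", "children")
        def __init__(self, item):
            self.item, self.count, self.children = item, 0, {}

    def build_afopt(transactions, min_support):
        """Build a prefix tree whose paths list frequent items in ascending
        frequency order (the AFOPT item ordering)."""
        freq = Counter(item for t in transactions for item in set(t))
        frequent = {i: c for i, c in freq.items() if c >= min_support}

        root = Node(None)
        for t in transactions:
            # keep frequent items only, sorted by ascending global frequency
            items = sorted((i for i in set(t) if i in frequent),
                           key=lambda i: (frequent[i], i))
            node = root
            for item in items:
                node = node.children.setdefault(item, Node(item))
                node.count += 1
        return root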

  • An efficient sliding window algorithm for detection of sequential patterns

    Page(s): 73 - 80

    Recently, a growing number of applications monitor the physical world by tracking sensor data and detecting values, trends or patterns of interest. We focus on the problem of detecting sequential patterns with complex predicates over sensor data, and present an algorithm that efficiently pre-computes, at query compile time, which pattern predicate checks can be skipped, so that the processing window can slide with only the necessary checks actually being performed against the sensor data at run time. Implementation and evaluation of the proposed approach confirm its efficiency when compared to previously proposed approaches.
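
    For orientation, a naive sliding-window detector for a sequential pattern of predicates looks like the sketch below; it re-checks every predicate on every slide, which is exactly the cost the paper's compile-time analysis avoids by pre-computing which checks can be skipped. All names are hypothetical.

    from collections import deque

    def detect_pattern(stream, predicates, window_size):
        """Yield the window contents each time the predicates are matched,
        in order, by some subsequence of the current window."""
        window = deque(maxlen=window_size)
        for reading in stream:
            window.append(reading)
            k = 0
            for value in window:              # greedy in-order predicate matching
                if k < len(predicates) and predicates[k](value):
                    k += 1
            if k == len(predicates):
                yield list(window)

    # Example: a value above 30 followed by a value below 10 within 5 readings.
    pattern = [lambda v: v > 30, lambda v: v < 10]
    for hit in detect_pattern([25, 31, 28, 9, 12], pattern, window_size=5):
        print(hit)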

  • Caucus-based transaction clustering

    Page(s): 81 - 88

    Transaction clustering has received attention in recent developments of data mining. Traditional clustering methods are not well suited to this problem: transaction data sets differ from traditional data sets in their high dimensionality, sparsity and numerous outliers. We introduce a new, efficient algorithm for transaction clustering. The proposed algorithm is based on a caucus, a set of fine-partitioned demographic groups based on the purchase features of customers. Given the important role the caucus plays, we also present a heuristic method of caucus generation using entropy. Experiments on real and synthetic data sets show that our approach achieves better results than existing methods.

  • TAX-PQ: dynamic taxonomy probing and query modification for topic-focused Web search

    Page(s): 91 - 100

    We propose a novel Web search scheme, TAX-PQ. TAX-PQ enables taxonomy-based, topic-focused Web search on ordinary Boolean Web search interfaces. TAX-PQ utilizes a taxonomy and the data set maintained in an existing taxonomy-based search facility for this purpose. The search is initiated by designating an initial query and a context category in the taxonomy. The data set in the taxonomy-based search facility is probed with a technique combining the initial query with sampling, and a decision tree is constructed from the sampled query result. A query modifier is then derived from the decision tree to focus the initial query on the selected context category. To adapt TAX-PQ to different user requirements on search result effectiveness and to the properties of target Web search interfaces, we have developed a new decision tree construction algorithm. Experiments involving real Web sites show that TAX-PQ can significantly improve the Web search process and results. The results comply with user requirements under the constraints of the target Web search interfaces.

  • Finding a Web community by maximum flow algorithm with HITS score based capacity

    Page(s): 101 - 106

    We propose an edge capacity based on hub and authority scores, and examine the effects of using this edge capacity in the method for extracting Web communities with the maximum flow algorithm proposed by G. Flake et al. (2000). A Web community is a collection of Web pages in which a common (or related) topic is taken up. In recent years, various methods for finding Web communities have been proposed. G. Flake et al.'s method, which is based on the maximum flow algorithm, has a big advantage: "topic drift" does not easily occur. On the other hand, it sets the edge capacity to a fixed value for every edge, which is one of the major causes of failing to obtain a proper Web community. Our approach, which uses a HITS-score-based edge capacity, effectively extracts Web pages that remain well balanced in both global and local relations to the given seed node. We examined the effects through experiments on 20 randomly selected topics using Web archives of Japan crawled in 2002. The results confirmed that average precision rose by approximately 20%.
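
    A rough sketch of the overall pipeline, using networkx: compute HITS scores, derive edge capacities from them, attach a virtual source to the seed pages and a virtual sink to every page, and take the min-cut. The capacity formula (hub of the source node plus authority of the target node) and the sink construction are assumptions for illustration, not the paper's exact definitions.

    import networkx as nx

    def extract_community(G, seeds, alpha=1.0):
        """Max-flow/min-cut community extraction in the spirit of Flake et al.,
        with a HITS-based edge capacity (one plausible weighting)."""
        hubs, auths = nx.hits(G, max_iter=500)

        F = nx.DiGraph()
        for u, v in G.edges():
            F.add_edge(u, v, capacity=alpha * (hubs[u] + auths[v]))

        F.add_node("source")
        for s in seeds:                      # tie the seed pages to a virtual source
            F.add_edge("source", s)          # no capacity attribute = infinite capacity
        F.add_node("sink")
        for n in G.nodes():                  # every page leaks a little to a virtual sink
            F.add_edge(n, "sink", capacity=1.0 / G.number_of_nodes())

        _, (src_side, _) = nx.minimum_cut(F, "source", "sink")
        return src_side - {"source"}         # source-side pages form the community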

  • Scalable view expansion in a peer mediator system

    Page(s): 107 - 116

    To integrate many data sources we use a peer mediator framework where views defined in the peers are logically composed in terms of each other. A common approach to executing queries over mediators is to treat views in data sources as 'black boxes'. The mediators locally decompose queries into query fragments and submit them to the data sources for processing. Another approach, used in distributed DBMSs, is to treat the views as 'transparent boxes' by importing and fully expanding all views and merging them with the query. The black box approach often leads to inefficient query plans. However, in a peer mediator framework, full view expansion (VE) leads to prohibitively long query compilation times when many peers are involved. It also limits peer autonomy, since peers must reveal their view definitions. We investigate, in a peer mediator framework, the trade-offs between no, partial, and full VE in two different distributed view composition scenarios. We show that it is often favorable with respect to query execution, and sometimes even with respect to query compilation time, to expand those views having common hidden peer subviews. However, in other cases it is better to use the 'black box' approach, in particular when peer autonomy prohibits view importation. Based on this, a hybrid strategy for VE in peer mediators is proposed.

  • Mining emerging substrings

    Page(s): 119 - 126

    We introduce a new type of KDD pattern called emerging substrings. In a sequence database, an emerging substring (ES) of a data class is a substring which occurs more frequently in that class than in other classes. ESs are important to sequence classification as they capture significant contrasts between data classes and provide insights for the construction of sequence classifiers. We propose a suffix tree-based framework for mining ESs, and study the effectiveness of applying one or more pruning techniques in different stages of our ES mining algorithm. Experimental results show that if the target class is of a small population with respect to the whole database, which is the normal scenario in single-class ES mining, most of the pruning techniques achieve considerable performance gains.
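
    The definition can be illustrated with a brute-force sketch: enumerate substrings of the target class and keep those whose support in that class is large and whose growth rate over the other class exceeds a threshold. The paper mines these far more efficiently with a suffix tree and pruning; the thresholds and names below are hypothetical.

    def emerging_substrings(pos_seqs, neg_seqs, min_support=0.1, min_growth=2.0):
        """Brute-force illustration of the emerging-substring definition."""
        def support(sub, seqs):
            return sum(1 for s in seqs if sub in s) / len(seqs)

        candidates = {s[i:j] for s in pos_seqs
                      for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

        result = {}
        for sub in candidates:
            sp, sn = support(sub, pos_seqs), support(sub, neg_seqs)
            growth = sp / sn if sn > 0 else float("inf")
            if sp >= min_support and growth >= min_growth:
                result[sub] = growth
        return result

    print(emerging_substrings(["abcd", "abce"], ["xbcy", "zzzz"]))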

  • Fast text classification: a training-corpus pruning based approach

    Page(s): 127 - 136

    With the rapid growth of on-line information available, text classification is becoming more and more important. kNN is a widely used text classification method with high performance. However, this method is inefficient because it requires a large amount of computation to evaluate the similarity between a test document and each training document. In this paper, we propose a fast kNN text classification approach based on pruning the training corpus. With this approach, the size of the training corpus can be reduced sharply, so that the time spent on kNN search can be cut significantly and classification efficiency improved substantially, while classification performance remains comparable to that without pruning. An effective algorithm for text-corpus pruning is designed. Experiments over the Reuters corpus validate the practicability of the proposed approach. Our approach is especially suitable for on-line text classification applications.
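
    A minimal sketch of the idea, assuming documents are L2-normalised tf-idf vectors: classify with cosine kNN, and shrink the training corpus with a simple condensing heuristic (drop a document whose k nearest neighbours already share its label). The pruning criterion here is an illustrative stand-in, not the paper's exact algorithm.

    import numpy as np

    def cosine_knn_predict(query, docs, labels, k=5):
        """docs: (n, d) L2-normalised matrix; query: (d,) normalised vector."""
        sims = docs @ query
        top = np.argsort(sims)[-k:]
        vals, counts = np.unique(labels[top], return_counts=True)
        return vals[np.argmax(counts)]

    def prune_corpus(docs, labels, k=5):
        """Drop training documents that are redundant for kNN decisions."""
        keep = []
        for i in range(len(docs)):
            sims = docs @ docs[i]
            sims[i] = -1.0                   # exclude the document itself
            top = np.argsort(sims)[-k:]
            if not np.all(labels[top] == labels[i]):
                keep.append(i)               # keep only documents near class boundaries
        return docs[keep], labels[keep]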

  • Efficient record linkage in large data sets

    Page(s): 137 - 146

    This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over the individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing, which usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of the records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.
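
    The two-step structure (embed attribute values into Euclidean space, then run a multidimensional similarity join) can be sketched as below. The paper uses FastMap for the embedding; hashed q-gram counts are used here only as a simple stand-in, and all names and thresholds are hypothetical.

    import numpy as np
    from scipy.spatial import cKDTree

    def qgram_embed(strings, q=2, dims=64):
        """Map strings to vectors via hashed q-gram counts (FastMap stand-in)."""
        vecs = np.zeros((len(strings), dims))
        for i, s in enumerate(strings):
            for j in range(len(s) - q + 1):
                vecs[i, hash(s[j:j + q]) % dims] += 1.0
        return vecs

    def similarity_join(strings_a, strings_b, eps=2.0):
        """Candidate record pairs whose embedded distance is at most eps."""
        va, vb = qgram_embed(strings_a), qgram_embed(strings_b)
        tree = cKDTree(vb)
        return [(i, j) for i, js in enumerate(tree.query_ball_point(va, r=eps))
                for j in js]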

  • Maintenance of partial-sum-based histograms

    Page(s): 149 - 156

    This paper introduces an efficient method for the maintenance of wavelet-based histograms built on partial sums. Wavelet-based histograms can be constructed from either raw data distributions or partial sums; the two construction methods have their own merits. Previous work has focused only on the maintenance of raw-data-based histograms. However, it is highly inefficient to apply their techniques directly to partial-sum-based histograms, because a single data update would trigger changes to multiple partial sums, which, in turn, would trigger a large amount of computation on the changes of the wavelet decomposition. We present a novel technique to compute the effects of data updates on the wavelet decomposition of partial sums. Moreover, we point out some special features of the wavelet decomposition of partial sums and adapt a probabilistic counting technique for the maintenance of partial-sum-based histograms. Experimental results show that our maintenance method is efficient and that its accuracy is robust to changing data distributions.
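
    The maintenance problem is easy to see with a small sketch: a partial-sum (prefix-sum) vector is decomposed with the Haar wavelet, and a single frequency update changes every partial sum to its right, so naively re-running the decomposition after each update is expensive. The sketch below only illustrates the setup; it is not the paper's maintenance technique.

    import numpy as np

    def haar_decompose(values):
        """One full Haar wavelet decomposition (length must be a power of two)."""
        coeffs, a = [], np.asarray(values, dtype=float)
        while len(a) > 1:
            avg  = (a[0::2] + a[1::2]) / 2.0
            diff = (a[0::2] - a[1::2]) / 2.0
            coeffs.append(diff)
            a = avg
        coeffs.append(a)
        return coeffs[::-1]      # overall average first, finest details last

    freqs = np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=float)
    partial_sums = np.cumsum(freqs)

    # Updating freqs[2] changes partial_sums[2:], i.e. most of the partial sums,
    # which is why incremental maintenance of the wavelet coefficients matters.
    print(haar_decompose(partial_sums))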

  • Selectivity estimation using orthogonal series

    Page(s): 157 - 164

    Selectivity estimation is an integral part of query optimization. In this paper, we propose a novel approach to approximate the data density functions of relations and use them to estimate selectivities. A data density function here is approximated by a partial sum of an orthogonal series. Such approximate density functions can be derived easily, stored efficiently, and maintained dynamically. Experimental results show that our approach yields comparable or better estimation accuracy than the Wavelet and DCT methods, especially in high-dimensional spaces.
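
    A one-dimensional illustration of the idea with a cosine (orthogonal) series on [0, 1]: estimate the first few series coefficients from the data, then integrate the truncated series over a query range to get a selectivity estimate. The basis choice and truncation level are assumptions for illustration, not necessarily those used in the paper.

    import numpy as np

    def cosine_series_coeffs(sample, m=16):
        """a_0 = 1 and a_k = E[sqrt(2) * cos(k*pi*X)] for data rescaled to [0, 1]."""
        x = np.asarray(sample, dtype=float)
        return np.array([1.0] + [np.mean(np.sqrt(2) * np.cos(k * np.pi * x))
                                 for k in range(1, m + 1)])

    def range_selectivity(coeffs, lo, hi):
        """Integral of the truncated series density over [lo, hi]."""
        sel = coeffs[0] * (hi - lo)
        for k in range(1, len(coeffs)):
            sel += coeffs[k] * np.sqrt(2) * (np.sin(k * np.pi * hi) -
                                             np.sin(k * np.pi * lo)) / (k * np.pi)
        return float(sel)

    data = np.random.beta(2, 5, size=10_000)     # stand-in attribute, already on [0, 1]
    coeffs = cosine_series_coeffs(data)
    print(range_selectivity(coeffs, 0.0, 0.25), np.mean(data <= 0.25))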

  • Error minimization for approximate computation of range aggregates

    Page(s): 165 - 172

    Histogram techniques are widely used in commercial database management systems for estimating query results. Recently, they have also been used in approximately processing database queries, especially aggregation queries. Existing research in this area has mainly focused on constructing a histogram to represent the original data frequencies as accurately as possible on an intuitive basis. We propose a novel histogram construction method aiming to minimize the average approximate aggregation error, and we have developed an efficient algorithm to construct near-optimal histograms to achieve this goal. Our experimental results show that the new histogram construction techniques lead to more accurate results than existing histogram techniques, and also outperform existing wavelet techniques.

  • Q+Rtree: efficient indexing for moving object databases

    Page(s): 175 - 182

    Moving object environments contain large numbers of queries and continuously moving objects. Traditional spatial index structures do not work well in this environment because of the need to frequently update the index, which results in very poor performance. In this paper, we present a novel indexing structure, the Q+Rtree, based on the observations that: i) most moving objects are in a quasi-static state most of the time, and ii) the moving patterns of objects are strongly related to the topography of the space. The Q+Rtree is a hybrid tree structure which consists of both an R*tree and a Quadtree. The R*tree component indexes quasi-static objects, i.e., those that are currently moving slowly and are often crowded together in buildings or houses. The Quadtree component indexes fast-moving objects, which are dispersed over wider regions. We also present an experimental evaluation of our approach.

  • Efficient index update for moving objects with future trajectories

    Page(s): 183 - 191

    Recently, more research has been conducted on moving object databases (MOD). Typically, there are three kinds of data for dynamic attributes in MOD: historical, current and future. Although many index structures have been developed for the first two types of data, there is not much work dealing with future data. In particular, the problem of index update has not been addressed with efficient methods. This paper proposes a novel spatio-temporal index based on the PMR quadtree, called the Future Trajectory Quadtree (FT-Quadtree). The FT-Quadtree adopts a shared trajectory-segment structure and provides an efficient update algorithm. Performance studies show that the FT-Quadtree is superior to the traditional one in index maintenance.

  • Prefetching for visual data exploration

    Page(s): 195 - 202

    Modern computer applications, from business decision support to scientific data analysis, utilize data visualization tools to support exploratory activities. Visual exploration tools typically do not scale well when applied to huge data sets, partially because being interactive necessitates real-time responses. However, we observe that interactive visual explorations exhibit several properties that can be exploited for data access optimization, including locality of exploration, contiguous queries, and significant delays between user operations. We thus apply semantic caching of active query sets on the client side to exploit some of the above characteristics. We also introduce several prefetching strategies, each exploiting characteristics of our visual exploration environment. We have incorporated caching and prefetching strategies into XmdvTool, a public-domain tool for visual exploration of multivariate data sets. Experimental studies using synthetic as well as real user traces are conducted. Our results demonstrate that these proposed optimization techniques achieve significant performance improvements in our exploratory analysis system.

  • Freshness-driven adaptive caching for dynamic content

    Page(s): 203 - 212

    With the wide availability of content delivery networks, many e-commerce Web applications utilize edge cache servers to cache and deliver dynamic content at locations much closer to users, avoiding network latency. By caching a large number of dynamic content pages in the edge cache servers, response time can be reduced, benefiting from higher cache hit rates. However, this is achieved at the expense of a higher invalidation cost. On the other hand, a higher invalidation cost leads to a longer invalidation cycle (the time to perform the invalidation check on the pages in caches), at the expense of the freshness of cached dynamic content. In this paper we propose a freshness-driven adaptive dynamic content caching technique, which monitors response time and invalidation cycle length and dynamically adjusts caching policies. We have implemented the proposed technique within NEC's CachePortal Web acceleration solution. The experimental results show that the proposed technique consistently maintains the best content freshness to users. They also show that even a Web site with dynamic content caching enabled can further benefit from deployment of our solution, improving its content freshness by up to 10 times, especially during heavy traffic.

  • Time-stratified sampling for approximate answers to aggregate queries

    Page(s): 215 - 222

    In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on samples. However, uniformly extracted samples often do not guarantee acceptable accuracy in grouping interval estimations. This is crucial in most less-aggregated analyses, which are mostly based on recent data (e.g. forecasting, performance analysis). We propose the use of time-interval stratified samples (TISS), a simple sampling strategy that biases towards recency. This improves accuracy in important less-aggregated analyses without significantly deteriorating aggregated analyses on older data. TISS obtains much better accuracy than either uniform samples or the recently proposed congressional samples (CS) for queries analyzing recent data, and can be coupled with CS to provide minimal representation guarantees (TISS-CS). We discuss the TISS design, the loading process and the query processing middle layer. We show that TISS is very easily integrated in a data warehouse and works transparently. TISS is evaluated experimentally in a TPC-H setup.
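
    The recency bias can be illustrated with a small stratified sampler: group rows by time stratum, give the most recent strata a larger share of the sample budget, and remember each stratum's scale-up factor so aggregates can still be estimated. The geometric allocation below is an illustrative assumption, not the paper's TISS allocation.

    import random
    from collections import defaultdict

    def time_stratified_sample(rows, timestamp, total_size, decay=0.5):
        """Sample biased towards recency: stratum j (0 = most recent) gets a
        share of the budget proportional to decay**j."""
        strata = defaultdict(list)
        for r in rows:
            strata[timestamp(r)].append(r)

        ordered = sorted(strata, reverse=True)       # most recent stratum first
        weights = [decay ** j for j in range(len(ordered))]
        scale = total_size / sum(weights)

        sample = []
        for key, w in zip(ordered, weights):
            k = min(len(strata[key]), max(1, round(w * scale)))
            # keep the per-stratum scale-up factor for aggregate estimation
            sample.extend((r, len(strata[key]) / k)
                          for r in random.sample(strata[key], k))
        return sample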
