IEEE Transactions on Knowledge and Data Engineering

Issue 7 • July 2007

Contents (14 items)
  • [Front cover]

    Page(s): c1
  • [Inside front cover]

    Page(s): c2
  • The Concentration of Fractional Distances

    Page(s): 873 - 886

    Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the Euclidean distance. When data are high dimensional, however, the Euclidean distances seem to concentrate; all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the Euclidean distance has been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact of a finite sample. Furthermore, an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally admitted, fractional norms are not always less concentrated than the Euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular the assumption of independent and identically distributed variables. Finally, some insights about how to choose an optimal metric are given.
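
    The fractional norms discussed in this abstract are simple to compute. The sketch below (illustrative only, not from the paper) evaluates a Minkowski-like distance for an arbitrary exponent p and a crude concentration measure, the relative contrast of pairwise distances:

```python
import math

def minkowski_distance(x, y, p):
    """Minkowski-like distance with exponent p; p < 1 gives a fractional 'norm'."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def relative_contrast(points, p):
    """(max - min) / min over all pairwise distances: a common measure of
    concentration -- the smaller it is, the more the distances concentrate."""
    d = [minkowski_distance(x, y, p)
         for i, x in enumerate(points) for y in points[i + 1:]]
    return (max(d) - min(d)) / min(d)
```

    For example, `minkowski_distance([0, 0], [1, 1], 2)` is the Euclidean value sqrt(2), while the fractional exponent p = 0.5 gives (1 + 1)^2 = 4 for the same pair.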

  • Enhancing the Effectiveness of Clustering with Spectra Analysis

    Page(s): 887 - 902

    For many clustering algorithms, such as K-Means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters, that is, k, to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true for collections such as Web documents, images, or biological data. In an effort to improve the effectiveness of clustering, we seek the answer to a fundamental question: How can we effectively estimate the number of clusters in a given data set? We propose an efficient method based on spectra analysis of eigenvalues (not eigenvectors) of the data set as the solution to the above. We first present the relationship between a data set and its underlying spectra with theoretical and experimental results. We then show how our method is capable of suggesting a range of k that is well suited to different analysis contexts. Finally, we conclude with further empirical results to show how the answer to this fundamental question enhances the clustering process for large text collections.
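
    The paper's spectra-based estimate is more involved, but its flavor can be illustrated with the common eigengap heuristic: sort the eigenvalues of a similarity matrix in decreasing order and suggest k at the largest drop between consecutive values (the eigenvalues below are hypothetical):

```python
def estimate_k(eigenvalues):
    """Eigengap heuristic: suggest k as the position of the largest drop
    between consecutive eigenvalues, sorted in decreasing order."""
    ev = sorted(eigenvalues, reverse=True)
    gaps = [ev[i] - ev[i + 1] for i in range(len(ev) - 1)]
    return gaps.index(max(gaps)) + 1

# Three large eigenvalues followed by a sharp drop suggest k = 3.
print(estimate_k([10.2, 9.8, 9.1, 2.0, 1.5, 1.1]))  # -> 3
```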

  • Efficient Computation of Iceberg Cubes by Bounding Aggregate Functions

    Page(s): 903 - 918

    The iceberg cubing problem is to compute the multidimensional group-by partitions that satisfy given aggregation constraints. Pruning unproductive computation for iceberg cubing when nonantimonotone constraints are present is a great challenge because the aggregate functions do not increase or decrease monotonically along the subset relationship between partitions. In this paper, we propose a novel bound prune cubing (BP-Cubing) approach for iceberg cubing with nonantimonotone aggregation constraints. Given a cube over n dimensions, an aggregate for any group-by partition can be computed from aggregates for the most specific n-dimensional partitions (MSPs). The largest and smallest aggregate values computed this way become the bounds for all partitions in the cube. We provide efficient methods to compute tight bounds for base aggregate functions and, more interestingly, arithmetic expressions thereof, from bounds of aggregates over the MSPs. Our methods produce tighter bounds than those obtained by previous approaches. We present iceberg cubing algorithms that combine bounding with efficient aggregation strategies. Our experiments on real-world and artificial benchmark data sets demonstrate that BP-Cubing algorithms achieve more effective pruning and are several times faster than state-of-the-art iceberg cubing algorithms and that BP-Cubing achieves the best performance with the top-down cubing approach.
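
    A minimal illustration of the bounding idea, assuming AVG as the constrained aggregate (the paper's bounds are tighter and also cover arithmetic expressions of base functions): the average over any union of MSPs is a weighted average of the per-MSP averages, so it must lie between their minimum and maximum.

```python
def avg_bounds(msp_aggregates):
    """Bound AVG over any union of most-specific partitions (MSPs).
    Each MSP contributes a (sum, count) pair; the average of any union
    of MSPs lies between the smallest and largest per-MSP average."""
    averages = [s / c for s, c in msp_aggregates]
    return min(averages), max(averages)

def can_prune(msp_aggregates, threshold):
    """Prune a group-by partition when even the upper bound on AVG
    cannot satisfy the iceberg constraint AVG >= threshold."""
    _, upper = avg_bounds(msp_aggregates)
    return upper < threshold
```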

  • Efficient Approximate Query Processing in Peer-to-Peer Networks

    Page(s): 919 - 933

    Peer-to-peer (P2P) databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large-scale ad hoc analysis queries, for example, aggregation queries, on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement, given the distributed and dynamic nature of P2P databases. In this paper, we present novel sampling-based techniques for approximate answering of ad hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P environment is complicated by several factors: the data is distributed (usually in uneven quantities) across many peers; within each peer, the data is often highly correlated; and, moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach based on random walks of the P2P graph, as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solution.
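
    The random-walk ingredient can be sketched simply (this is not the paper's full two-phase algorithm; the graph and walk length below are illustrative):

```python
import random

def random_walk_sample(graph, start, walk_length, rng=random):
    """Pick a peer by walking the P2P topology at random.
    graph maps each peer to its list of neighbors."""
    node = start
    for _ in range(walk_length):
        node = rng.choice(graph[node])
    return node

def sample_peers(graph, start, n_samples, walk_length=10, seed=0):
    """Collect n_samples peers via independent random walks. Longer walks
    reduce the bias toward the starting peer's neighborhood."""
    rng = random.Random(seed)
    return [random_walk_sample(graph, start, walk_length, rng)
            for _ in range(n_samples)]
```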

  • SPEX: Streamed and Progressive Evaluation of XPath

    Page(s): 934 - 949

    Streams are preferable over data stored in memory in contexts where data is too large or volatile, or a standard approach to data processing based on storing is too time or space consuming. Emerging applications such as publish-subscribe systems, data monitoring in sensor networks, financial and traffic monitoring, and routing of MPEG-7 call for querying streams. In many such applications, XML streams are arguably more appropriate than flat streams, for they convey (possibly unbounded) unranked ordered trees with labeled nodes. However, the flexibility enabled by XML streams in data modeling makes query evaluation different from traditional settings and challenging. This paper describes SPEX, a streamed and progressive evaluation of XML Path Language (XPath). SPEX compiles queries into networks of simple and independent transducers and processes XML streams with polynomial combined complexity. This makes SPEX especially suitable for implementation on devices with low memory and simple logic as used, for example, in mobile computing.
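
    SPEX itself compiles queries into transducer networks; its single-pass, depth-bounded-memory flavor can be sketched for the child axis only (a far simpler, hypothetical illustration, with events and path chosen for the example):

```python
def match_path(events, path):
    """Evaluate a child-axis-only XPath (e.g. /books/book/title) over a
    stream of ('start', tag), ('text', data), ('end', tag) events in one
    pass, with memory proportional to document depth, not document size."""
    stack, results = [], []
    capturing = False
    for event, value in events:
        if event == 'start':
            stack.append(value)
            capturing = stack == path
        elif event == 'text' and capturing:
            results.append(value)
        elif event == 'end':
            stack.pop()
            capturing = stack == path
    return results

stream = [('start', 'books'), ('start', 'book'), ('start', 'title'),
          ('text', 'SPEX'), ('end', 'title'), ('end', 'book'),
          ('end', 'books')]
print(match_path(stream, ['books', 'book', 'title']))  # -> ['SPEX']
```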

  • Efficient Monitoring Algorithm for Fast News Alerts

    Page(s): 950 - 961

    Recently, there has been a dramatic increase in the use of XML data to deliver information over the Web. Personal Weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. As the popularity of personal Weblogs and RSS feeds grows rapidly, RSS aggregation services and blog search engines have appeared, which try to provide a central access point for simpler access and discovery of new content from a large number of diverse RSS sources. In this paper, we study how RSS aggregation services should monitor the data sources to retrieve new content quickly using minimal resources and to provide their subscribers with fast news alerts. We believe that the change characteristics of RSS sources and the general user access behavior pose distinct requirements that make this task significantly different from the traditional index refresh problem for Web search engines. Our studies on a collection of 10,000 RSS feeds reveal some general characteristics of the RSS feeds and show that, with proper resource allocation and scheduling, the RSS aggregator provides news alerts significantly faster than the best existing approach.
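
    A naive version of rate-aware resource allocation can be sketched as follows (the paper derives an optimal schedule; this only illustrates the idea of giving frequently updated feeds more of a fixed retrieval budget):

```python
def allocate_retrievals(posting_rates, budget):
    """Split a per-period retrieval budget across feeds in proportion to
    each feed's observed posting rate, with at least one retrieval per
    feed so no source is starved."""
    total = sum(posting_rates.values())
    return {feed: max(1, round(budget * rate / total))
            for feed, rate in posting_rates.items()}
```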

  • Top-k Monitoring in Wireless Sensor Networks

    Page(s): 962 - 976

    Top-k monitoring is important to many wireless sensor applications. This paper exploits the semantics of top-k query and proposes an energy-efficient monitoring approach called FILA. The basic idea is to install a filter at each sensor node to suppress unnecessary sensor updates. Filter setting and query reevaluation upon updates are two issues fundamental to the correctness and efficiency of the FILA approach. We develop a query reevaluation algorithm that is capable of handling concurrent sensor updates. In particular, we present optimization techniques to reduce the probing cost. We design a skewed filter setting scheme, which aims to balance energy consumption and prolong network lifetime. Moreover, two filter update strategies, namely, eager and lazy, are proposed to favor different application scenarios. We also extend the algorithms to several variants of top-k query, that is, order-insensitive, approximate, and value monitoring. The performance of the proposed FILA approach is extensively evaluated using real data traces. The results show that FILA substantially outperforms the existing TAG-based approach and range caching approach in terms of both network lifetime and energy consumption under various network configurations.
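
    The core filter mechanism can be sketched in a few lines (the bounds below are hypothetical; in FILA, the base station sets and updates them so that suppressed readings cannot change the top-k result):

```python
class SensorFilter:
    """Per-node filter window [low, high]: a sensed value is transmitted
    only when it leaves the window, suppressing updates that cannot
    affect the current top-k answer."""
    def __init__(self, low, high):
        self.low, self.high = low, high

    def should_report(self, value):
        return value < self.low or value > self.high

f = SensorFilter(20.0, 25.0)
print(f.should_report(22.3))  # inside the window, suppressed -> False
print(f.should_report(27.1))  # outside the window, must report -> True
```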

  • Discovering and Exploiting Causal Dependencies for Robust Mobile Context-Aware Recommenders

    Page(s): 977 - 992

    Acquisition of context poses unique challenges to mobile context-aware recommender systems. The limited resources in these systems make minimizing their context acquisition a practical need, and the uncertainty in the mobile environment makes missing and erroneous context inputs a major concern. In this paper, we propose an approach based on Bayesian networks (BNs) for building recommender systems that minimize context acquisition. Our learning approach iteratively trims the BN-based context model until it contains only the minimal set of context parameters that are important to a user. In addition, we show that a two-tiered context model can effectively capture the causal dependencies among context parameters, enabling a recommender system to compensate for missing and erroneous context inputs. We have validated our proposed techniques on a restaurant recommendation data set and a Web page recommendation data set. In both benchmark problems, the minimal sets of context can be reliably discovered for the specific users. Furthermore, the learned Bayesian network consistently outperforms the J4.8 decision tree in overcoming both missing and erroneous context inputs to generate significantly more accurate predictions.
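
    How causal dependencies let a system compensate for a missing input can be sketched with a single conditional probability table (the CPT and context names below are entirely hypothetical; the paper learns a full two-tiered Bayesian network):

```python
def impute_context(parent_value, cpt):
    """Fill in a missing context input from its parent in a causal model:
    pick the child value with the highest conditional probability given
    the observed parent value."""
    dist = cpt[parent_value]
    return max(dist, key=dist.get)

# Hypothetical CPT: P(cuisine_preference | time_of_day)
cpt = {'lunch':  {'fast_food': 0.7, 'fine_dining': 0.3},
       'dinner': {'fast_food': 0.2, 'fine_dining': 0.8}}
print(impute_context('dinner', cpt))  # -> fine_dining
```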

  • A Note on Linear Time Algorithms for Maximum Error Histograms

    Page(s): 993 - 997

    Histograms and wavelet synopses provide useful tools in query optimization and approximate query answering. Traditional histogram construction algorithms, e.g., V-Optimal, use error measures which are the sums of a suitable function, e.g., square, of the error at each point. Although the best-known algorithms for solving these problems run in quadratic time, a sequence of results has given us a linear-time approximation scheme for these problems. In recent years, there have been many emerging applications where we are interested in measuring the maximum (absolute or relative) error at a point. We show that this problem is fundamentally different from the other traditional non-ℓ∞ error measures and provide an optimal algorithm that runs in linear time for a small number of buckets. We also present results which work for arbitrary weighted maximum error measures.
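
    The paper's linear-time optimum is more subtle, but the greedy feasibility check that underlies maximum-absolute-error histogram algorithms can be sketched (an illustration, not the paper's algorithm):

```python
def buckets_needed(data, max_err):
    """Greedy left-to-right feasibility check for maximum-absolute-error
    histograms: represent each bucket by the midpoint of its min and max,
    so a bucket stays valid while (max - min) / 2 <= max_err.
    Returns the number of buckets the greedy scan uses."""
    count, lo, hi = 0, None, None
    for v in data:
        if lo is not None and (max(hi, v) - min(lo, v)) / 2 <= max_err:
            lo, hi = min(lo, v), max(hi, v)   # v fits in the open bucket
        else:
            count, lo, hi = count + 1, v, v   # start a new bucket at v
    return count

# With error budget 0.5, [1, 2, 9, 10] fits in 2 buckets: {1, 2} and {9, 10}.
print(buckets_needed([1, 2, 9, 10], 0.5))  # -> 2
```

    A histogram with B buckets and maximum error e exists exactly when `buckets_needed(data, e) <= B`, so the optimal error can be found by searching over candidate error values.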

  • Join the IEEE Computer Society!

  • TKDE Information for authors

    Page(s): c3
  • [Back cover]

    Page(s): c4

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University