Knowledge and Data Engineering, IEEE Transactions on

Issue 5 • Date May 2009

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • A Survey of Uncertain Data Algorithms and Applications

    Page(s): 609 - 623

    In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such data points are often represented in the form of a probabilistic function, since the corresponding deterministic value is not known. This increases the challenge of mining and managing uncertain data, since the precise behavior of the underlying data is no longer known. In this paper, we provide a survey of uncertain data mining and management applications. In the field of uncertain data management, we will examine traditional methods such as join processing, query processing, selectivity estimation, OLAP queries, and indexing. In the field of uncertain data mining, we will examine traditional mining problems such as classification and clustering. We will also examine a general transform-based technique for mining uncertain data. We discuss the models for uncertain data, and how they can be leveraged in a variety of applications. We discuss different methodologies to process and mine uncertain data in a variety of forms.

  • Adapted One-versus-All Decision Trees for Data Stream Classification

    Page(s): 624 - 637

    One versus all (OVA) decision trees learn k individual binary classifiers, each one to distinguish the instances of a single class from the instances of all other classes. Thus OVA differs from most existing data stream classification schemes, which use multiclass classifiers, each one discriminating among all the classes. This paper advocates some outstanding advantages of OVA for data stream classification. First, there is low error correlation and hence high diversity among OVA's component classifiers, which leads to high classification accuracy. Second, OVA is adept at accommodating new class labels that often appear in data streams. However, there also remain many challenges in deploying traditional OVA for classifying data streams. First, as every instance is fed to all component classifiers, OVA is known as an inefficient model. Second, OVA's classification accuracy is adversely affected by the imbalanced class distribution in data streams. This paper addresses those key challenges and consequently proposes a new OVA scheme that is adapted for data stream classification. Theoretical analysis and empirical evidence reveal that the adapted OVA can offer faster training, faster updating and higher classification accuracy than many existing popular data stream classification algorithms.
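    The basic OVA idea summarized in this abstract can be illustrated with a minimal sketch. It is not the paper's adapted scheme: a toy nearest-centroid learner stands in for the decision trees, and the names `CentroidBinary`, `OneVsAll`, and the sample `stream` are hypothetical. The sketch only shows that each class gets its own binary classifier, that every instance is fed to all of them, and that a previously unseen label is handled by adding one more classifier.

```python
import math


class CentroidBinary:
    """Toy binary learner (an assumption, not a decision tree):
    compares the distance to the positive and negative centroids."""

    def __init__(self, dim):
        self.sums = {True: [0.0] * dim, False: [0.0] * dim}
        self.counts = {True: 0, False: 0}

    def update(self, x, is_positive):
        # Incrementally accumulate the running sum for the matching side.
        side = self.sums[is_positive]
        for i, value in enumerate(x):
            side[i] += value
        self.counts[is_positive] += 1

    def _distance(self, x, side):
        n = self.counts[side]
        if n == 0:
            return math.inf  # no examples seen yet on this side
        centroid = [value / n for value in self.sums[side]]
        return math.dist(x, centroid)

    def score(self, x):
        # Higher score means "more likely to belong to the positive class".
        return self._distance(x, False) - self._distance(x, True)


class OneVsAll:
    """One binary classifier per class label seen so far."""

    def __init__(self, dim):
        self.dim = dim
        self.models = {}

    def learn(self, x, label):
        # A new class label only requires adding one more binary classifier.
        if label not in self.models:
            self.models[label] = CentroidBinary(self.dim)
        # Every instance is fed to all component classifiers.
        for cls, model in self.models.items():
            model.update(x, cls == label)

    def predict(self, x):
        # The class whose binary classifier is most confident wins.
        return max(self.models, key=lambda cls: self.models[cls].score(x))


# Hypothetical stream in which class "c" first appears midway.
stream = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"),
    ((0.5, 0.5), "a"),
    ((10.0, 0.0), "c"), ((10.0, 1.0), "c"),
    ((0.0, 0.5), "a"), ((5.0, 5.5), "b"),
]

ova = OneVsAll(dim=2)
for features, label in stream:
    ova.learn(features, label)

print(ova.predict((0.2, 0.3)))
print(ova.predict((9.5, 0.5)))
```

    The loop in `learn` also makes the inefficiency noted in the abstract visible: the cost of each update grows with the number of classes.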

  • Bayes Vector Quantizer for Class-Imbalance Problem

    Page(s): 638 - 651

    The class-imbalance problem is the problem of learning a classification rule from data that are skewed in favor of one class. On these datasets traditional learning techniques tend to overlook the less numerous class, to the advantage of the majority class. However, the minority class is often the most interesting one for the task at hand. For this reason, the class-imbalance problem has received increasing attention in the last few years. In the present paper we point the attention of the reader to a learning algorithm for the minimization of the average misclassification risk. In contrast to some popular class-imbalance learning methods, this method has its roots in statistical decision theory. A particularly interesting characteristic is that when class distributions are unknown, the method can work by resorting to a stochastic gradient algorithm. We study the behavior of this algorithm on imbalanced datasets, demonstrating that this principled approach yields better classification performance than the principal methods proposed in the literature.

  • Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data

    Page(s): 652 - 665

    Sampling has been recognized as an important technique to improve the efficiency of clustering. However, with sampling applied, those points that are not sampled will not have their labels after the normal process. Although there is a straightforward approach in the numerical domain, the problem of how to allocate those unlabeled data points into proper clusters remains a challenging issue in the categorical domain. In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster based on the novel categorical clustering representative, namely, N-Nodeset Importance Representative (abbreviated as NNIR), which represents clusters by the importance of the combinations of attribute values. MARDL has two advantages: (1) MARDL exhibits high execution efficiency, and (2) MARDL can achieve high intracluster similarity and low intercluster similarity, which are regarded as the most important properties of clusters, thus benefiting the analysis of cluster behaviors. MARDL is empirically validated on real and synthetic data sets and is shown to be significantly more efficient than prior work while attaining results of high quality.

  • GridVideo: A Practical Example of Nonscientific Application on the Grid

    Page(s): 666 - 680

    From the 1990s until now, Grid computing has been used mainly in scientific laboratories. Only in the last few years has it been evolving into a business-innovating technology that is driving commercial adoption. In this paper, we describe GridVideo, a Grid-based multimedia application for the distributed tailoring and streaming of media files. The objective is to show, starting from a real experience, how Grid technologies can be used for the development of nonscientific applications. Relevant performance aspects are analyzed, regarding both user-oriented (in terms of responsiveness) and provider-oriented (in terms of system efficiency) requirements. Different multimedia data dissemination strategies have been analyzed and an innovative technique, based on the Fibonacci series, is proposed. To respond to the stringent quality-of-service (QoS) requirements, typical of soft real-time applications, a reservation-based architecture is presented. Such an architecture is able to manage the Grid resource allocation, thus enabling the provisioning of advanced services with different QoS levels. Technical and practical problems encountered during the development are discussed, and a thorough performance evaluation of the developed prototype is presented.

  • Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization

    Page(s): 681 - 698

    In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problems of modularity, flexibility and scalability, we propose a Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy where increasingly global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters. Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with comparable clustering quality to the centralized approach. We also provide a comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy.

  • Exact Knowledge Hiding through Database Extension

    Page(s): 699 - 713

    In this paper, we propose a novel, exact border-based approach that provides an optimal solution for the hiding of sensitive frequent itemsets by (i) minimally extending the original database by a synthetically generated database part, the database extension, (ii) formulating the creation of the database extension as a constraint satisfaction problem, (iii) mapping the constraint satisfaction problem to an equivalent binary integer programming problem, (iv) exploiting underutilized synthetic transactions to proportionally increase the support of non-sensitive itemsets, (v) minimally relaxing the constraint satisfaction problem to provide an approximate solution close to the optimal one when an ideal solution does not exist, and (vi) using a partitioning of the universe of items to increase the efficiency of the proposed hiding algorithm. Extending the original database for sensitive itemset hiding is proved to provide optimal solutions to an extended set of hiding problems compared to previous approaches and to provide solutions of higher quality. Moreover, the application of binary integer programming enables the simultaneous hiding of the sensitive itemsets and thus allows for the identification of globally optimal solutions.

  • GLIP: A Concurrency Control Protocol for Clipping Indexing

    Page(s): 714 - 728

    Multidimensional databases are beginning to be used in a wide range of applications. To meet this fast-growing demand, the R-tree family is being applied to support fast access to multidimensional data, for which the R+-tree exhibits outstanding search performance. In order to support efficient concurrent access in multiuser environments, concurrency control mechanisms for multidimensional indexing have been proposed. However, these mechanisms cannot be directly applied to the R+-tree because an object in the R+-tree may be indexed in multiple leaves. This paper proposes a concurrency control protocol for R-tree variants with object clipping, namely, Granular Locking for clipping indexing (GLIP). GLIP is the first concurrency control approach specifically designed for the R+-tree and its variants, and it supports efficient concurrent operations with serializable isolation, consistency, and deadlock freedom. Experimental tests on both real and synthetic data sets validated the effectiveness and efficiency of the proposed concurrent access framework.

  • Fast Query Point Movement Techniques for Large CBIR Systems

    Page(s): 729 - 743

    Target search in content-based image retrieval (CBIR) systems refers to finding a specific (target) image such as a particular registered logo or a specific historical photograph. Existing techniques, designed around query refinement based on relevance feedback, suffer from slow convergence and do not guarantee finding the intended targets. To address these limitations, we propose several efficient query point movement methods. We prove that our approach is able to reach any given target image with fewer iterations in the worst and average cases. We propose a new index structure and query processing technique to improve retrieval effectiveness and efficiency. We also consider strategies to minimize the effects of users' inaccurate relevance feedback. Extensive experiments in simulated and realistic environments show that our approach significantly reduces the number of required iterations and improves overall retrieval performance. The experimental results also confirm that our approach can always retrieve intended targets even with poor selection of initial query points.

  • Schema Vacuuming in Temporal Databases

    Page(s): 744 - 747

    Temporal databases facilitate the support of historical information by providing functions for indicating the intervals during which a tuple was applicable (along one or more temporal dimensions). Because data are never deleted, only superseded, temporal databases are inherently append-only, resulting over time in a large historical sequence of database states. Data vacuuming in temporal databases allows for this sequence to be shortened by strategically, and irrevocably, deleting obsolete data. Schema versioning allows users to maintain a history of database schemata without compromising the semantics of the data or the ability to view data through historical schemata. While the techniques required for data vacuuming in temporal databases have been relatively well covered, the associated area of vacuuming schemata has received less attention. This paper discusses this issue and proposes a mechanism that fits well with existing methods for data vacuuming and schema versioning.

  • The Subgraph Similarity Problem

    Page(s): 748 - 749

    Similarity is a well-known weakening of bisimilarity where one system is required to simulate the other and vice versa. It has been shown that the subgraph bisimilarity problem, a variation of the subgraph isomorphism problem where isomorphism is weakened to bisimilarity, is NP-complete. We show that the subgraph similarity problem and some related variations thereof still remain NP-complete.
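    To make the notion of similarity in this abstract concrete, here is a minimal sketch that checks whether two nodes of two node-labeled directed graphs simulate each other. It does not address the NP-complete subgraph problem the paper studies; it only computes the largest simulation relation between two given graphs by the standard fixpoint refinement. The node-labeled graph model, the function names `largest_simulation` and `similar`, and the example graphs are assumptions made for illustration.

```python
def largest_simulation(labels_a, edges_a, labels_b, edges_b):
    """Return all pairs (u, v) such that node v of graph B simulates node u of A.

    labels_*: dict mapping node -> label
    edges_*:  dict mapping node -> iterable of successor nodes
    """
    # Start from every pair with matching labels, then refine.
    relation = {
        (u, v)
        for u in labels_a
        for v in labels_b
        if labels_a[u] == labels_b[v]
    }
    changed = True
    while changed:
        changed = False
        for (u, v) in list(relation):
            # Every move u -> u2 must be matched by some move v -> v2
            # with (u2, v2) still in the relation.
            for u2 in edges_a.get(u, ()):
                if not any((u2, v2) in relation for v2 in edges_b.get(v, ())):
                    relation.discard((u, v))
                    changed = True
                    break
    return relation


def similar(u, v, labels_a, edges_a, labels_b, edges_b):
    """u and v are similar if each one simulates the other."""
    forward = largest_simulation(labels_a, edges_a, labels_b, edges_b)
    backward = largest_simulation(labels_b, edges_b, labels_a, edges_a)
    return (u, v) in forward and (v, u) in backward


# Hypothetical example: a0 has an extra dead-end branch (a3) that b0 lacks.
labels_a = {"a0": "x", "a1": "y", "a2": "z", "a3": "y"}
edges_a = {"a0": ["a1", "a3"], "a1": ["a2"]}
labels_b = {"b0": "x", "b1": "y", "b2": "z"}
edges_b = {"b0": ["b1"], "b1": ["b2"]}

print(similar("a0", "b0", labels_a, edges_a, labels_b, edges_b))
print(similar("a3", "b1", labels_a, edges_a, labels_b, edges_b))
```

    In this example `a0` and `b0` simulate each other, yet they are not bisimilar: the move from `a0` to the dead end `a3` can only be answered by `b1`, which can still move on while `a3` cannot. This is the sense in which similarity is weaker than bisimilarity.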

  • IEEE Computer Society 2009 Membership Application

    Page(s): 750 - 752
    Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University