
IEEE Transactions on Knowledge and Data Engineering

Issue 5 • May 2011

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • Authenticated Multistep Nearest Neighbor Search

    Page(s): 641 - 654

    Multistep processing is commonly used for nearest neighbor (NN) and similarity search in applications involving high-dimensional data and/or costly distance computations. Today, many such applications require a proof of result correctness. In this setting, clients issue NN queries to a server that maintains a database signed by a trusted authority. The server returns the NN set along with supplementary information that permits result verification using the data set signature. An adaptation of the multistep NN algorithm incurs prohibitive network overhead due to the transmission of false hits, i.e., records that are not in the NN set, but are nevertheless necessary for its verification. In order to alleviate this problem, we present a novel technique that reduces the size of each false hit. Moreover, we generalize our solution for a distributed setting, where the database is horizontally partitioned over several servers. Finally, we demonstrate the effectiveness of the proposed solutions with real data sets of various dimensionalities.

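The filter-and-refine loop that multistep NN search builds on can be summarized in a few lines. The sketch below shows only the generic multistep algorithm, assuming a cheap `lower_bound` function and a costly `exact_dist` function (both hypothetical placeholders); the signature-based verification and the paper's false-hit reduction are not reproduced here.

```python
import math

def multistep_nn(query, records, lower_bound, exact_dist, k=1):
    # Visit candidates in increasing order of the cheap lower-bound distance.
    order = sorted(range(len(records)), key=lambda i: lower_bound(query, records[i]))
    best = []  # (exact_distance, index), kept sorted, at most k entries
    for i in order:
        lb = lower_bound(query, records[i])
        # Once the bound exceeds the current k-th exact distance, no unseen
        # record can enter the result, so the search terminates early.
        if len(best) == k and lb > best[-1][0]:
            break
        d = exact_dist(query, records[i])      # costly refinement step
        best = sorted(best + [(d, i)])[:k]
    return best

# Toy usage: exact Euclidean distance, with a 1-D projection as the lower bound.
pts = [(1, 9), (2, 2), (8, 3), (5, 5)]
print(multistep_nn((2, 3), pts,
                   lower_bound=lambda q, p: abs(q[0] - p[0]),
                   exact_dist=lambda q, p: math.dist(q, p), k=2))
```
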
  • Branch-and-Bound for Model Selection and Its Computational Complexity

    Page(s): 655 - 668

    Branch-and-bound methods are used in various data analysis problems, such as clustering, seriation, and feature selection. Classical branch-and-bound clustering approaches search through combinations of various partitioning possibilities to optimize a clustering cost. However, these approaches are not practical for clustering image data, where the data size is large. Additionally, the number of clusters is unknown in most image data analysis problems. By taking advantage of the spatial coherency of clusters, we formulate an innovative branch-and-bound approach that solves clustering as a model-selection problem. In this generalized approach, cluster parameter candidates are first generated by spatially coherent sampling. A branch-and-bound search is then carried out through the candidates to select an optimal subset. This paper formulates this approach and investigates its average computational complexity. Experiments demonstrate improved clustering quality and robustness to outliers compared to conventional iterative approaches.

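The overall search can be pictured as a generic branch-and-bound over subsets of candidate cluster parameters. The following is a minimal sketch under illustrative assumptions (a toy 1-D cost that trades distortion against a per-center penalty); it does not reproduce the paper's spatially coherent sampling or its specific cost and bound functions.

```python
def branch_and_bound(candidates, cost, lower_bound):
    """Enumerate subsets of `candidates`, pruning branches whose optimistic
    bound cannot beat the best cost found so far."""
    best = (float("inf"), None)

    def recurse(chosen, remaining):
        nonlocal best
        if lower_bound(chosen, remaining) >= best[0]:
            return                       # prune this branch
        c = cost(chosen)
        if c < best[0]:
            best = (c, list(chosen))
        for i, cand in enumerate(remaining):
            recurse(chosen + [cand], remaining[i + 1:])

    recurse([], list(candidates))
    return best

# Toy model selection: pick 1-D cluster centers; each extra center pays a penalty.
points = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
penalty = 1.0

def cost(chosen):
    if not chosen:
        return float("inf")
    distortion = sum(min((p - c) ** 2 for c in chosen) for p in points)
    return distortion + penalty * len(chosen)

def bound(chosen, remaining):
    # Any completion of `chosen` pays at least the penalties accrued so far.
    return penalty * len(chosen)

print(branch_and_bound([0.15, 5.0, 2.5], cost, bound))   # -> picks 0.15 and 5.0
```
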
  • Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme

    Page(s): 669 - 682

    E-mail communication is indispensable nowadays, but the e-mail spam problem continues to grow drastically. In recent years, the notion of collaborative spam filtering with a near-duplicate similarity matching scheme has been widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent near-duplicate spam. To achieve efficient similarity matching and reduce storage utilization, prior works mainly represent each e-mail by a succinct abstraction derived from the e-mail's text content. However, these abstractions cannot fully capture the evolving nature of spam and are thus not effective enough for near-duplicate detection. In this paper, we propose a novel e-mail abstraction scheme that uses the e-mail layout structure to represent e-mails. We present a procedure to generate the e-mail abstraction from the HTML content of an e-mail, and this newly devised abstraction can more effectively capture the near-duplicate phenomenon of spam. Moreover, we design a complete spam detection system, Cosdes (COllaborative Spam DEtection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme enables Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live data set collected from a real e-mail server and show that our system outperforms prior approaches in detection results and is applicable to the real world.

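As an illustration of representing an e-mail by its layout rather than its text, the sketch below derives a crude abstraction from the HTML tag sequence; the actual Cosdes abstraction procedure and its matching and update schemes are more elaborate than this.

```python
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Collects the sequence of opening/closing tags, ignoring all text."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")
    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")

def email_abstraction(html_body):
    parser = TagSequence()
    parser.feed(html_body)
    return tuple(parser.tags)

# Two spam messages that differ only in wording (a common near-duplicate trick)
# map to the same layout abstraction.
a = email_abstraction("<html><body><p>Buy now!</p><a href='x'>link</a></body></html>")
b = email_abstraction("<html><body><p>Cheap meds</p><a href='y'>here</a></body></html>")
print(a == b)   # True
```
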
  • Discovering Conditional Functional Dependencies

    Page(s): 683 - 698

    This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) that supports patterns of semantically related constants, and they can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs; indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets and is used to discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs; it leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose from for different applications.

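For readers unfamiliar with constant CFDs, the small helper below checks whether one holds on a relation (a list of dicts); it is only meant to make the notion concrete, and none of the three discovery algorithms (CFDMiner, CTANE, FastCFD) is reproduced here.

```python
def constant_cfd_holds(relation, lhs_pattern, rhs_attr, rhs_value):
    """A constant CFD such as ([country = 'UK'] -> [capital = 'London'])
    requires every tuple matching the left-hand constants to carry the
    right-hand constant."""
    for t in relation:
        if all(t.get(a) == v for a, v in lhs_pattern.items()):
            if t.get(rhs_attr) != rhs_value:
                return False
    return True

rows = [
    {"country": "UK", "capital": "London"},
    {"country": "UK", "capital": "London"},
    {"country": "NL", "capital": "Amsterdam"},
]
print(constant_cfd_holds(rows, {"country": "UK"}, "capital", "London"))  # True
print(constant_cfd_holds(rows, {"country": "NL"}, "capital", "Paris"))   # False
```
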
  • Intertemporal Discount Factors as a Measure of Trustworthiness in Electronic Commerce

    Page(s): 699 - 712

    In multiagent interactions, such as e-commerce and file sharing, being able to accurately assess the trustworthiness of others is important for agents to protect themselves from losing utility. Focusing on rational agents in e-commerce, we prove that an agent's discount factor (its time preference over utility) is a direct measure of the agent's trustworthiness under a set of reasonably general assumptions and definitions. We propose a general list of desiderata for trust systems and discuss how discount factors as trustworthiness meet these desiderata. We also discuss how discount factors are a robust measure when agents enter commitments that exhibit moral hazards. Using an online market as a motivating example, we derive analytical methods both for measuring discount factors and for aggregating the measurements.

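A toy calculation (my own illustration, not the paper's model) shows the intuition behind treating the discount factor as trustworthiness: an agent that values future utility highly has more to lose by defecting on a commitment than it gains from a one-off cheat.

```python
def discounted_value(per_round_utility, delta, horizon=1000):
    # Present value of receiving `per_round_utility` in every future round.
    return sum(per_round_utility * delta ** t for t in range(horizon))

def prefers_to_cooperate(coop_utility, cheat_utility, delta):
    # Cheating yields a one-off payoff but ends the repeated relationship.
    return discounted_value(coop_utility, delta) >= cheat_utility

print(prefers_to_cooperate(coop_utility=1.0, cheat_utility=10.0, delta=0.95))  # True
print(prefers_to_cooperate(coop_utility=1.0, cheat_utility=10.0, delta=0.50))  # False
```
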
  • Mining Discriminative Patterns for Classifying Trajectories on Road Networks

    Page(s): 713 - 726

    Classification has been used for modeling many kinds of data sets, including sets of items, text documents, graphs, and networks. However, a new kind of data, trajectories on road networks, has received little study. Modeling such data is useful with the emerging GPS and RFID technologies and is important for effective transportation and traffic planning. In this work, we study methods for classifying trajectories on road networks. By analyzing the behavior of trajectories on road networks, we observe that, in addition to the locations that vehicles have visited, the order of these visited locations is crucial for improving classification accuracy. Based on our analysis, we contend that (frequent) sequential patterns are good feature candidates since they preserve this order information. Furthermore, when mining sequential patterns, we propose to confine the length of the sequential patterns to ensure high efficiency. Compared with closed sequential patterns, these partial (i.e., length-confined) sequential patterns allow us to significantly improve efficiency almost without losing accuracy. In this paper, we present a framework for frequent-pattern-based classification of trajectories on road networks. Our comparative study over a broad range of classification approaches demonstrates that our method significantly improves accuracy over other methods on some synthetic and real trajectory data sets.

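To see why order-preserving features matter, the sketch below extracts length-confined, contiguous patterns of visited road segments; this is a simplified stand-in for the paper's frequent sequential pattern mining, used only to illustrate that trajectories over the same locations in a different order yield different features.

```python
from collections import Counter

def length_confined_patterns(trajectory, max_len=3):
    """All contiguous subsequences of road-segment ids up to max_len."""
    feats = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(trajectory) - n + 1):
            feats[tuple(trajectory[i:i + n])] += 1
    return feats

t1 = ["r1", "r2", "r3", "r4"]
t2 = ["r4", "r3", "r2", "r1"]          # same segments, reversed order
print(set(t1) == set(t2))                                              # True
print(length_confined_patterns(t1) == length_confined_patterns(t2))    # False
```
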
  • Pareto-Based Dominant Graph: An Efficient Indexing Structure to Answer Top-K Queries

    Page(s): 727 - 741

    Given a record set D and a query score function F, a top-k query returns the k records from D whose values of F on their attributes are the highest. In this paper, we investigate the intrinsic connection between top-k queries and the dominance relationships between records and, based on it, propose an efficient layer-based indexing structure, the Pareto-Based Dominant Graph (DG), to improve query efficiency. Specifically, the DG is built offline to express the dominance relationships between records, and a top-k query is answered as a graph traversal problem by the Traveler algorithm. We prove theoretically that the size of the search space (that is, the number of records retrieved from the record set to answer a top-k query) in our algorithm is directly related to the cardinality of skyline points in the record set (see Theorem 3). Considering I/O cost, we propose a cluster-based storage schema to reduce the I/O cost of the Traveler algorithm, and we also propose cost estimation methods. Based on the cost analysis, we propose an optimization technique, the pseudorecord, to further improve search efficiency. To handle top-k queries over high-dimensional record sets, we also propose the N-Way Traveler algorithm. To handle DG maintenance efficiently, we propose Insertion and Deletion algorithms for the DG. Finally, extensive experiments demonstrate that the proposed methods offer significant improvements over their counterparts, including both classical and state-of-the-art top-k algorithms.

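The dominance relation that the DG is layered on can be sketched as follows (higher attribute values assumed better); this only illustrates the skyline layering, not the DG edges, the Traveler traversal, or the paper's optimizations.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every attribute and strictly
    better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline_layers(records):
    """Peel records into layers: layer 0 is the skyline, layer 1 the skyline of
    the rest, and so on. Any monotone score function attains its top-1 record
    in layer 0, which is why layering helps answer top-k queries."""
    remaining, layers = list(records), []
    while remaining:
        layer = [r for r in remaining
                 if not any(dominates(o, r) for o in remaining if o is not r)]
        layers.append(layer)
        remaining = [r for r in remaining if r not in layer]
    return layers

print(skyline_layers([(1, 9), (9, 1), (5, 5), (2, 2), (1, 1)]))
# [[(1, 9), (9, 1), (5, 5)], [(2, 2)], [(1, 1)]]
```
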
  • RFID Data Processing in Supply Chain Management Using a Path Encoding Scheme

    Page(s): 742 - 758

    RFID technology can be applied to a broad range of areas; in particular, it is very useful in business settings such as supply chain management. However, the amount of RFID data in such an environment is huge, so much time is needed to extract valuable information from it. In this paper, we present an efficient method to process a massive amount of RFID data for supply chain management. We first define query templates to analyze the supply chain. We then propose an effective path encoding scheme that encodes the flows of products. However, if the flows are long, the numbers that encode them in the path encoding scheme become very large; we solve this by providing a method that divides flows. To retrieve the time information for products efficiently, we utilize a numbering scheme developed for XML data. Based on the path encoding scheme and the numbering scheme, we devise a storage scheme that can process tracking queries and path-oriented queries efficiently on an RDBMS. Finally, we propose a method that translates these queries into SQL queries. Experimental results show that our approach can process the queries efficiently.

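One way to make a path encoding concrete is the prime-product scheme sketched below; this is an assumption chosen for illustration rather than the paper's exact scheme, it ignores visit order and time (which the paper handles with its numbering scheme), and it mainly shows why long flows produce very large numbers.

```python
PRIMES = {"factory": 2, "warehouse": 3, "distributor": 5, "store": 7}

def encode_flow(locations):
    """Encode a flow as the product of the primes of the locations it visits."""
    code = 1
    for loc in locations:
        code *= PRIMES[loc]
    return code

def visits(code, location):
    # "Does this flow pass through the location?" becomes a divisibility test.
    return code % PRIMES[location] == 0

flow = encode_flow(["factory", "warehouse", "store"])
print(flow)                          # 42
print(visits(flow, "warehouse"))     # True
print(visits(flow, "distributor"))   # False
```
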
  • Semantic Knowledge-Based Framework to Improve the Situation Awareness of Autonomous Underwater Vehicles

    Page(s): 759 - 773

    This paper proposes a semantic world model framework for the hierarchical distributed representation of knowledge in autonomous underwater systems. The framework aims to provide a more capable and holistic system by establishing semantic interoperability among all involved information sources, which enhances interoperability, independence of operation, and situation awareness of the embedded service-oriented agents for autonomous platforms. The results obtained specifically benefit mission flexibility, robustness, and autonomy. The framework builds on the idea that heterogeneous real-world data of very different types must be processed by (and run through) several different layers before it becomes available, in a suitable format and in the right place, to high-level decision-making agents. In this sense, the presented approach shows how to abstract away from raw real-world data step by step by means of semantic technologies. The paper concludes by demonstrating the benefits of the framework in a real scenario: a hardware fault is simulated in a REMUS 100 AUV while it performs a mission, which triggers a knowledge exchange between the status monitoring agent and the embedded adaptive mission planner agent. Using the proposed framework, both services can exchange information while remaining domain independent during their interaction with the platform. The results of this paper are readily applicable to land and air robotics.

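The layered abstraction idea can be pictured with a toy pipeline; everything below (component names, the fault condition, the triple vocabulary) is an illustrative assumption, not the paper's ontology or agent interfaces.

```python
RAW = {"thruster_rpm": 0, "commanded_rpm": 900}        # raw driver-level data

def observation_layer(raw):
    # Lift raw values into a typed observation.
    return {"component": "thruster", "responding": raw["thruster_rpm"] > 0}

def semantic_layer(obs):
    # Platform-independent assertions as simple subject-predicate-object triples.
    status = "Nominal" if obs["responding"] else "Fault"
    return [(obs["component"], "hasStatus", status)]

facts = semantic_layer(observation_layer(RAW))
if ("thruster", "hasStatus", "Fault") in facts:
    # A domain-independent planner can react to the semantic fact without
    # knowing anything about thruster drivers or RPM values.
    print("mission planner: re-plan without the faulty thruster")
```
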
  • SwiftRule: Mining Comprehensible Classification Rules for Time Series Analysis

    Page(s): 774 - 787

    In this article, we provide a new technique for temporal data mining that is based on classification rules which human domain experts can easily understand. Basically, time series are decomposed into short segments, and short-term trends of the time series within the segments (e.g., average, slope, and curvature) are described by means of polynomial models. The classifiers then assess short sequences of trends in successive segments with their rule premises, and the rule conclusions gradually assign an input to a class. As the classifier is a generative model of the processes from which the time series are assumed to originate, anomalies can be detected as well. Segmentation and piecewise polynomial modeling are done extremely fast in only one pass over the time series, so the approach is applicable to problems with harsh timing constraints. We lay the theoretical foundations for this classifier, including a new distance measure for time series and a new technique to construct a dynamic classifier from a static one, and demonstrate its properties on various benchmark time series, for example, Lorenz attractor time series, energy consumption in a building, and ECG data.

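The segment-and-fit step described above can be sketched in a few lines with NumPy: the series is cut into short fixed-length segments and a degree-2 polynomial is fitted to each, yielding per-segment offset, slope, and curvature. The segment length is an arbitrary choice here, and the rule-based classifier built on top of these trends is not shown.

```python
import numpy as np

def segment_trends(series, seg_len=20):
    """Fit a quadratic to each fixed-length segment; return (offset, slope,
    curvature) per segment as simple short-term trend descriptors."""
    trends = []
    for start in range(0, len(series) - seg_len + 1, seg_len):
        seg = series[start:start + seg_len]
        x = np.arange(seg_len)
        curvature, slope, offset = np.polyfit(x, seg, deg=2)
        trends.append((offset, slope, curvature))
    return trends

t = np.linspace(0, 4 * np.pi, 200)
print(segment_trends(np.sin(t))[:3])    # trends of the first three segments
```
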
  • When Does Cotraining Work in Real Data?

    Page(s): 788 - 799

    Cotraining, a paradigm of semisupervised learning, promises to effectively alleviate the shortage of labeled examples in supervised learning. The standard two-view cotraining requires the data set to be described by two views of features, and previous studies have shown that cotraining works well if the two views satisfy the sufficiency and independence assumptions. In practice, however, these two assumptions are often not known or ensured (even when the two views are given). More commonly, most supervised data sets are described by one set of attributes (one view); thus, they need to be split into two views in order to apply standard two-view cotraining. In this paper, we first propose a novel approach to empirically verify the two assumptions of cotraining given two views. Then, we design several methods to split single-view data sets into two views in order to make cotraining work reliably well. Our empirical results show that, given a whole or a large labeled training set, our view verification and splitting methods are quite effective. Unfortunately, cotraining is called for precisely when the labeled training set is small. Given small labeled training sets, we show that the two cotraining assumptions are difficult to verify and that view splitting is unreliable. Our conclusions about cotraining's effectiveness are therefore mixed: if two views are given and known to satisfy the two assumptions, cotraining works well; otherwise, with only a small labeled training set, verifying the assumptions or splitting a single view into two views is unreliable, so it is uncertain whether standard cotraining will work.

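The standard two-view cotraining loop discussed above is easy to state in code. The sketch below uses synthetic data, naive Bayes base learners, and one confident example promoted per view per round, all of which are illustrative simplifications.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
view1, view2 = X[:, :2], X[:, 2:]               # the two feature views
labeled = list(range(10))                        # small labeled seed set
pseudo = {i: int(y[i]) for i in labeled}         # known or self-assigned labels
unlabeled = list(range(10, 200))

for _ in range(30):                              # cotraining rounds
    c1, c2 = GaussianNB(), GaussianNB()
    yl = np.array([pseudo[i] for i in labeled])
    c1.fit(view1[labeled], yl)
    c2.fit(view2[labeled], yl)
    # Each classifier teaches the other: it labels its single most confident
    # unlabeled example with its own predicted class.
    for clf, view in ((c1, view1), (c2, view2)):
        if not unlabeled:
            break
        proba = clf.predict_proba(view[unlabeled])
        pick = int(np.argmax(proba.max(axis=1)))
        idx = unlabeled.pop(pick)
        pseudo[idx] = int(clf.classes_[np.argmax(proba[pick])])
        labeled.append(idx)

print("view-1 accuracy:", c1.score(view1, y))
print("view-2 accuracy:", c2.score(view2, y))
```
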
  • IEEE and IEEE Computer Society Special Student Offer

    Page(s): 800
    Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University