
IEEE Transactions on Knowledge and Data Engineering

Issue 1 • January 2010


Contents: 22 items
  • [Front cover]

    Page(s): c1
    PDF (152 KB)
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (209 KB)
    Freely Available from IEEE
  • State of the Transactions Editorial

    Page(s): 1
    PDF (43 KB)
    Freely Available from IEEE
  • Anonymous Query Processing in Road Networks

    Page(s): 2 - 15
    PDF (2753 KB) | HTML

    The increasing availability of location-aware mobile devices has given rise to a flurry of location-based services (LBSs). Due to the nature of spatial queries, an LBS needs the user's position in order to process her requests. On the other hand, revealing exact user locations to a (potentially untrusted) LBS may pinpoint users' identities and breach their privacy. To address this issue, spatial anonymity techniques obfuscate user locations, forwarding to the LBS a sufficiently large region instead. Existing methods explicitly target processing in the Euclidean space and do not apply when proximity to the users is defined according to network distance (e.g., driving time through the roads of a city). In this paper, we propose a framework for anonymous query processing in road networks. We design location obfuscation techniques that (1) provide anonymous LBS access to the users and (2) allow efficient query processing at the LBS side. Our techniques exploit existing network database infrastructure, requiring no specialized storage schemes or functionalities. We experimentally compare alternative designs on real road networks and demonstrate the effectiveness of our techniques.

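    A point the abstract turns on is that Euclidean proximity can be a poor proxy for road-network proximity. Below is a minimal sketch of that contrast, not the paper's framework; the toy graph, coordinates, and edge weights are invented for illustration.

```python
import heapq
import math

# Toy road network: nodes are intersections with (x, y) coordinates and
# edges carry travel costs (e.g., driving time). All names, coordinates,
# and weights here are invented for illustration.
coords = {"A": (0, 0), "B": (4, 0), "C": (4, 3), "D": (0, 3)}
graph = {
    "A": [("B", 12.0), ("D", 3.0)],
    "B": [("A", 12.0), ("C", 3.0)],
    "C": [("B", 3.0), ("D", 4.0)],
    "D": [("A", 3.0), ("C", 4.0)],
}

def network_distance(src, dst):
    """Shortest-path distance through the road network (Dijkstra)."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def euclidean_distance(src, dst):
    (x1, y1), (x2, y2) = coords[src], coords[dst]
    return math.hypot(x2 - x1, y2 - y1)

# A and B are close in Euclidean space but far apart on the network
# (3 + 4 + 3 = 10 via D and C), so a Euclidean anonymization region is
# a poor proxy for network proximity.
print(euclidean_distance("A", "B"))  # 4.0
print(network_distance("A", "B"))    # 10.0
```
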
  • Density Conscious Subspace Clustering for High-Dimensional Data

    Page(s): 16 - 30
    PDF (2946 KB) | HTML

    Instead of finding clusters in the full feature space, subspace clustering is an emerging task that aims at detecting clusters embedded in subspaces. Most previous works in the literature are density-based approaches, where a cluster is regarded as a high-density region in a subspace. However, the identification of dense regions in previous works fails to consider a critical problem, called "the density divergence problem" in this paper, which refers to the phenomenon that region densities vary across different subspace cardinalities. Without considering this problem, previous works use a single density threshold to discover the dense regions in all subspaces, which incurs a serious loss of clustering accuracy (in either recall or precision of the resulting clusters) in different subspace cardinalities. To tackle the density divergence problem, in this paper, we devise a novel subspace clustering model that discovers clusters based on the relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities. Since previous techniques cannot be applied to this novel clustering model, we also devise an innovative algorithm, referred to as DENCOS (density conscious subspace clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities. As validated by our extensive experiments on various data sets, DENCOS can discover the clusters in all subspaces with high quality, and it outperforms previous works in efficiency.

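    The density divergence problem can be seen in a few lines of code: with a fixed grid and a fixed data set, average region density drops sharply as subspace cardinality grows, so no single density threshold works everywhere. This is a hedged illustration of the problem only, not the DENCOS algorithm; the data and grid size are invented.

```python
import random
from collections import Counter

# Illustrating the "density divergence problem" only (not the DENCOS
# algorithm): region densities shrink rapidly as subspace cardinality
# grows, so no single density threshold suits every subspace.
random.seed(0)
n, dims, side = 10000, 3, 0.2  # 10k uniform points, grid cells of side 0.2
points = [[random.random() for _ in range(dims)] for _ in range(n)]

for k in range(1, dims + 1):  # subspace cardinality k
    cells = Counter(tuple(int(p[j] / side) for j in range(k)) for p in points)
    avg = sum(cells.values()) / len(cells)
    print(f"{k}-dim subspace: average points per grid cell = {avg:.1f}")
# Roughly 2000 per cell in 1-D, 400 in 2-D, 80 in 3-D; a threshold tuned
# for 1-D subspaces would declare nothing dense in 3-D.
```
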
  • Development of a Bayesian Framework for Determining Uncertainty in Receiver Operating Characteristic Curve Estimates

    Page(s): 31 - 45
    PDF (1789 KB) | HTML

    This research uses a Bayesian framework to develop probability densities for the receiver operating characteristic (ROC) curve. The ROC curve is a discrimination metric that may be used to quantify how well a detection system classifies targets and nontargets. The degree of uncertainty in ROC curve formulation is a concern that previous research has not adequately addressed. This research formulates a probability density for the ROC curve and characterizes its uncertainty using confidence bands. Methods for the generation and characterization of the probability densities of the ROC curve are specified and demonstrated, where the initial analysis employs beta densities to model target and nontarget samples of detection system output. For given target and nontarget data, given functional forms of the data densities (such as beta density forms) and given prior densities of the form parameters, the methods developed here provide exact performance metric probability densities.

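    To make the setting concrete, the sketch below simulates beta-distributed detection scores for targets and nontargets and traces empirical ROC points by sweeping the decision threshold. The beta parameters are invented, and the paper's Bayesian posterior and confidence-band machinery is not reproduced.

```python
import random

# Simulated detection-system outputs: beta-distributed scores for targets
# and nontargets (parameter values invented).
random.seed(1)
targets    = [random.betavariate(5, 2) for _ in range(5000)]  # score higher
nontargets = [random.betavariate(2, 5) for _ in range(5000)]  # score lower

def roc_point(threshold):
    """One (FPR, TPR) point: declare 'target' when score >= threshold."""
    tpr = sum(s >= threshold for s in targets) / len(targets)
    fpr = sum(s >= threshold for s in nontargets) / len(nontargets)
    return fpr, tpr

for t in (0.3, 0.5, 0.7):
    fpr, tpr = roc_point(t)
    print(f"threshold {t:.1f}: FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```
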
  • Learning with Positive and Unlabeled Examples Using Topic-Sensitive PLSA

    Page(s): 46 - 58
    PDF (2671 KB) | HTML

    It is often difficult and time-consuming to provide a large amount of positive and negative examples for training a classification system in many applications, such as information retrieval. Instead, users often find it easier to indicate just a few positive examples of what they like, and thus, these are the only labeled examples available to the learning system. A large amount of unlabeled data, by contrast, is easy to obtain. How to make use of the positive and unlabeled data for learning is a critical problem in machine learning and information retrieval. Several approaches for solving this problem have been proposed in the past, but most of these methods do not work well when only a small amount of labeled positive data is available. In this paper, we propose a novel algorithm called Topic-Sensitive pLSA to solve this problem. This algorithm extends the original probabilistic latent semantic analysis (pLSA), which is a purely unsupervised framework, by injecting a small amount of supervision information from the user. The supervision from users takes the form of indicating which documents fit the users' interests. The supervision is encoded into a set of constraints. By introducing penalty terms for these constraints, we propose an objective function that trades off the likelihood of the observed data against the enforcement of the constraints. We develop an iterative algorithm that can obtain a local optimum of the objective function. Experimental evaluation on three data corpora shows that the proposed method improves performance, especially when only a small amount of labeled positive data is available.

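    A plausible shape of the trade-off the abstract describes is the standard pLSA log-likelihood plus weighted constraint penalties; the penalty terms Ω_c and weight λ below are placeholders, as the paper defines its own forms.

```latex
\max_{\theta}\;
  \underbrace{\sum_{d}\sum_{w} n(d,w)\,
      \log \sum_{z} P(w \mid z)\, P(z \mid d)}_{\text{likelihood of the observed data}}
  \;-\;
  \lambda \underbrace{\sum_{c \in \mathcal{C}} \Omega_{c}(\theta)}_{\text{penalties for violated constraints}}
```
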
  • LIGHT: A Query-Efficient Yet Low-Maintenance Indexing Scheme over DHTs

    Page(s): 59 - 75
    PDF (2779 KB) | HTML

    The distributed hash table (DHT) is a widely used building block for scalable P2P systems. However, as the uniform hashing employed in DHTs destroys data locality, it is not a trivial task to support complex queries (e.g., range queries and k-nearest-neighbor queries) in DHT-based P2P systems. In order to support efficient processing of such complex queries, a popular solution is to build indexes on top of the DHT. Unfortunately, existing over-DHT indexing schemes suffer from either query inefficiency or high maintenance cost. In this paper, we propose the LIGhtweight Hash Tree (LIGHT), a query-efficient yet low-maintenance indexing scheme. LIGHT employs a novel naming mechanism and a tree summarization strategy for graceful distribution of its index structure. We show through analysis that it can support various complex queries with near-optimal performance. Extensive experimental results also demonstrate that, compared with state-of-the-art over-DHT indexing schemes, LIGHT saves 50-75 percent of index maintenance cost and substantially improves query performance in terms of both response time and bandwidth consumption. In addition, LIGHT is designed over generic DHTs and hence can be easily implemented and deployed in any DHT-based P2P system.

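    The core obstacle the abstract names, uniform hashing destroying data locality, is easy to demonstrate: consecutive keys hash to unrelated nodes, so a range query on a raw DHT must touch many nodes. A small sketch follows (toy 16-node ring and invented keys; LIGHT's naming mechanism and tree summarization are not shown).

```python
import hashlib

# Uniform hashing scatters adjacent keys across the identifier space,
# which is why complex queries need an index built over the DHT.
def node_for(key, nodes=16):
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % nodes

for key in ["temp-0041", "temp-0042", "temp-0043", "temp-0044"]:
    print(key, "-> node", node_for(key))
# The consecutive keys typically land on unrelated nodes, so answering
# the range query temp-0041..temp-0044 means contacting many nodes or
# maintaining an index over the DHT.
```
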
  • MILD: Multiple-Instance Learning via Disambiguation

    Page(s): 76 - 89
    PDF (2301 KB) | HTML

    In multiple-instance learning (MIL), an individual example is called an instance, and a bag contains one or more instances. The class labels available in the training set are associated with bags rather than instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is labeled negative. Since a positive bag may contain some negative instances in addition to one or more positive instances, the true labels of the instances in a positive bag may or may not match the corresponding bag label, and consequently, the instance labels are inherently ambiguous. In this paper, we propose a very efficient and robust MIL method, called Multiple-Instance Learning via Disambiguation (MILD), for general MIL problems. First, we propose a novel disambiguation method to identify the true positive instances in the positive bags. Second, we propose two feature representation schemes, one for instance-level classification and the other for bag-level classification, to convert the MIL problem into a standard single-instance learning (SIL) problem that can be solved by well-known SIL algorithms, such as support vector machines. Third, an inductive semi-supervised learning method is proposed for MIL. We evaluate our methods extensively on several challenging MIL applications to demonstrate their promising efficiency, robustness, and accuracy.

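    A tiny sketch of the MIL labeling rule stated in the abstract, on invented toy bags: a bag is positive iff at least one of its instances is positive, which is exactly why instance labels inside positive bags are ambiguous.

```python
# Toy bags with hidden instance labels (not the MILD algorithm itself).
bags = {
    "bag1": [0, 0, 1],  # one hidden positive instance -> positive bag
    "bag2": [0, 0, 0],  # all negative -> negative bag
    "bag3": [1, 1, 0],
}

def bag_label(instance_labels):
    """A bag is positive iff at least one instance is positive."""
    return int(any(instance_labels))

for name, instances in bags.items():
    print(name, "->", "positive" if bag_label(instances) else "negative")
# Only the bag labels are observed during training; deciding which
# instances in bag1 and bag3 are truly positive is the disambiguation
# step MILD addresses.
```
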
  • Modeling Massive RFID Data Sets: A Gateway-Based Movement Graph Approach

    Page(s): 90 - 104
    PDF (1852 KB) | HTML

    Massive radio frequency identification (RFID) data sets are expected to become commonplace in supply chain management systems. Warehousing and mining this data is an essential problem with great potential benefits for inventory management, object tracking, and product procurement processes. Since RFID tags can be used to identify each individual item, enormous amounts of location-tracking data are generated. With such data, object movements can be modeled by movement graphs, where nodes correspond to locations and edges record the history of item transitions between locations. In this study, we develop a movement graph model as a compact representation of RFID data sets. Since spatiotemporal as well as item information can be associated with the objects in such a model, the movement graph can be huge, complex, and multidimensional in nature. We show that such a graph can be better organized around gateway nodes, which serve as bridges connecting different regions of the movement graph. A graph-based object movement cube can be constructed by merging and collapsing nodes and edges according to an application-oriented topological structure. Moreover, we propose an efficient cubing algorithm that performs simultaneous aggregation of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.

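    A minimal sketch of the movement-graph representation the abstract describes: locations become nodes and item transitions become weighted edges. The records below are invented (tag id, location, timestamp), and the paper's gateway-based organization and cubing algorithm are not reproduced.

```python
from collections import defaultdict

# Invented RFID readings: (tag id, location, timestamp).
readings = [
    ("tag1", "factory", 1), ("tag1", "port", 5), ("tag1", "store", 9),
    ("tag2", "factory", 2), ("tag2", "port", 6), ("tag2", "warehouse", 8),
    ("tag3", "factory", 2), ("tag3", "port", 7), ("tag3", "store", 11),
]

# Group each item's readings in time order, then count transitions.
trajectories = defaultdict(list)
for tag, loc, ts in sorted(readings, key=lambda r: (r[0], r[2])):
    trajectories[tag].append(loc)

edges = defaultdict(int)  # (from_location, to_location) -> number of items
for path in trajectories.values():
    for a, b in zip(path, path[1:]):
        edges[(a, b)] += 1

for (a, b), count in sorted(edges.items()):
    print(f"{a} -> {b}: {count} items")
# Every trajectory here passes through "port", making it a natural
# gateway node around which to organize the graph.
```
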
  • Object and Combination Shedding Schemes for Adaptive Media Workflow Execution

    Page(s): 105 - 119
    PDF (3171 KB) | HTML

    Complex media fusion operations can be costly in terms of the time they need to process input objects. If data arrive at fusion nodes faster than the speed at which they can be consumed, some input objects will not be processed. In this paper, we develop load shedding mechanisms that take into consideration both data quality and the expensive nature of media fusion operators. In particular, we present quality assessment models for objects and multistream fusion operators and highlight that such quality assessments may impose partial orders on objects. We show that the most effective load control approach for fusion operators involves shedding not individual input objects but combinations of objects. Yet, identifying suitable combinations of objects in real time is not possible without efficient combination selection algorithms. We develop efficient combination selection schemes for scenarios with different quality assessment and target characteristics. We first develop efficient combination-based load shedding for the case where the fusion operator has unambiguously monotone semantics. We then extend this to the more general ambiguously monotone case and present experimental results that show the performance gains of quality-aware combination-based load shedding strategies under the various fusion scenarios.

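    A hedged sketch of combination-based shedding for an unambiguously monotone fusion operator: combination quality is taken here to be the minimum of member qualities (one simple monotone choice), and the lowest-quality combinations are shed when the node is over capacity. Streams, scores, and capacity are invented, and the brute-force enumeration is only for illustration; the paper develops selection schemes that avoid materializing every combination.

```python
import heapq
from itertools import product

stream_a = {"a1": 0.9, "a2": 0.4}             # object -> quality score
stream_b = {"b1": 0.8, "b2": 0.3, "b3": 0.6}

def fusion_quality(qualities):
    return min(qualities)                     # monotone in each input

# Enumerate cross-stream combinations with their fusion quality.
combos = [
    ((a, b), fusion_quality((qa, qb)))
    for (a, qa), (b, qb) in product(stream_a.items(), stream_b.items())
]

capacity = 3                                  # combinations we can process
kept = heapq.nlargest(capacity, combos, key=lambda c: c[1])
shed = [c for c in combos if c not in kept]
print("process:", kept)
print("shed:   ", shed)
```
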
  • The Dynamic Bloom Filters

    Page(s): 120 - 133
    PDF (2173 KB) | HTML

    A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets, and we design the necessary algorithms for item insertion, membership query, item deletion, and filter union. The dynamic Bloom filter can keep the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, these analytical results hold in stand-alone applications as well as distributed applications.

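    A compact sketch of the idea as the abstract states it: a dynamic Bloom filter keeps a list of fixed-size Bloom filters, opens a new one when the active filter reaches capacity, and answers a query as the OR of its members' answers. The sizing and capacity below are arbitrary, not the paper's analysis.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m=8192, k=3):
        self.m, self.k, self.bits, self.count = m, k, 0, 0

    def _probes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits |= 1 << p
        self.count += 1

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._probes(item))

class DynamicBloomFilter:
    """List of fixed-size Bloom filters; a new one is opened when the
    active filter fills, and a query ORs the member answers."""
    def __init__(self, capacity=500, m=8192, k=3):
        self.capacity, self.m, self.k = capacity, m, k
        self.filters = [BloomFilter(m, k)]

    def add(self, item):
        if self.filters[-1].count >= self.capacity:
            self.filters.append(BloomFilter(self.m, self.k))  # expand
        self.filters[-1].add(item)

    def __contains__(self, item):
        return any(item in f for f in self.filters)

dbf = DynamicBloomFilter()
for i in range(1200):
    dbf.add(f"item-{i}")
# Three member filters; a present item is found, an absent one almost
# surely is not.
print(len(dbf.filters), "item-7" in dbf, "item-9999" in dbf)
```
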
  • Aging Bloom Filter with Two Active Buffers for Dynamic Sets

    Page(s): 134 - 138
    PDF (1514 KB) | HTML

    A Bloom filter is a simple but powerful data structure that can check membership in a static set. As Bloom filters become more popular for network applications, membership queries over dynamic sets are also required. Some network applications require high-speed processing of packets; for this purpose, Bloom filters should reside in fast, small memory (SRAM). In this case, due to the limited memory size, stale data in the Bloom filter must be deleted to make space for new data; that is, the Bloom filter needs aging, much like an LRU cache. In this paper, we propose a new aging scheme for Bloom filters. The proposed scheme utilizes the memory space more efficiently than double buffering, the current state of the art. We prove theoretically that the proposed scheme outperforms double buffering. We also perform experiments on real Internet traces to verify the effectiveness of the proposed scheme.

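    For context, here is a simplified sketch of double buffering, the baseline aging scheme the abstract compares against: when the active buffer fills, it is demoted to standby (whose contents are discarded) and a fresh active buffer takes over, so stale entries age out after two generations. Plain sets stand in for the two fixed-size Bloom filters, the capacity is invented, and the paper's own two-active-buffer scheme is not reproduced.

```python
class DoubleBufferedSet:
    """Double-buffering aging with sets standing in for Bloom filters."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.active, self.standby = set(), set()

    def add(self, item):
        if len(self.active) >= self.capacity:
            # Age out: discard old standby, demote active, start fresh.
            self.active, self.standby = set(), self.active
        self.active.add(item)

    def __contains__(self, item):
        return item in self.active or item in self.standby

cache = DoubleBufferedSet()
for flow in ["f1", "f2", "f3", "f4", "f5", "f6"]:
    cache.add(flow)
print("f1" in cache, "f5" in cache)  # True True: f1 survives in standby
```
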
  • Bayesian Classifiers Programmed in SQL

    Page(s): 139 - 144
    PDF (807 KB) | HTML

    The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.

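    In the spirit of the paper's theme, the sketch below expresses a one-feature Gaussian naive Bayes classifier as SQL over SQLite: model computation is a GROUP BY, and scoring is a join. The table names and data are invented, and the paper's table layouts, optimizations, and K-means-based classifier are not reproduced.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("LN", 1, math.log)  # SQLite may lack a built-in ln()

conn.executescript("""
CREATE TABLE train(cls TEXT, x REAL);
INSERT INTO train VALUES
  ('pos', 5.0), ('pos', 6.0), ('pos', 7.0),
  ('neg', 1.0), ('neg', 2.0), ('neg', 3.0);
CREATE TABLE test(id INTEGER, x REAL);
INSERT INTO test VALUES (1, 5.5), (2, 1.5);
-- Model computation: per-class prior, mean, and variance via GROUP BY.
CREATE TABLE model AS
SELECT cls,
       COUNT(*) * 1.0 / (SELECT COUNT(*) FROM train) AS prior,
       AVG(x)                                        AS mean,
       AVG(x * x) - AVG(x) * AVG(x)                  AS var
FROM train GROUP BY cls;
""")

# Scoring: Gaussian log-likelihood plus log prior for every (row, class)
# pair; the top-scoring class per test id is the prediction.
rows = conn.execute("""
SELECT t.id, m.cls,
       LN(m.prior)
       - 0.5 * LN(2 * 3.141592653589793 * m.var)
       - (t.x - m.mean) * (t.x - m.mean) / (2 * m.var) AS score
FROM test t CROSS JOIN model m
ORDER BY t.id, score DESC
""").fetchall()

for tid, cls, score in rows:
    print(tid, cls, round(score, 3))
```
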
  • Deterministic Column-Based Matrix Decomposition

    Page(s): 145 - 149
    PDF (733 KB) | HTML

    In this paper, we propose a deterministic column-based matrix decomposition method. Conventional column-based matrix decomposition (CX) computes the columns by randomly sampling columns of the data matrix. Instead, the newly proposed method (termed CX_D) selects columns in a deterministic manner, which well approximates the singular value decomposition. Experimental results on three real-world data sets demonstrate the power and the advantages of the proposed method.

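    As a generic illustration of deterministic column selection (a common heuristic, not necessarily the paper's CX_D criterion), the sketch below scores columns by their leverage in the top-k right singular subspace, keeps the top-c columns deterministically, and measures the CX reconstruction error.

```python
import numpy as np

# Synthetic low-rank-ish data matrix (invented for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20)) @ rng.standard_normal((20, 30))

k, c = 5, 8                                # target rank, columns to keep
_, _, Vt = np.linalg.svd(A, full_matrices=False)
leverage = (Vt[:k] ** 2).sum(axis=0)       # leverage score of each column
cols = np.sort(np.argsort(leverage)[-c:])  # deterministic top-c selection

C = A[:, cols]                             # CX decomposition: A ~ C X
X = np.linalg.pinv(C) @ A                  # least-squares coefficients
rel_err = np.linalg.norm(A - C @ X) / np.linalg.norm(A)
print("columns:", cols.tolist(), "relative error:", round(float(rel_err), 3))
```
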
  • 2009 Reviewers List

    Page(s): 150 - 156
    PDF (101 KB)
    Freely Available from IEEE
  • Call for Papers for New IEEE Transactions on Affective Computing

    Page(s): 157
    PDF (145 KB)
    Freely Available from IEEE
  • 7 Great Reasons for Joining the IEEE Computer Society [advertisement]

    Page(s): 158
    PDF (605 KB)
    Freely Available from IEEE
  • IEEE Computer Society Computing Now [advertisement]

    Page(s): 159
    PDF (84 KB)
    Freely Available from IEEE
  • IEEE and IEEE Computer Society 2010 Student Member Package

    Page(s): 160
    PDF (151 KB)
    Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    PDF (209 KB)
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (152 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University