IEEE Transactions on Knowledge and Data Engineering

Issue 2 • Feb. 2012

  • [Front cover]

    Page(s): c1
    PDF (116 KB)
    Freely Available from IEEE
  • [Cover 2]

    Page(s): c2
    PDF (156 KB)
    Freely Available from IEEE
  • State of the Journal

    Page(s): 193 - 196
    PDF (205 KB)
    Freely Available from IEEE
  • A Multidimensional Sequence Approach to Measuring Tree Similarity

    Page(s): 197 - 208
    PDF (1518 KB) | HTML

    The tree is one of the most common and well-studied data structures in computer science, and measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial due to the inherent complexity of trees and the ensuing large search space. The tree kernel, a state-of-the-art similarity measure for trees, represents trees as vectors in a feature space and measures similarity in that space; when different features are used, different algorithms are required. Tree edit distance is another widely used similarity measure for trees. It measures similarity through the edit operations needed to transform one tree into another. Without any restrictions on edit operations, the computation cost is too high to be applicable to large volumes of data. To improve the efficiency of tree edit distance, some approximations have been introduced, but their effectiveness can be compromised. In this paper, a novel approach to measuring tree similarity is presented. Trees are represented as multidimensional sequences, and their similarity is measured on the basis of these sequence representations. Multidimensional sequences have sequential dimensions and spatial dimensions. We measure the sequential similarity by the all-common-subsequences similarity measure or the longest-common-subsequence measure, and measure the spatial similarity by dynamic time warping; we then combine them to give a measure of tree similarity. A brute-force algorithm for calculating the similarity would have high computational cost. In the spirit of dynamic programming, two efficient algorithms are designed for calculating the similarity, both with quadratic time complexity. The new measures are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in k-nearest neighbor similarity search, using three different data sets from natural language processing and information retrieval. Experimental results show that the new measures outperform the benchmark measures consistently and significantly.
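
    For concreteness, the classical longest-common-subsequence dynamic program, one of the sequential measures the paper builds on, can be sketched as follows (a minimal Python illustration; the function names and the normalization are ours, not the paper's):

        def lcs_length(a, b):
            # Classical dynamic program: O(len(a) * len(b)) time.
            m, n = len(a), len(b)
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    if a[i - 1] == b[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1] + 1
                    else:
                        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            return dp[m][n]

        def lcs_similarity(a, b):
            # One common normalization into [0, 1]; illustrative only.
            return lcs_length(a, b) / max(len(a), len(b)) if (a or b) else 1.0

    The paper combines such a sequential score with a dynamic-time-warping score over the spatial dimensions; the combination itself is not shown here.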

  • Agglomerative Mean-Shift Clustering

    Page(s): 209 - 219
    PDF (2542 KB) | HTML

    Mean-Shift (MS) is a powerful nonparametric clustering method. Although it can achieve good accuracy, its computational cost is high even on moderately sized data sets. In this paper, for the purpose of algorithmic speedup, we develop an agglomerative MS clustering method, along with a performance analysis. Our method, named Agglo-MS, is built upon an iterative query-set compression mechanism motivated by the quadratic bounding optimization nature of the MS algorithm. The whole framework can be efficiently implemented in linear running time. We then extend Agglo-MS into an incremental version that performs comparably to its batch counterpart. The efficiency and accuracy of Agglo-MS are demonstrated by extensive comparative experiments on synthetic and real data sets.
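
    As background, the core mean-shift iteration that Agglo-MS accelerates moves each query point to the kernel-weighted mean of the data until it reaches a density mode. A minimal sketch with a Gaussian kernel (illustrative background only, not the Agglo-MS compression mechanism itself):

        import numpy as np

        def mean_shift_point(x, data, bandwidth, iters=100, tol=1e-5):
            # Move x toward the kernel-weighted mean of the data points
            # until it converges to a mode of the density estimate.
            for _ in range(iters):
                w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * bandwidth ** 2))
                x_new = (w[:, None] * data).sum(axis=0) / w.sum()
                if np.linalg.norm(x_new - x) < tol:
                    break
                x = x_new
            return x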

  • Answering General Time-Sensitive Queries

    Page(s): 220 - 235
    PDF (1411 KB)

    Time is an important dimension of relevance for a large number of searches, such as those over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for “recency” queries that target recent documents. We propose a more general framework for handling time-sensitive queries, automatically identifying the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over a news archive and for incorporating this information into the retrieval process. We show that our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
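
    As a toy illustration of the general idea (not the paper's actual scoring model), a document's topical score can be blended with a boost for publication times that fall inside the query's detected intervals; the weight parameter below is hypothetical:

        def blended_score(topic_score, pub_time, relevant_intervals, weight=0.5):
            # Hard 0/1 temporal boost for illustration; the paper integrates
            # the temporal aspect into ranking far more smoothly than this.
            in_interval = any(s <= pub_time <= e for s, e in relevant_intervals)
            return (1 - weight) * topic_score + weight * (1.0 if in_interval else 0.0)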

  • BibPro: A Citation Parser Based on Sequence Alignment

    Page(s): 236 - 250
    Multimedia
    PDF (2418 KB) | HTML

    The dramatic increase in the number of academic publications has led to growing demand for efficient organization of resources that meets researchers' needs. As a result, a number of network services have compiled databases from the public resources scattered over the Internet. However, different conferences and journals adopt different citation styles, and accurately extracting metadata from a citation string formatted in any of thousands of styles is an interesting problem that has attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a citation parser called BibPro that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieves over 90 percent accuracy on each benchmark. Even with citations and associated metadata retrieved from the web as training data, our experiments show that BibPro still achieves reasonable performance.
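
    The sequence-alignment machinery underlying this style of parser is the classical Needleman-Wunsch dynamic program; a minimal global-alignment scorer is sketched below (the scoring parameters are illustrative, and BibPro's actual template matching is more elaborate):

        def align_score(a, b, match=2, mismatch=-1, gap=-1):
            # Needleman-Wunsch global alignment score via dynamic programming.
            m, n = len(a), len(b)
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(1, m + 1):
                dp[i][0] = i * gap
            for j in range(1, n + 1):
                dp[0][j] = j * gap
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    s = match if a[i - 1] == b[j - 1] else mismatch
                    dp[i][j] = max(dp[i - 1][j - 1] + s,   # match/substitution
                                   dp[i - 1][j] + gap,     # gap in b
                                   dp[i][j - 1] + gap)     # gap in a
            return dp[m][n]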

  • Discover Dependencies from Data—A Review

    Page(s): 251 - 264
    PDF (1268 KB) | HTML

    Functional and inclusion dependency discovery is important to knowledge discovery, database semantics analysis, database design, and data quality assessment. Motivated by the importance of dependency discovery, this paper reviews the methods for functional dependency, conditional functional dependency, approximate functional dependency, and inclusion dependency discovery in relational databases and a method for discovering XML functional dependencies.
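
    For readers new to the topic, the object being discovered is easy to state: a functional dependency X → Y holds in a relation iff no two tuples agree on X but disagree on Y. A naive single-pass check is sketched below (illustrative only; the surveyed algorithms search the space of candidate dependencies far more cleverly):

        def fd_holds(rows, lhs, rhs):
            # rows: list of dicts mapping attribute name -> value.
            # The FD lhs -> rhs holds iff every pair of rows that agrees
            # on the lhs attributes also agrees on the rhs attributes.
            seen = {}
            for row in rows:
                key = tuple(row[a] for a in lhs)
                val = tuple(row[a] for a in rhs)
                if key in seen and seen[key] != val:
                    return False
                seen[key] = val
            return True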

  • Effective and Efficient Shape-Based Pattern Detection over Streaming Time Series

    Page(s): 265 - 278
    PDF (1376 KB) | HTML

    Existing distance measures for time series, such as the Euclidean distance, DTW, and EDR, are inadequate in handling certain degrees of amplitude shifting and scaling in the data. We propose a novel distance measure for time series, the Spatial Assembling Distance (SpADe), that is able to handle noise, shifting, and scaling in both the temporal and amplitude dimensions. We further apply SpADe to streaming pattern detection, which is very useful in trend-related analysis, sensor networks, and video surveillance. Our experimental results on real time-series data sets show that SpADe is an effective distance measure for time series. Moreover, SpADe achieves high accuracy and efficiency for continuous pattern detection in streaming time series.
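
    For reference, DTW, one of the baselines mentioned above, is the classical quadratic dynamic program sketched below (SpADe itself, which assembles local pattern matches, is not shown):

        import math

        def dtw(a, b):
            # Classical dynamic time warping distance between two 1-D series.
            m, n = len(a), len(b)
            dp = [[math.inf] * (n + 1) for _ in range(m + 1)]
            dp[0][0] = 0.0
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                          dp[i][j - 1],      # deletion
                                          dp[i - 1][j - 1])  # match
            return dp[m][n]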

  • Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data

    Page(s): 279 - 294
    PDF (1107 KB) | HTML

    Discriminative patterns can provide valuable insights into data sets with class labels that may not be available from the individual features or from predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional data sets. However, for dense and high-dimensional data sets, they must use high support thresholds to produce complete results within limited time, and thus may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery against the efficient discovery of low-support discriminative patterns from such data sets. We propose a family of antimonotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional data sets. Experiments on both synthetic data sets and a cancer gene expression data set demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are discovered only by SupMaxPair from the cancer gene expression data set are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery. The code and data set for this paper are available at http://vk.cs.umn.edu/SMP/.
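
    The abstract does not spell out the SupMaxK definition, so as generic background only, a common discriminative-pattern score is the support difference between the two classes (the standard DiffSup-style baseline, not SupMaxPair), sketched below:

        def support(pattern, transactions):
            # pattern: set of items; transactions: list of item sets.
            # Fraction of transactions that contain every item of the pattern.
            return sum(1 for t in transactions if pattern <= t) / len(transactions)

        def diff_sup(pattern, class_a, class_b):
            # Absolute support difference between the two classes.
            return abs(support(pattern, class_a) - support(pattern, class_b))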

  • On Group Nearest Group Query Processing

    Page(s): 295 - 308
    PDF (1426 KB) | HTML

    Given a data point set D, a query point set Q, and an integer k, the Group Nearest Group (GNG) query finds a subset ω (|ω| ≤ k) of points from D such that the total distance from all points in Q to their nearest points in ω is not greater than that of any other subset ω' (|ω'| ≤ k) of points in D. The GNG query is a partition-based clustering problem that arises in many real applications and is NP-hard. In this paper, the Exhaustive Hierarchical Combination (EHC) algorithm and the Subset Hierarchical Refinement (SHR) algorithm are developed for GNG query processing. While EHC is able to provide the optimal solution for k = 2, SHR is an efficient approximate approach that combines database techniques with a local search heuristic. Our approaches focus on minimizing the access and evaluation of size-k subsets of D, since the number of such subsets is exponentially greater than |D|. To do so, hierarchical blocks of data points at a high level are used to find an intermediate solution, which is then refined by following the guided search direction at a low level so as to prune irrelevant subsets. Comprehensive experiments on both real and synthetic data sets demonstrate the superiority of SHR in terms of efficiency and quality.
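
    To make the problem statement concrete, the exhaustive baseline that EHC and SHR are designed to avoid enumerates every size-k subset of D (a sketch, exponential in k; names are ours):

        from itertools import combinations
        import math

        def total_dist(subset, Q):
            # Sum over query points of the distance to their nearest subset point.
            return sum(min(math.dist(q, p) for p in subset) for q in Q)

        def gng_brute_force(D, Q, k):
            # Exhaustive baseline over all size-k subsets of D. A larger
            # candidate set never increases the total distance, so examining
            # subsets of size exactly k suffices for the |omega| <= k condition.
            return min(combinations(D, k), key=lambda s: total_dist(s, Q))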

  • On the Deep Order-Preserving Submatrix Problem: A Best Effort Approach

    Page(s): 309 - 325
    PDF (1769 KB) | HTML

    The order-preserving submatrix (OPSM) has been widely accepted as a biologically meaningful cluster model, capturing the general tendency of gene expression across a subset of experiments. In an OPSM, the expression levels of all genes induce the same linear ordering of the experiments. The OPSM problem is to discover the statistically significant OPSMs in a given data matrix. The problem is reducible to a special case of the sequential pattern mining problem, in which a pattern and its supporting sequences uniquely specify an OPSM. Unfortunately, existing methods do not scale to the massive data sets, containing thousands of experiments and hundreds of thousands of genes, that are common in today's gene expression analysis. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs in their discovery and are completely pruned off by existing methods. However, it is of particular interest to biologists to determine small groups of genes that are tightly coregulated across many experiments, and some pathways or processes may require as few as two genes to act in concert. In this paper, we study the discovery of deep OPSMs from massive data sets. We propose a novel best-effort mining framework, Kiwi, that uses two parameters, k and w, to bound the available computational resources and the search space explored, and does what it can within those bounds to find as many deep OPSMs as possible. Extensive biological and computational evaluations on real data sets demonstrate the validity and importance of the deep OPSM problem, and the efficiency and effectiveness of the Kiwi mining framework.
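
    The OPSM condition itself is simple to verify once a candidate row and column set is fixed: every selected row must sort the selected columns into the same order. A minimal check (assuming distinct expression values within a row; function and argument names are ours):

        def is_opsm(matrix, rows, cols):
            # The rows form an OPSM on cols iff sorting cols by expression
            # value yields the same permutation for every row.
            orderings = {tuple(sorted(cols, key=lambda c: matrix[r][c]))
                         for r in rows}
            return len(orderings) == 1

    The hard part, which Kiwi addresses, is searching the exponential space of row and column subsets, not this check.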

  • On the Spectral Characterization and Scalable Mining of Network Communities

    Page(s): 326 - 337
    PDF (1565 KB) | HTML

    Network communities are groups of vertices within which the connecting links are dense but between which they are sparse. A network community mining problem (NCMP for short) is the problem of finding all such communities in a given network. A wide variety of applications can be formulated as NCMPs, ranging from social and biological network analysis to web mining and searching. Many algorithms addressing NCMPs have been developed, and most of them fall into the categories of either optimization-based or heuristic methods. Distinct from existing studies, the work presented in this paper explores the notion of network communities and their properties based on the dynamics of a naturally introduced stochastic model. In the paper, a relationship between the hierarchical community structure of a network and the local mixing properties of such a stochastic model is established using large-deviation theory. Topological information regarding the community structures hidden in networks can be inferred from their spectral signatures. Based on this relationship, the work proposes a general framework for characterizing, analyzing, and mining network communities. Utilizing the two basic properties of metastability, i.e., being locally uniform and temporarily fixed, an efficient implementation of the framework, called the LM algorithm, is developed that can scalably mine communities hidden in large-scale networks. The effectiveness and efficiency of the LM algorithm are theoretically analyzed as well as experimentally validated.
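
    As generic background on spectral signatures (not the LM algorithm itself), community structure is reflected in the leading nontrivial eigenvectors of the random-walk transition matrix of a graph. A sketch, assuming a graph with no isolated vertices:

        import numpy as np

        def spectral_signature(adj, num_vectors=3):
            # Row-normalize the adjacency matrix into the transition matrix
            # of the natural random walk, then take the eigenvectors just
            # below the trivial eigenvalue 1.
            P = adj / adj.sum(axis=1, keepdims=True)
            vals, vecs = np.linalg.eig(P)
            idx = np.argsort(-vals.real)
            keep = idx[1:num_vectors + 1]
            return vals.real[keep], vecs.real[:, keep]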

  • Outsourced Similarity Search on Metric Data Assets

    Page(s): 338 - 352
    PDF (1211 KB) | HTML

    This paper considers a cloud computing setting in which similarity querying of metric data is outsourced to a service provider. The data is to be revealed only to trusted users, not to the service provider or anyone else. Users query the server for the data objects most similar to a query example. Outsourcing offers the data owner scalability and a low initial investment. The need for privacy may be due to the data being sensitive (e.g., in medicine), valuable (e.g., in astronomy), or otherwise confidential. Given this setting, the paper presents techniques that transform the data before supplying it to the service provider, which then answers similarity queries on the transformed data. Our techniques provide interesting trade-offs between query cost and accuracy. They are further extended to offer an intuitive privacy guarantee. Empirical studies with real data demonstrate that the techniques are capable of offering privacy while enabling efficient and accurate processing of similarity queries.

  • Privacy Preserving Decision Tree Learning Using Unrealized Data Sets

    Page(s): 353 - 364
    PDF (11062 KB)

    Privacy preservation is important for machine learning and data mining, but measures designed to protect private information often result in a trade-off: reduced utility of the training samples. This paper introduces a privacy-preserving approach that can be applied to decision tree learning without concomitant loss of accuracy. It describes an approach to preserving the privacy of collected data samples in cases where information from the sample database has been partially lost. The approach converts the original sample data sets into a group of unreal data sets, from which the original samples cannot be reconstructed without the entire group. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. This novel approach can be applied directly to data storage as soon as the first sample is collected, and it is compatible with other privacy-preserving approaches, such as cryptography, for extra protection.

  • Subspace Similarity Search under Lp-Norm

    Page(s): 365 - 382
    Multimedia
    PDF (1304 KB) | HTML

    Similarity search is widely used in many applications, such as information retrieval, image data analysis, and time-series matching. Previous work on similarity search usually considers the search problem in the full space. In this paper, however, we tackle subspace similarity search, which finds all data objects that match a query object in a subspace instead of the original full space. In particular, the query object can specify an arbitrary subspace with an arbitrary number of dimensions. Due to the exponential number of possible subspaces that users can specify, we introduce an efficient and effective pruning technique that assigns scores to data objects with respect to pivots and prunes candidates via these scores. We propose an effective multipivot-based method to preprocess data objects by selecting appropriate pivots, with the entire procedure guided by a formal cost model so that the pruning power is maximized. Scores of each data object are then organized in sorted lists to facilitate efficient subspace similarity search. Furthermore, many real-world application data, such as image databases, time-series data, and sensory data, often contain noise, which can be modeled with uncertain objects. Unlike for certain data, efficient query processing on uncertain data is more challenging due to the intensive computation of probability confidences. Thus, it is also crucial to answer subspace queries efficiently and effectively over uncertain objects. Specifically, we define a novel query, the probabilistic subspace range query (PSRQ), in the uncertain database, which finds objects within a given distance of a query object in any subspace with high probability. To answer this query, we extend our pruning techniques for precise data to answering PSRQs in arbitrary subspaces. Extensive experiments demonstrate the performance of our proposed approaches.
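
    The query primitive is straightforward to state even though answering it efficiently is not: an Lp distance restricted to the query-specified dimensions, with a naive linear scan as the baseline that the pivot-based pruning is designed to beat (a sketch; names are ours):

        def subspace_dist(x, y, dims, p=2.0):
            # Lp distance over only the query-specified dimensions.
            return sum(abs(x[d] - y[d]) ** p for d in dims) ** (1.0 / p)

        def subspace_range_query(data, q, dims, radius, p=2.0):
            # Naive linear scan; the paper's sorted pivot-score lists
            # prune most of these distance computations.
            return [x for x in data if subspace_dist(x, q, dims, p) <= radius]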

  • IEEE Computer Society OnlinePlus Coming Soon to TKDE

    Page(s): 383
    PDF (221 KB)
    Freely Available from IEEE
  • What's new in Transactions [advertisement]

    Page(s): 384
    PDF (344 KB)
    Freely Available from IEEE
  • [Cover 3]

    Page(s): c3
    PDF (156 KB)
    Freely Available from IEEE
  • [Cover 4]

    Page(s): c4
    PDF (116 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University