Knowledge and Data Engineering, IEEE Transactions on

Issue 1 • Jan. 2013

Displaying Results 1 - 20 of 20
  • A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data

    Page(s): 1 - 14
    PDF (3812 KB)

    Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning-tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments compare FAST with several representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.

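    The two-step scheme described in the abstract lends itself to a short illustration. Below is a minimal Python sketch, not the authors' implementation: it uses absolute Pearson correlation in place of the paper's information-theoretic feature weights, cuts the minimum spanning tree at an arbitrary threshold to form clusters of mutually similar features, and keeps the feature most correlated with the target from each cluster.

```python
# Minimal MST-based feature clustering selector, loosely inspired by the
# FAST idea; the correlation weights and fixed threshold are simplifying
# assumptions, NOT the published algorithm.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_feature_selection(X, y, threshold=0.7):
    # Pairwise feature "distance": 1 - |Pearson correlation|.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dist = np.maximum(1.0 - corr, 1e-9)   # tiny floor so the MST sees every edge
    np.fill_diagonal(dist, 0.0)
    # Build an MST over the complete feature graph, then cut long edges:
    # each removed edge splits off a cluster of mutually similar features.
    mst = minimum_spanning_tree(dist).toarray()
    mst[mst > threshold] = 0.0
    n_clusters, labels = connected_components(mst != 0, directed=False)
    # From each cluster, keep the feature most correlated with the target.
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return sorted(int(np.argmax(np.where(labels == c, relevance, -1.0)))
                  for c in range(n_clusters))
```

    On an (n_samples, n_features) matrix X with labels y, this returns one column index per feature cluster, so the size of the selected subset is driven by the threshold rather than fixed in advance.
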
  • A Graph-Based Consensus Maximization Approach for Combining Multiple Supervised and Unsupervised Models

    Page(s): 15 - 28
    PDF (2170 KB) | HTML

    Ensemble learning has emerged as a powerful method for combining multiple models. Well-known methods such as bagging, boosting, and model averaging have been shown to improve accuracy and robustness over single models. However, due to the high cost of manual labeling, it is often hard to obtain sufficient reliable labeled data for effective training. Meanwhile, large amounts of unlabeled data are usually available, from which we can readily obtain multiple unsupervised models. Although unsupervised models do not directly generate a class label prediction for each object, they provide useful constraints on the joint predictions for a set of related objects. Therefore, incorporating these unsupervised models into the ensemble of supervised models can lead to better prediction performance. In this paper, we study ensemble learning with outputs from multiple supervised and unsupervised models, a topic where little work has been done. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors smoothness of the predictions over the graph but penalizes deviations from the initial labeling provided by the supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes and prove the optimality of the solution. The proposed method can be interpreted as conducting a constrained embedding in a transformed space, or as a ranking on the graph. Experimental results on different applications with heterogeneous data sources demonstrate the benefits of the proposed method over existing alternatives. (More information, data, and code are available at http://www.cse.buffalo.edu/~jing/integrate.htm.)

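    As a structural illustration of the iterative propagation step, here is a hedged Python sketch of consensus maximization on an object-group bipartite graph. The averaging-with-penalty updates follow the scheme outlined in the abstract, but the function, its parameters (e.g., alpha), and the data layout are simplifying assumptions rather than the published algorithm.

```python
import numpy as np

def consensus_maximization(n_objects, groups, initial, n_classes,
                           alpha=2.0, n_iter=50):
    """groups: list of object-index lists, one per group node (a predicted
    class from a classifier, or a cluster from a clustering).  initial:
    dict {group index: class distribution} for groups produced by
    supervised models.  Assumes every object appears in some group."""
    g = len(groups)
    q = np.full((g, n_classes), 1.0 / n_classes)          # group estimates
    u = np.full((n_objects, n_classes), 1.0 / n_classes)  # object estimates
    membership = [[i for i, m in enumerate(groups) if x in m]
                  for x in range(n_objects)]
    for _ in range(n_iter):
        for i, members in enumerate(groups):
            avg = u[members].mean(axis=0)
            if i in initial:      # supervised group: pulled toward its label
                q[i] = (len(members) * avg + alpha * np.asarray(initial[i])) \
                       / (len(members) + alpha)
            else:                 # unsupervised group: neighborhood average
                q[i] = avg
        for x in range(n_objects):
            u[x] = q[membership[x]].mean(axis=0)
    return u                      # row x = consensus distribution of object x
```
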
  • A Survey of XML Tree Patterns

    Page(s): 29 - 46
    PDF (1412 KB) | HTML

    With XML becoming a ubiquitous language for data interoperability in various domains, efficiently querying XML data is a critical issue. This has led to the design of algebraic frameworks based on tree-shaped patterns akin to the tree-structured data model of XML. Tree patterns are graphic representations of queries over data trees; they are matched against an input data tree to answer a query. Since the turn of the 21st century, a substantial research effort has focused on tree pattern models and matching optimization (a primordial issue). This paper is a comprehensive survey of these topics, in which we outline and compare the various features of tree patterns. We also review and discuss the two main families of approaches for optimizing tree pattern matching, namely pattern tree minimization and holistic matching. Finally, we present actual tree pattern-based developments, to provide a global overview of this significant research topic.

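    To make the matching problem the survey studies concrete, the following Python sketch tests whether a small pattern with child (/) and descendant (//) edges embeds at a data tree node. It is purely illustrative; the holistic matching algorithms reviewed in the paper are far more efficient than this naive recursion.

```python
class DNode:                 # data tree node
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

class PNode:                 # pattern node; edges = [(child pattern, axis)]
    def __init__(self, label, edges=()):
        self.label, self.edges = label, list(edges)

def desc_or_self(node):
    yield node
    for c in node.children:
        yield from desc_or_self(c)

def embeds_at(p, d):
    """Can pattern node p be mapped onto data node d?"""
    if p.label != d.label:
        return False
    for child, axis in p.edges:
        pool = d.children if axis == "/" else \
               [x for c in d.children for x in desc_or_self(c)]
        if not any(embeds_at(child, x) for x in pool):
            return False
    return True

# Pattern: node a with a child b and a descendant c,
# matched against the data tree a(b, d(c)).
data = DNode("a", [DNode("b"), DNode("d", [DNode("c")])])
pattern = PNode("a", [(PNode("b"), "/"), (PNode("c"), "//")])
print(embeds_at(pattern, data))   # True
```
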
  • Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology and Rule-Based Model

    Page(s): 47 - 61
    PDF (1627 KB) | HTML

    The recent increase in the use of video-based applications has revealed the need for extracting the content of videos. Raw data and low-level features alone are not sufficient to fulfill the user's needs; that is, a deeper understanding of the content at the semantic level is required. Currently, manual techniques, which are inefficient, subjective, costly in time, and limited in querying capability, are used to bridge the gap between low-level representative features and high-level semantic content. Here, we propose a semantic content extraction system that allows the user to query and retrieve objects, events, and concepts that are extracted automatically. We introduce an ontology-based fuzzy video semantic content model that uses spatial/temporal relations in event and concept definitions. This metaontology definition provides a wide-domain applicable rule construction standard that allows the user to construct an ontology for a given domain. In addition to domain ontologies, we use additional rule definitions (without using an ontology) to lower the spatial relation computation cost and to define some complex situations more effectively. The proposed framework has been fully implemented and tested on three different domains. We have obtained satisfactory precision and recall rates for object, event, and concept extraction.

  • Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm

    Page(s): 62 - 75
    PDF (2862 KB)

    In comparison with hard clustering methods, in which a pattern belongs to a single cluster, fuzzy clustering algorithms allow patterns to belong to all clusters with differing degrees of membership. This is important in domains such as sentence clustering, since a sentence is likely to be related to more than one theme or topic present within a document or set of documents. However, because most sentence similarity measures do not represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering. This paper presents a novel fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects. The algorithm uses a graph representation of the data and operates in an Expectation-Maximization framework in which the graph centrality of an object is interpreted as a likelihood. Results of applying the algorithm to sentence clustering tasks demonstrate that it is capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential use in a variety of text mining tasks. We also include results of applying the algorithm to benchmark data sets in several other domains.

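    The interplay of graph centrality and EM described above can be sketched briefly. In this hedged Python illustration, a PageRank-style centrality computed on membership-scaled edge weights plays the role of a per-cluster likelihood; the initialization, damping, and update details are simplifying assumptions, not the published algorithm.

```python
import numpy as np

def fuzzy_relational_cluster(S, k, n_iter=30, damping=0.85, seed=0):
    """S: symmetric, nonnegative (n x n) pairwise-similarity matrix."""
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(k), size=n)      # soft memberships, rows sum to 1
    mix = np.full(k, 1.0 / k)                  # cluster mixing weights
    for _ in range(n_iter):
        like = np.empty((n, k))
        for c in range(k):
            # Edge weights scaled by both endpoints' membership in cluster c.
            A = S * np.outer(W[:, c], W[:, c])
            col = A.sum(axis=0)
            P = A / np.where(col > 0, col, 1.0)   # column-stochastic
            r = np.full(n, 1.0 / n)
            for _ in range(50):                   # PageRank-style iteration
                r = damping * P @ r + (1 - damping) / n
            like[:, c] = r / r.sum()              # centrality as "likelihood"
        W = like * mix                            # E-step (Bayes rule)
        W /= W.sum(axis=1, keepdims=True)
        mix = W.mean(axis=0)                      # M-step
    return W
```
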
  • Distributed Processing of Probabilistic Top-k Queries in Wireless Sensor Networks

    Page(s): 76 - 91
    PDF (1547 KB) | HTML

    In this paper, we introduce the notions of sufficient set and necessary set for distributed processing of probabilistic top-k queries in cluster-based wireless sensor networks. These two concepts have properties that facilitate localized data pruning in clusters. Accordingly, we develop a suite of algorithms, namely, sufficient set-based (SSB), necessary set-based (NSB), and boundary-based (BB), for intercluster query processing with bounded rounds of communication. Moreover, in response to dynamic changes of data distribution in the network, we develop an adaptive algorithm that dynamically switches among the three proposed algorithms to minimize the transmission cost. We show the applicability of sufficient sets and necessary sets to wireless sensor networks with both two-tier hierarchical and tree-structured network topologies. Experimental results show that the proposed algorithms reduce data transmissions significantly and incur only a small constant number of rounds of data communication. The experimental results also demonstrate the superiority of the adaptive algorithm, which achieves near-optimal performance under various conditions.

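    The flavor of localized pruning can be shown with a toy example. The sketch below ranks uncertain tuples by the simple expected score p * v, under which each node's local top-k already forms a sufficient set for the global answer; the paper's sufficient and necessary sets address richer probabilistic top-k semantics, so treat this only as a structural analogy.

```python
import heapq

def local_prune(readings, k):
    # readings: list of (tuple_id, value, probability) at one sensor node.
    # Each node transmits at most k tuples, ranked by expected score.
    return heapq.nlargest(k, readings, key=lambda t: t[1] * t[2])

def cluster_head_topk(per_node_readings, k):
    candidates = []                     # union of all transmitted tuples
    for readings in per_node_readings:
        candidates.extend(local_prune(readings, k))
    return heapq.nlargest(k, candidates, key=lambda t: t[1] * t[2])

nodes = [[("a", 10.0, 0.9), ("b", 50.0, 0.1)],
         [("c", 8.0, 0.8), ("d", 7.0, 0.99)]]
print(cluster_head_topk(nodes, 2))   # -> [('a', 10.0, 0.9), ('d', 7.0, 0.99)]
```
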
  • Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled Data Warehouse

    Page(s): 92 - 105
    PDF (1359 KB) | HTML

    Many methods are available for integrating information-source reliability into an uncertainty representation, but few works focus on the problem of evaluating this reliability. Yet data reliability and confidence are essential components of a data warehousing system, as they influence subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a set of criteria using the theory of belief functions. Customizable criteria and insightful decisions are provided. The illustrative example uses real-world data from Sym'Previus, a data warehouse oriented toward predictive microbiology.

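    A tiny worked example of the underlying machinery: fusing per-criterion evidence about an item's reliability with Dempster's rule of combination on the two-element frame {reliable, unreliable}. The criteria and mass values below are invented for illustration; the paper builds a more general, customizable scheme on the theory of belief functions.

```python
# Frame of discernment: R = reliable, U = unreliable, RU = {R, U} (ignorance).
R, U, RU = frozenset("R"), frozenset("U"), frozenset("RU")

def dempster(m1, m2):
    """Dempster's rule: multiply masses of intersecting focal sets,
    then renormalize by the total non-conflicting mass."""
    out, conflict = {R: 0.0, U: 0.0, RU: 0.0}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                out[inter] += wa * wb
            else:
                conflict += wa * wb
    return {s: v / (1.0 - conflict) for s, v in out.items()}

# One mass function per criterion (e.g., source reputation, data freshness);
# the numbers are made up for the example.
criteria = [{R: 0.6, U: 0.1, RU: 0.3},
            {R: 0.2, U: 0.3, RU: 0.5},
            {R: 0.5, U: 0.0, RU: 0.5}]
fused = criteria[0]
for m in criteria[1:]:
    fused = dempster(fused, m)
print(fused)   # mass committed to "reliable", "unreliable", and "unknown"
```
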
  • Large Graph Analysis in the GMine System

    Page(s): 106 - 118
    PDF (1832 KB) | Multimedia

    Current applications produce graphs on the order of hundreds of thousands of nodes and millions of edges. To take advantage of such graphs, one must be able to find patterns, outliers, and communities. These tasks are better performed in an interactive environment, where human expertise can guide the process. For large graphs, though, there are some challenges: the excessive processing requirements are prohibitive, and drawing hundreds of thousands of nodes results in cluttered images that are hard to comprehend. To cope with these problems, we propose a framework suited to any kind of tree-like graph visual design. GMine integrates 1) a representation for graphs organized as hierarchies of partitions (the concepts of SuperGraph and Graph-Tree) and 2) a graph summarization methodology (CEPS). Our graph representation deals with the problem of tracing the connections of a graph hierarchy with sublinear complexity, allowing one to grasp the neighborhood of a single node or of a group of nodes in a single click. As a proof of concept, the visual environment of GMine is instantiated as a system in which large graphs can be investigated globally and locally.

  • Maximum Likelihood Estimation from Uncertain Data in the Belief Function Framework

    Page(s): 119 - 130
    PDF (877 KB) | Multimedia

    We consider the problem of parameter estimation in statistical models in the case where data are uncertain and represented as belief functions. The proposed method is based on the maximization of a generalized likelihood criterion, which can be interpreted as a degree of agreement between the statistical model and the uncertain observations. We propose a variant of the EM algorithm that iteratively maximizes this criterion. As an illustration, the method is applied to uncertain data clustering using finite mixture models, in the cases of categorical and continuous attributes.

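    For the special case of uncertain class labels, the generalized-likelihood idea reduces to a familiar form: the E-step responsibilities are weighted by each observation's plausibility over the mixture components. The Python sketch below implements that simplified univariate variant; it is an assumption-laden illustration, not the paper's full belief-function EM.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def evidential_em(x, pl, k=2, n_iter=100):
    """x: (n,) observations.  pl: (n, k) label plausibilities per point
    (all-ones row = unsupervised, one-hot row = fully supervised)."""
    mu = np.quantile(x, np.linspace(0.2, 0.8, k))
    var, pi = np.full(k, x.var()), np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: usual responsibilities, shaped by label plausibilities.
        t = pl * pi * np.stack([gauss(x, mu[c], var[c]) for c in range(k)], 1)
        t /= t.sum(axis=1, keepdims=True)
        # M-step: standard weighted Gaussian updates.
        n_c = t.sum(axis=0)
        mu = (t * x[:, None]).sum(axis=0) / n_c
        var = (t * (x[:, None] - mu) ** 2).sum(axis=0) / n_c
        pi = n_c / len(x)
    return mu, var, pi
```
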
  • Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case Studies

    Page(s): 131 - 144
    PDF (1705 KB) | HTML

    In this paper, we study sparsity-exploiting Mastermind algorithms for attacking the privacy of an entire database of character strings or vectors, such as DNA strings, movie ratings, or social network friendship data. Based on reductions to nonadaptive group testing, our methods are able to take advantage of minimal amounts of privacy leakage, such as contained in a single bit that indicates if two people in a medical database have any common genetic mutations, or if two people have any common friends in an online social network. We analyze our Mastermind attack algorithms using theoretical characterizations that provide sublinear bounds on the number of queries needed to clone the database, as well as experimental tests on genomic information, collaborative filtering data, and online social networks. By taking advantage of the generally sparse nature of these real-world databases and modulating a parameter that controls query sparsity, we demonstrate that relatively few nonadaptive queries are needed to recover a large majority of each database.

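    The group-testing connection is easy to demonstrate at toy scale. In the sketch below, the only privacy leak is a single bit per query ("does the hidden record share any 1-entry with this pool?"), all pools are fixed in advance (nonadaptive), and a sparse record is recovered by eliminating everything seen in a negative pool. The pool count and sizes are arbitrary illustrative choices.

```python
import random

def make_pools(n, n_pools, pool_size, seed=1):
    rng = random.Random(seed)
    return [set(rng.sample(range(n), pool_size)) for _ in range(n_pools)]

def attack(oracle, n, pools):
    answers = [oracle(p) for p in pools]   # all queries issued up front
    candidates = set(range(n))
    for pool, ans in zip(pools, answers):
        if not ans:                        # negative pool: all its items are 0
            candidates -= pool
    return candidates                      # superset of the hidden support

secret = {3, 17, 42}                       # sparse hidden record
oracle = lambda pool: bool(pool & secret)  # the 1-bit privacy leak
pools = make_pools(n=64, n_pools=200, pool_size=16)
print(sorted(attack(oracle, 64, pools)))   # with high probability: [3, 17, 42]
```
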
  • On the Recovery of R-Trees

    Page(s): 145 - 157
    PDF (354 KB) | HTML

    We consider the recoverability of traditional R-tree index structures under concurrent updating transactions, an important issue that is neglected or treated inadequately in many proposals of R-tree concurrency control. We present two solutions to ARIES-based recovery of transactions on R-trees. These assume a standard fine-grained single-version update model with physiological write-ahead logging and steal-and-no-force buffering, where records with uncommitted updates by one transaction may migrate from their original page to another page due to structure modifications caused by other transactions. Both solutions guarantee that an R-tree will remain in a consistent and balanced state in the presence of any number of concurrent forward-rolling and (totally or partially) backward-rolling multiaction transactions, and in the event of process failures and system crashes. One solution maintains the R-tree in a strictly consistent state in which the bounding rectangles of pages are as tight as possible, while the other relaxes this requirement. In both solutions only a small constant number of simultaneous exclusive latches (write latches) is needed, and in the solution that maintains only relaxed consistency the number of simultaneous nonexclusive latches is similarly limited. In both solutions, deletions are handled uniformly with insertions, and a logarithmic insertion-path length is maintained under all circumstances.

  • Ontology Matching: State of the Art and Future Challenges

    Page(s): 158 - 176
    PDF (1152 KB) | HTML

    After years of research on ontology matching, it is reasonable to consider several questions: is the field of ontology matching still making progress? Is this progress significant enough to pursue further research? If so, what are the particularly promising directions? To answer these questions, we review the state of the art of ontology matching and analyze the results of recent ontology matching evaluations. These results show measurable improvement in the field, although the pace of that improvement is slowing. We conjecture that significant improvements can be obtained only by addressing important challenges for ontology matching. We present such challenges, with insights on how to approach them, thereby aiming to direct research into the most promising tracks and to facilitate progress in the field.

  • Ranking on Data Manifold with Sink Points

    Page(s): 177 - 191
    PDF (2948 KB)

    Ranking is an important problem in various applications, such as Information Retrieval (IR), natural language processing, computational biology, and the social sciences. Many ranking approaches have been proposed to rank objects according to their degrees of relevance or importance. Beyond these two goals, diversity has also been recognized as a crucial criterion in ranking: top-ranked results are expected to convey as little redundant information as possible and to cover as many aspects as possible. However, existing ranking approaches either take no account of diversity or handle it separately with heuristics. In this paper, we introduce a novel approach, Manifold Ranking with Sink Points (MRSP), to address diversity as well as relevance and importance in ranking. Specifically, our approach uses a manifold ranking process over the data manifold, which can naturally find the most relevant and important data objects. Meanwhile, by turning ranked objects into sink points on the data manifold, we can effectively prevent redundant objects from receiving a high rank. MRSP not only shows a nice convergence property but also has an interesting and satisfying optimization explanation. We applied MRSP to two application tasks, update summarization and query recommendation, where diversity is of great concern in ranking. Experimental results on both tasks show strong empirical performance of MRSP compared to existing ranking approaches.

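    The core loop is compact enough to sketch. The Python fragment below runs standard manifold ranking, then repeatedly turns the top-ranked item into a sink whose score is clamped to zero, steering subsequent picks away from redundant neighbors. It is a generic rendering of the idea, with the graph normalization and parameter values as assumptions, not the authors' code.

```python
import numpy as np

def mrsp(W, query, n_select, alpha=0.85, n_iter=100):
    """W: symmetric, nonnegative affinity matrix with positive degrees.
    query: index (or indices) of the query object(s)."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # symmetric normalization
    y = np.zeros(len(W)); y[query] = 1.0
    sinks, selected = np.zeros(len(W), bool), []
    for _ in range(n_select):
        f = np.zeros(len(W))
        for _ in range(n_iter):
            f = alpha * S @ f + (1 - alpha) * y
            f[sinks] = 0.0                   # sinks absorb, never rank
        f[selected] = -np.inf                # never pick an item twice
        f[query] = -np.inf                   # don't return the query itself
        best = int(np.argmax(f))
        selected.append(best)
        sinks[best] = True                   # turn the pick into a sink
    return selected
```
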
  • Region-Based Foldings in Process Discovery

    Page(s): 192 - 205
    PDF (2039 KB)

    A central problem in the area of Process Mining is to obtain a formal model that represents the processes conducted in a system. If realized, this simple goal enables powerful techniques for formally analyzing and optimizing a system, without the need to resort to its semiformal and sometimes inaccurate specification. The problem addressed in this paper is known as Process Discovery: obtaining a formal model from a set of system executions. The theory of regions is a valuable tool in process discovery: it aims at learning a formal model (Petri nets) from a set of traces. In its original form, the theory applies to an automaton, so one must convert the traces into an acyclic automaton in order to apply these techniques. Given that the complexity of the region-based techniques depends on the size of the input automaton, revealing the underlying cycles and folding the initial automaton can significantly alleviate the complexity of the region-based techniques. In this paper, we follow this idea by incorporating region information into the cycle detection algorithm, enabling the identification of complex cycles that cannot be obtained efficiently with state-of-the-art techniques. The experimental results obtained by the devised tool suggest that the techniques presented in this paper are a significant step toward widening the application of the theory of regions in Process Mining to industrial scenarios.

  • Relationships between Diversity of Classification Ensembles and Single-Class Performance Measures

    Page(s): 206 - 219
    PDF (1682 KB) | HTML

    In class imbalance learning problems, how to better recognize examples from the minority class is the key focus, since the minority class is usually more important, and more expensive to misclassify, than the majority class. Quite a few ensemble solutions have been proposed in the literature with varying degrees of success. It is generally believed that diversity in an ensemble can help to improve the performance of class imbalance learning. However, no study has investigated diversity in depth in terms of its definitions and effects in the context of class imbalance learning. It is unclear whether diversity has a similar or different impact on the performance of the minority and majority classes. In this paper, we aim to gain a deeper understanding of whether and when ensemble diversity has a positive impact on the classification of imbalanced data sets. First, we explain when and why diversity, measured by the Q-statistic, can bring improved overall accuracy, based on two classification patterns proposed by Kuncheva et al. We define and give insights into good and bad patterns in imbalanced scenarios. Then, the pattern analysis is extended to single-class performance measures, including recall, precision, and F-measure, which are widely used in class imbalance learning. Six different situations of diversity's impact on these measures are obtained through theoretical analysis. Finally, to further understand how diversity affects single-class performance and overall performance in class imbalance problems, we carry out extensive experimental studies on both artificial data sets and real-world benchmarks with highly skewed class distributions. We find strong correlations between diversity and the discussed performance measures. Diversity shows a positive impact on the minority class in general, and it is also beneficial to overall performance in terms of AUC and G-mean.

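    For reference, the pairwise Q-statistic discussed above is straightforward to compute from two classifiers' correctness indicators. The sketch below follows the standard definition, Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts examples both classifiers get right, N00 both get wrong, and so on; the example data are made up.

```python
import numpy as np

def q_statistic(correct_a, correct_b):
    a = np.asarray(correct_a, bool)
    b = np.asarray(correct_b, bool)
    n11 = np.sum(a & b);   n00 = np.sum(~a & ~b)
    n10 = np.sum(a & ~b);  n01 = np.sum(~a & b)
    # Q near 1: similar errors (low diversity); near -1: complementary errors.
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Two classifiers scored on ten examples (1 = correctly classified).
ca = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
cb = [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
print(q_statistic(ca, cb))   # ~0.67: the two err on similar examples
```
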
  • T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence

    Page(s): 220 - 232
    PDF (2361 KB) | Multimedia

    This paper presents a smart driving direction system that leverages the intelligence of experienced drivers. In this system, GPS-equipped taxis are employed as mobile sensors probing the traffic rhythm of a city and taxi drivers' intelligence in choosing driving directions in the physical world. We propose a time-dependent landmark graph to model the dynamic traffic pattern as well as the intelligence of experienced drivers, so as to provide a user with the practically fastest route to a given destination at a given departure time. A Variance-Entropy-Based Clustering approach is devised to estimate the distribution of travel time between two landmarks in different time slots. Based on this graph, we design a two-stage routing algorithm to compute the practically fastest, customized route for end users. We build our system on a real-world trajectory data set generated by over 33,000 taxis over a period of three months, and evaluate the system by conducting both synthetic experiments and in-the-field evaluations. As a result, 60-70 percent of the routes suggested by our method are faster than those of the competing methods, and 20 percent of the routes are identical. On average, 50 percent of our routes are at least 20 percent faster than the competing approaches.

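    The routing side of such a system can be sketched as time-dependent Dijkstra: each edge carries a travel-time profile indexed by the time slot at which the edge is entered (hourly slots here, with FIFO travel times assumed). The graph layout and granularity below are illustrative assumptions, not the paper's landmark graph construction.

```python
import heapq

def fastest_route(graph, source, target, depart):
    """graph[u] = list of (v, profile), where profile[h] is the travel time
    in minutes when entering the edge during hour h.  Assumes FIFO travel
    times and a reachable target; depart is minutes after midnight."""
    best = {source: depart}
    parents = {source: None}
    heap = [(depart, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == target:
            break
        if t > best.get(u, float("inf")):
            continue                          # stale queue entry
        for v, profile in graph[u]:
            arrive = t + profile[int(t // 60) % 24]   # slot by entry time
            if arrive < best.get(v, float("inf")):
                best[v], parents[v] = arrive, u
                heapq.heappush(heap, (arrive, v))
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1], best[target] - depart   # route and total travel time
```
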
  • 2012 Reviewers List

    Page(s): 233 - 238
    PDF (85 KB)
    Freely Available from IEEE
  • 2012 Annual index

    Page(s): Not in print
    PDF (672 KB)
    Freely Available from IEEE
  • New transactions newsletter! [advertisement]

    Page(s): 239
    PDF (1647 KB)
    Freely Available from IEEE
  • What's New in Transactions [advertisement]

    Page(s): 240
    PDF (348 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University