
IEEE Transactions on Knowledge and Data Engineering

Issue 4 • April 2014

  • A Two-Level Topic Model Towards Knowledge Discovery from Citation Networks

    Page(s): 780 - 794

    Knowledge discovery from scientific articles has received increasing attention recently, since huge repositories have been made available by the development of the Internet and digital databases. In a corpus of scientific articles such as a digital library, documents are connected by citations, and one document plays two different roles in the corpus: the document itself and a citation of other documents. Existing topic models make little effort to differentiate these two roles. We believe that the topic distributions of these two roles are different and related in a certain way. In this paper, we propose a Bernoulli process topic (BPT) model which considers the corpus at two levels: the document level and the citation level. In the BPT model, each document has two different representations in the latent topic space associated with its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach. An efficient computation algorithm is proposed to overcome the difficulty of the matrix inversion operation. In addition to conducting experimental evaluations on document modeling and document clustering tasks, we also apply the BPT model to well-known corpora to discover latent topics, recommend important citations, detect trends in various research areas of computer science between 1991 and 1998, and investigate the interactions among those research areas. Comparisons against state-of-the-art methods demonstrate very promising performance. The implementations and the data sets are available online.
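
    To make the two-roles idea concrete, here is a minimal, hypothetical generative sketch: a word's topic comes either from the document's own topic distribution or from a cited document's, selected by a Bernoulli trial. The function name, the `switch_p` parameter, and the toy distributions are illustrative assumptions; this is not the BPT model or its variational estimation.

    ```python
    import random

    def toy_two_level_topic_draw(doc_topics, cited_topics, switch_p=0.5):
        """Toy sketch (NOT the BPT model): generate one word's topic either
        from the document's own distribution (document-level role) or from a
        cited document's distribution (citation-level role), chosen by a
        Bernoulli trial with hypothetical parameter switch_p."""
        if cited_topics and random.random() < switch_p:
            source = random.choice(cited_topics)   # citation-level role
        else:
            source = doc_topics                    # document-level role
        topics, weights = zip(*source.items())
        return random.choices(topics, weights=weights)[0]

    # Usage: document 0 cites document 1; draw a topic for one word of doc 0.
    d0 = {"ml": 0.7, "db": 0.3}
    d1 = {"ml": 0.2, "db": 0.8}
    print(toy_two_level_topic_draw(d0, [d1]))
    ```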

  • Accuracy-Constrained Privacy-Preserving Access Control Mechanism for Relational Data

    Page(s): 795 - 807

    Access control mechanisms protect sensitive information from unauthorized users. However, when sensitive information is shared and a Privacy Protection Mechanism (PPM) is not in place, an authorized user can still compromise the privacy of a person, leading to identity disclosure. A PPM can use suppression and generalization of relational data to anonymize and satisfy privacy requirements, e.g., k-anonymity and l-diversity, against identity and attribute disclosure. However, privacy is achieved at the cost of the precision of authorized information. In this paper, we propose an accuracy-constrained privacy-preserving access control framework. The access control policies define selection predicates available to roles, while the privacy requirement is to satisfy k-anonymity or l-diversity. An additional constraint that needs to be satisfied by the PPM is the imprecision bound for each selection predicate. Techniques for workload-aware anonymization for selection predicates have been discussed in the literature. However, to the best of our knowledge, the problem of satisfying the accuracy constraints for multiple roles has not been studied before. In our formulation of the aforementioned problem, we propose heuristics for anonymization algorithms and show empirically that the proposed approach satisfies imprecision bounds for more permissions and has lower total imprecision than the current state of the art.
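
    As background for the privacy requirement the paper builds on, the sketch below checks k-anonymity for a generalized relation: every quasi-identifier combination must be shared by at least k rows. It is a minimal property check under assumed column names, not the paper's accuracy-constrained anonymization algorithm.

    ```python
    from collections import Counter

    def satisfies_k_anonymity(rows, quasi_identifiers, k):
        """Check k-anonymity: every combination of quasi-identifier values
        must be shared by at least k rows. `rows` is a list of dicts;
        `quasi_identifiers` lists the column names to group by."""
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
        return all(count >= k for count in groups.values())

    # Usage: after generalizing ZIP and age values, re-check the property.
    rows = [{"zip": "478**", "age": "2*"}, {"zip": "478**", "age": "2*"}]
    print(satisfies_k_anonymity(rows, ["zip", "age"], k=2))  # True
    ```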

  • Active Learning without Knowing Individual Instance Labels: A Pairwise Label Homogeneity Query Approach

    Page(s): 808 - 822

    Traditional active learning methods require the labeler to provide a class label for each queried instance. The labelers are normally highly skilled domain experts, which ensures the correctness of the provided labels but in turn results in high labeling cost. To reduce labeling cost, an alternative solution is to allow nonexpert labelers to carry out the labeling task without explicitly providing the class label of each queried instance. In this paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked “whether a pair of instances belongs to the same class”, namely, a pairwise label homogeneity. Under such circumstances, our active learning goal is twofold: (1) to decide which pair of instances should be selected for querying, and (2) to determine how to make use of the pairwise homogeneity information to improve the active learner. To achieve this goal, we propose a “Pairwise Query on Max-flow Paths” strategy to query pairwise label homogeneity from a nonexpert labeler, whose query results are further used to dynamically update a Min-cut model (to differentiate instances in different classes). In addition, a “Confidence-based Data Selection” measure is used to evaluate data utility based on the Min-cut model's prediction results. The selected instances, with inferred class labels, are included in the labeled set to form a closed-loop active learning process. Experimental results and comparisons with state-of-the-art methods demonstrate that our new active learning paradigm can achieve good performance with nonexpert labelers.
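
    To illustrate how pairwise same-class answers can be exploited without any individual labels, here is a simplified union-find sketch that groups instances from the labeler's answers. It is a stand-in for intuition only; the paper's method maintains a Min-cut model over max-flow paths rather than disjoint sets.

    ```python
    class UnionFind:
        """Group unlabeled instances using only pairwise 'same class?' answers
        from a nonexpert labeler -- a simplified illustration, not the paper's
        Min-cut update."""
        def __init__(self, n):
            self.parent = list(range(n))
        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    # Usage: answers are (i, j, same_class) triples from the labeler.
    answers = [(0, 1, True), (1, 2, False), (2, 3, True)]
    uf = UnionFind(4)
    for i, j, same in answers:
        if same:
            uf.union(i, j)
    print([uf.find(i) for i in range(4)])  # 0,1 share a group; 2,3 share another
    ```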

  • An Automated Framework for Incorporating News into Stock Trading Strategies

    Page(s): 823 - 835

    In this paper, we present a framework for the automatic exploitation of news in stock trading strategies. Events are extracted from news messages presented in free text without annotations. We test the introduced framework by deriving trading strategies based on technical indicators and the impacts of the extracted events. The strategies take the form of rules that combine technical trading indicators with a news variable, and are revealed through the use of genetic programming. We find that the news variable is often included in the optimal trading rules, indicating the added value of news for predictive purposes and validating our proposed framework for automatically incorporating news into stock trading strategies.
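
    The kind of rule such a framework might evolve can be sketched directly; below, a moving-average crossover indicator is combined with a news-impact variable. The thresholds, window sizes, and `news_weight` parameter are illustrative assumptions, not rules reported by the paper.

    ```python
    def trading_signal(prices, news_impact, short=5, long=20, news_weight=0.5):
        """Toy rule of the form the genetic programming search might find:
        a technical indicator (SMA crossover) plus a weighted news variable.
        All thresholds and weights here are hypothetical."""
        if len(prices) < long:
            return "hold"
        sma_short = sum(prices[-short:]) / short
        sma_long = sum(prices[-long:]) / long
        technical = 1 if sma_short > sma_long else -1
        score = technical + news_weight * news_impact  # news_impact in [-1, 1]
        if score > 0.5:
            return "buy"
        if score < -0.5:
            return "sell"
        return "hold"
    ```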

  • CoRE: A Context-Aware Relation Extraction Method for Relation Completion

    Page(s): 836 - 849

    We identify relation completion (RC) as one recurring problem that is central to the success of novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation ℜ, RC attempts to link entity pairs between two entity lists under the relation ℜ. To accomplish the RC goal, we propose to formulate search queries for each query entity α based on some auxiliary information, so as to detect its target entity β from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns tend to be scarce, which may decrease the probability of finding suitable target entities. As an alternative, we propose the CoRE method, which uses context terms learned from the text surrounding expressions of a relation as the auxiliary information in formulating queries. Experimental results based on several real-world web data collections demonstrate that CoRE achieves much higher accuracy than PaRE for the purpose of RC.
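
    Query formulation with context terms is easy to picture; the sketch below pairs a query entity with its top-weighted context terms to build web search queries. The weighting scheme and term names are hypothetical, standing in for terms learned from a training corpus.

    ```python
    def formulate_queries(query_entity, context_terms, top_n=3):
        """Sketch of context-aware query formulation: combine the query
        entity with its highest-weighted context terms. Weights are assumed
        to be learned from text surrounding known relation instances."""
        ranked = sorted(context_terms.items(), key=lambda kv: -kv[1])
        return [f'"{query_entity}" {term}' for term, _ in ranked[:top_n]]

    # Usage for a "headquartered in" relation with hypothetical term weights.
    print(formulate_queries("Acme Corp",
                            {"headquarters": 0.9, "based": 0.7, "office": 0.4}))
    ```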

  • Efficient Ranking on Entity Graphs with Personalized Relationships

    Page(s): 850 - 863

    Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-relationship graphs. There are two main ways to personalize authority flow ranking: node-based personalization, where authority originates from a set of user-specific nodes, and edge-based personalization, where the importance of different edge types is user-specific. We propose the first approach to achieve efficient edge-based personalization using a combination of precomputation and runtime algorithms. In particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV) assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox, formulated as a distance minimization problem at the schema level, and (b) DataApprox, a distance minimization problem at the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial edge types based on the edge distribution in the data graph. In contrast, DataApprox has a provable error bound. Both SchemaApprox and DataApprox are expensive, so we develop efficient heuristic implementations, ScaleRank and PickOne, respectively. Extensive experiments on the DBLP data graph show that ScaleRank provides fast and accurate personalized authority flow ranking.
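
    For intuition about what a WAV does, here is a plain power-iteration sketch in which the weight of each edge type scales the authority flowing along it. This is an ordinary personalized-PageRank-style iteration under assumed parameter names; the paper's contribution, precomputing a repository of rankings so this need not run from scratch at query time, is not shown.

    ```python
    def personalized_authority_flow(edges, wav, n, damping=0.85, iters=50):
        """Power-iteration sketch of edge-based personalization: the WAV
        entry for an edge's type scales the authority flowing along it.
        edges: list of (src, dst, edge_type); wav: {edge_type: weight}."""
        out = {u: [] for u in range(n)}
        for u, v, t in edges:
            out[u].append((v, wav.get(t, 0.0)))
        rank = [1.0 / n] * n
        for _ in range(iters):
            nxt = [(1.0 - damping) / n] * n
            for u in range(n):
                total = sum(w for _, w in out[u])
                if total == 0:
                    continue  # dangling node: authority not redistributed
                for v, w in out[u]:
                    nxt[v] += damping * rank[u] * w / total
            rank = nxt
        return rank

    # Usage: a user who values "cites" edges far more than "authored" edges.
    edges = [(0, 1, "cites"), (1, 2, "authored"), (2, 0, "cites")]
    print(personalized_authority_flow(edges, {"cites": 1.0, "authored": 0.2}, n=3))
    ```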

  • Extended Subtree: A New Similarity Function for Tree Structured Data

    Page(s): 864 - 877

    Although several distance or similarity functions for trees have been introduced, their performance is not always satisfactory in many applications, ranging from document clustering to natural language processing. This research proposes a new similarity function for trees, namely Extended Subtree (EST), where a new subtree mapping is proposed. EST generalizes edit-based distances by providing new rules for subtree mapping. Further, the new approach seeks to resolve the problems and limitations of previous approaches. Extensive evaluation frameworks are developed to evaluate the performance of the new approach against previous proposals. Clustering and classification case studies utilizing three real-world and one synthetic labeled data set are performed to provide an unbiased evaluation where different distance functions are investigated. The experimental results demonstrate the superior performance of the proposed distance function. In addition, an empirical runtime analysis demonstrates that the new approach is one of the best tree distance functions in terms of runtime efficiency.
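
    To ground the notion of subtree-based comparison that EST builds on, here is a baseline Dice-style similarity over the multisets of subtrees of two labeled trees. It is a simple illustration of comparing trees through shared subtrees, not the EST mapping rules themselves.

    ```python
    from collections import Counter

    def collect_subtrees(tree, bag):
        """tree is (label, [children]); record a canonical signature for
        every subtree in `bag` and return the signature of `tree` itself."""
        label, children = tree
        sig = (label, tuple(collect_subtrees(c, bag) for c in children))
        bag.append(sig)
        return sig

    def shared_subtree_similarity(t1, t2):
        """Dice-style similarity over subtree multisets -- a baseline
        illustration only, not the EST function."""
        b1, b2 = [], []
        collect_subtrees(t1, b1)
        collect_subtrees(t2, b2)
        shared = sum((Counter(b1) & Counter(b2)).values())
        return 2.0 * shared / (len(b1) + len(b2))

    # Usage: two small labeled trees sharing the subtree b(c).
    t1 = ("a", [("b", [("c", [])])])
    t2 = ("a", [("b", [("c", [])]), ("d", [])])
    print(shared_subtree_similarity(t1, t2))  # 4/7 ≈ 0.571
    ```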

  • Fast Nearest Neighbor Search with Keywords

    Page(s): 878 - 888

    Conventional spatial queries, such as range search and nearest neighbor retrieval, involve only conditions on objects' geometric properties. Today, many modern applications call for novel forms of queries that aim to find objects satisfying both a spatial predicate and a predicate on their associated texts. For example, instead of considering all the restaurants, a nearest neighbor query could instead ask for the restaurant that is the closest among those whose menus contain “steak, spaghetti, brandy” all at the same time. Currently, the best solution to such queries is based on the IR²-tree, which, as shown in this paper, has a few deficiencies that seriously impact its efficiency. Motivated by this, we develop a new access method called the spatial inverted index that extends the conventional inverted index to cope with multidimensional data, and comes with algorithms that can answer nearest neighbor queries with keywords in real time. As verified by experiments, the proposed techniques outperform the IR²-tree in query response time significantly, often by orders of magnitude.
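
    The core structure is easy to sketch: one inverted list per keyword whose entries also carry coordinates, so a keyword-constrained nearest neighbor query scans only objects containing every query keyword. The class below is a minimal in-memory illustration under assumed names; real spatial inverted indexes compress the lists and organize them spatially.

    ```python
    import math
    from collections import defaultdict

    class SpatialInvertedIndex:
        """Minimal sketch: inverted lists whose entries carry coordinates,
        so nearest neighbor search with keywords intersects lists first."""
        def __init__(self):
            self.lists = defaultdict(list)  # keyword -> [(id, x, y)]
        def add(self, oid, x, y, keywords):
            for kw in keywords:
                self.lists[kw].append((oid, x, y))
        def nearest(self, qx, qy, keywords):
            # Intersect inverted lists by object id, then pick the closest.
            sets = [set(e[0] for e in self.lists[kw]) for kw in keywords]
            candidates = set.intersection(*sets) if sets else set()
            coords = {e[0]: (e[1], e[2]) for kw in keywords for e in self.lists[kw]}
            return min(candidates, default=None,
                       key=lambda o: math.dist((qx, qy), coords[o]))

    # Usage: find the closest restaurant whose menu has both keywords.
    idx = SpatialInvertedIndex()
    idx.add("r1", 0, 0, {"steak", "brandy"})
    idx.add("r2", 1, 1, {"steak"})
    print(idx.nearest(1, 1, ["steak", "brandy"]))  # r1
    ```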

  • Improving Activity Recognition by Segmental Pattern Mining

    Page(s): 889 - 902

    Activity recognition is a key task for the development of advanced and effective ubiquitous applications in fields like ambient assisted living. A major problem in designing effective recognition algorithms is the difficulty of incorporating long-range dependencies between distant time instants without incurring a substantial increase in the computational complexity of inference. In this paper, we present a novel approach for introducing long-range interactions based on sequential pattern mining. The algorithm searches for patterns characterizing time segments during which the same activity is performed. A probabilistic model is learned to represent the distribution of pattern matches along sequences, trying to maximize the coverage of an activity segment by a pattern match. The model is integrated in a segmental labeling algorithm and applied to novel sequences, tagged according to matches of the extracted patterns. The rationale of the approach is that restricting dependencies to span the same activity segment (i.e., sharing the same label) keeps inference tractable. An experimental evaluation shows that enriching sensor-based representations with the mined patterns improves results over sequential and segmental labeling algorithms in most cases. An analysis of the discovered patterns highlights non-trivial interactions spanning a significant time horizon.
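
    To make "patterns characterizing same-activity segments" concrete, the sketch below splits a labeled event stream into maximal same-label segments and counts event bigrams within them. It is a deliberately simplified stand-in for the paper's sequential pattern mining; the bigram choice and names are assumptions.

    ```python
    from itertools import groupby
    from collections import Counter

    def mine_segment_patterns(events, labels, min_support=2):
        """Count short event patterns (bigrams) inside maximal segments that
        share one activity label -- a simplified illustration only."""
        patterns = Counter()
        for label, run in groupby(zip(labels, events), key=lambda p: p[0]):
            seg = [e for _, e in run]
            for i in range(len(seg) - 1):
                patterns[(label, (seg[i], seg[i + 1]))] += 1
        return {p: c for p, c in patterns.items() if c >= min_support}

    # Usage: "cooking" is characterized by a repeated (stove_on, fridge_open) bigram.
    events = ["stove_on", "fridge_open", "stove_on", "fridge_open", "tv_on"]
    labels = ["cooking", "cooking", "cooking", "cooking", "relaxing"]
    print(mine_segment_patterns(events, labels))
    ```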

  • Infrequent Weighted Itemset Mining Using Frequent Pattern Growth

    Page(s): 903 - 915

    Frequent weighted itemsets represent correlations that frequently hold in data in which items may be weighted differently. However, in some contexts, e.g., when the need is to minimize a certain cost function, discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of discovering rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Two novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that perform IWI and Minimal IWI mining efficiently, driven by the proposed measures, are presented. Experimental results show the efficiency and effectiveness of the proposed approach.
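
    One plausible weighted-support measure, and a brute-force miner around it, can be sketched as follows: credit each occurrence of an itemset with the minimum weight among its items, and report itemsets whose total falls below a threshold. The measure is a hedged approximation of the paper's quality measures, and the naive enumeration replaces its FP-growth-based algorithms.

    ```python
    from itertools import combinations

    def weighted_support_min(itemset, transactions):
        """Hedged measure: for each transaction containing the itemset,
        credit the minimum weight among its items; sum over transactions.
        Each transaction maps item -> weight."""
        return sum(min(t[i] for i in itemset)
                   for t in transactions if all(i in t for i in itemset))

    def mine_infrequent(transactions, max_support, max_size=3):
        """Brute-force enumeration (the paper uses FP-growth-like mining):
        report itemsets whose weighted support is at most max_support."""
        items = sorted({i for t in transactions for i in t})
        return [(iset, s)
                for k in range(1, max_size + 1)
                for iset in combinations(items, k)
                if (s := weighted_support_min(iset, transactions)) <= max_support]

    # Usage: weights could encode, e.g., a per-occurrence cost to minimize.
    transactions = [{"a": 5, "b": 1}, {"a": 4}, {"b": 2, "c": 1}]
    print(mine_infrequent(transactions, max_support=3))
    ```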

  • Local Thresholding in General Network Graphs

    Page(s): 916 - 928

    Local thresholding algorithms were first presented more than a decade ago and have since been applied to a variety of data mining tasks in peer-to-peer systems, wireless sensor networks, and grid systems. One critical assumption made by those algorithms has always been cycle-free routing. The existence of even one cycle may lead all peers to the wrong outcome. Outside the lab, unfortunately, cycle freedom is not easy to achieve. This work is the first to lift the requirement of cycle freedom by presenting a local thresholding algorithm suitable for general network graphs. The algorithm relies on a new repositioning of the problem in weighted vector arithmetic, on a new stopping rule, whose proof does not require that the network be cycle free, and on new methods for balance correction when the stopping rule fails. The new stopping and update rules permit calculation of the very same functions that were calculable using previous algorithms, which do assume cycle freedom. The algorithm is implemented on a standard peer-to-peer simulator and is validated for networks of up to 80,000 peers, organized in three different topologies representative of major current distributed systems: the Internet, structured peer-to-peer systems, and wireless sensor networks.

  • MultiComm: Finding Community Structure in Multi-Dimensional Networks

    Page(s): 929 - 941

    The main aim of this paper is to develop a community discovery scheme in a multi-dimensional network for data mining applications. In online social media, networked data consists of multiple dimensions/entities such as users, tags, photos, comments, and stories. We are interested in finding a group of users who interact significantly on these media entities. In a co-citation network, we are interested in finding a group of authors who relate to other authors significantly on publication information in titles, abstracts, and keywords as multiple dimensions/entities in the network. The main contribution of this paper is to propose a framework (MultiComm) to identify a seed-based community in a multi-dimensional network by evaluating the affinity between two items in the same type of entity (same dimension) or different types of entities (different dimensions) from the network. Our idea is to calculate the probabilities of visiting each item in each dimension, and compare their values to generate communities from a set of seed items. To evaluate the quality of the communities generated by the proposed algorithm, we develop and study a local modularity measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data sets suggest that the proposed framework is able to find communities effectively. Experimental results have also shown that the proposed algorithm is more accurate than the other tested algorithms in finding communities in multi-dimensional networks.
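
    The visiting-probability idea can be illustrated with a plain random walk with restart from the seed items over a graph that mixes entity types. The adjacency structure and parameters below are toy assumptions; MultiComm's per-dimension probability comparison and local modularity measure are omitted.

    ```python
    def community_from_seeds(adjacency, seeds, restart=0.15, iters=100, top_k=5):
        """Random-walk-with-restart sketch: compute visiting probabilities
        starting from seed items, then return the highest-probability items.
        `adjacency` maps an item to its neighbors across any dimension."""
        nodes = list(adjacency)
        p = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
        for _ in range(iters):
            nxt = {u: (restart / len(seeds) if u in seeds else 0.0) for u in nodes}
            for u in nodes:
                if adjacency[u]:
                    share = (1.0 - restart) * p[u] / len(adjacency[u])
                    for v in adjacency[u]:
                        nxt[v] += share
            p = nxt
        return sorted(p, key=p.get, reverse=True)[:top_k]

    # Usage: a tiny user-tag-photo graph; seed on one user.
    g = {"u1": ["t1", "p1"], "u2": ["t1"], "t1": ["u1", "u2"],
         "p1": ["u1"], "u3": ["t2"], "t2": ["u3"]}
    print(community_from_seeds(g, seeds={"u1"}, top_k=3))
    ```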

  • On Skyline Groups

    Page(s): 942 - 956

    We formulate and investigate the novel problem of finding the skyline k-tuple groups from an n-tuple data set, i.e., groups of k tuples which are not dominated by any other group of equal size, based on an aggregate-based group dominance relationship. The major technical challenge is to identify effective anti-monotonic properties for pruning the search space of skyline groups. To this end, we first show that the anti-monotonic property in the well-known Apriori algorithm does not hold for skyline group pruning. Then, we identify two anti-monotonic properties with varying degrees of applicability: an order-specific property, which applies to SUM, MIN, and MAX, and a weak candidate-generation property, which applies to MIN and MAX only. Experimental results on both real and synthetic data sets verify that the proposed algorithms achieve orders of magnitude performance gain over the baseline method.
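
    The problem statement itself is compact enough to sketch as a brute-force baseline: aggregate each k-tuple group with SUM and keep the groups that no other group dominates. The anti-monotonic pruning that makes the paper's algorithms practical is exactly what this naive enumeration lacks.

    ```python
    from itertools import combinations

    def dominates(g1, g2):
        """g1 dominates g2 if it is at least as good on every dimension and
        strictly better on at least one (larger is better here)."""
        return (all(a >= b for a, b in zip(g1, g2))
                and any(a > b for a, b in zip(g1, g2)))

    def skyline_groups_sum(tuples, k):
        """Brute-force skyline k-tuple groups under the SUM aggregate --
        the unpruned baseline, not the paper's algorithms."""
        groups = list(combinations(range(len(tuples)), k))
        agg = {g: tuple(map(sum, zip(*(tuples[i] for i in g)))) for g in groups}
        return [g for g in groups
                if not any(dominates(agg[h], agg[g]) for h in groups if h != g)]

    # Usage: 2-tuple groups over 2-dimensional tuples.
    data = [(3, 1), (1, 3), (2, 2), (0, 0)]
    print(skyline_groups_sum(data, k=2))  # [(0, 1), (0, 2), (1, 2)]
    ```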

  • Quasi-SLCA Based Keyword Query Processing over Probabilistic XML Data

    Page(s): 957 - 969

    The probabilistic threshold query is one of the most common queries in uncertain databases, where a result satisfying the query must also meet the threshold requirement on its probability. In this paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, which have not been studied before. We first introduce the notion of quasi-SLCA and use it to represent results for a PrTKQ with consideration of possible world semantics. Then we design a probabilistic inverted (PI) index that can be used to quickly return the qualified answers and filter out the unqualified ones based on our proposed lower/upper bounds. After that, we propose two efficient and comparable algorithms: a Baseline Algorithm and a PI index-based Algorithm. To accelerate the algorithms, we also utilize a probability density function. An empirical study using real and synthetic data sets has verified the effectiveness and the efficiency of our approaches.
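
    The threshold test at the heart of such queries can be sketched as follows, assuming (as a strong simplification of possible-world semantics) independent per-keyword containment probabilities for each candidate result. Names and probabilities are illustrative; the quasi-SLCA semantics and PI-index bounds are not modeled.

    ```python
    def probabilistic_threshold_filter(candidates, keywords, threshold):
        """Keep candidate results whose joint keyword probability meets the
        threshold, assuming independence across keywords (a simplification).
        candidates: list of (result_id, {keyword: containment_probability})."""
        qualified = []
        for rid, kw_probs in candidates:
            p = 1.0
            for kw in keywords:
                p *= kw_probs.get(kw, 0.0)
            if p >= threshold:
                qualified.append((rid, p))
        return sorted(qualified, key=lambda t: -t[1])

    # Usage: node n1 contains both keywords with high probability.
    cands = [("n1", {"xml": 0.9, "query": 0.8}),
             ("n2", {"xml": 0.5, "query": 0.3})]
    print(probabilistic_threshold_filter(cands, ["xml", "query"], threshold=0.5))
    ```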

  • Secure Mining of Association Rules in Horizontally Distributed Databases

    Page(s): 970 - 983

    We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al., which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that computes the union of private subsets that each of the interacting players hold, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to that of Kantarcioglu and Clifton. In addition, it is simpler and significantly more efficient in terms of communication rounds, communication cost, and computational cost.
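
    The unsecured FDM skeleton that both protocols build on is straightforward to sketch: each site mines locally frequent itemsets, the candidates are unioned, and global support is verified. The sketch below does this in the clear; the paper's whole point, performing the union and inclusion tests securely across parties, is deliberately omitted.

    ```python
    from itertools import combinations

    def locally_frequent(transactions, min_support):
        """Itemsets (up to size 2 for brevity) whose local support meets
        the relative min_support; values are absolute counts."""
        items = sorted({i for t in transactions for i in t})
        counts = {}
        for k in (1, 2):
            for iset in combinations(items, k):
                c = sum(all(i in t for i in iset) for t in transactions)
                if c >= min_support * len(transactions):
                    counts[iset] = c
        return counts

    def fdm_style_union(sites, min_support):
        """Unsecured FDM skeleton: union the locally frequent candidates,
        then verify global support. No cryptographic protection here."""
        candidates = set()
        for site in sites:
            candidates |= set(locally_frequent(site, min_support))
        total = sum(len(s) for s in sites)
        return {iset: c for iset in candidates
                if (c := sum(sum(all(i in t for i in iset) for t in s)
                             for s in sites)) >= min_support * total}

    # Usage: two sites' transaction databases, 50% support threshold.
    site1 = [{"a", "b"}, {"a"}, {"a", "c"}]
    site2 = [{"a", "b"}, {"b"}]
    print(fdm_style_union([site1, site2], min_support=0.5))
    ```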

  • Security Evaluation of Pattern Classifiers under Attack

    Page(s): 984 - 996

    Pattern classification systems are commonly used in adversarial applications, like biometric authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by humans to undermine their operation. As this adversarial scenario is not taken into account by classical design methods, pattern classification systems may exhibit vulnerabilities whose exploitation may severely affect their performance and consequently limit their practical utility. Extending pattern classification theory and design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating, at the design phase, the security of pattern classifiers, namely, the performance degradation under potential attacks they may incur during operation. We propose a framework for the empirical evaluation of classifier security that formalizes and generalizes the main ideas proposed in the literature, and give examples of its use in three real applications. Reported results show that security evaluation can provide a more complete understanding of the classifier's behavior in adversarial environments, and lead to better design choices.
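
    The design-phase comparison the framework formalizes, accuracy on clean data versus accuracy under a simulated attack, can be sketched end to end. Everything below (the nearest-centroid classifier, the synthetic data, and the shift-based evasion function) is an illustrative assumption, not the paper's attack model.

    ```python
    import random

    def centroid_classifier(train):
        """Train a nearest-centroid classifier on (features, label) pairs."""
        sums, counts = {}, {}
        for x, y in train:
            sums.setdefault(y, [0.0] * len(x))
            counts[y] = counts.get(y, 0) + 1
            sums[y] = [s + v for s, v in zip(sums[y], x)]
        cents = {y: [s / counts[y] for s in sums[y]] for y in sums}
        def predict(x):
            return min(cents, key=lambda y: sum((a - b) ** 2
                                                for a, b in zip(x, cents[y])))
        return predict

    def accuracy_under_attack(predict, test, evade):
        """Evaluate accuracy on clean data and on data perturbed by a
        hypothetical evasion function -- the design-phase comparison."""
        clean = sum(predict(x) == y for x, y in test) / len(test)
        attacked = sum(predict(evade(x)) == y for x, y in test) / len(test)
        return clean, attacked

    # Usage: spam-like 2D data; the attacker shifts samples toward "ham".
    random.seed(0)
    train = [([random.gauss(0, 1), random.gauss(0, 1)], "ham") for _ in range(50)] \
          + [([random.gauss(3, 1), random.gauss(3, 1)], "spam") for _ in range(50)]
    predict = centroid_classifier(train)
    print(accuracy_under_attack(predict, train[::5],
                                evade=lambda x: [v - 1.5 for v in x]))
    ```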

  • Shortest Path Computing in Relational DBMSs

    Page(s): 997 - 1011

    This paper uses shortest path discovery to study efficient relational approaches to graph search queries. We first abstract three enhanced relational operators, based on which we introduce an FEM framework to bridge the gap between relational operations and graph operations. We show that new features introduced by recent SQL standards, such as window functions and merge statements, can improve the performance of the FEM framework. Second, we propose an edge-weight-aware graph partitioning scheme and design a bi-directional restrictive BFS (breadth-first search) over partitioned tables, which improves scalability and performance without extra indexing overheads. Extensive experimental results illustrate that our relational approach with optimization strategies can achieve high scalability and performance.
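
    A minimal taste of the relational approach, run against SQLite from Python: keep the visited set and the current frontier as tables and expand levels with joins, in the spirit of a frontier-expand-merge loop. The table names are assumptions, the search is plain unweighted BFS rather than the paper's weighted bi-directional variant, and none of the SQL-standard optimizations are used.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE edge(src INTEGER, dst INTEGER)")
    cur.executemany("INSERT INTO edge VALUES (?, ?)",
                    [(1, 2), (2, 3), (1, 4), (4, 3), (3, 5)])
    cur.execute("CREATE TABLE visited(node INTEGER PRIMARY KEY, dist INTEGER)")
    cur.execute("CREATE TABLE frontier(node INTEGER)")
    cur.execute("INSERT INTO visited VALUES (1, 0)")   # source node
    cur.execute("INSERT INTO frontier VALUES (1)")

    dist = 0
    while True:
        dist += 1
        # Expand: unvisited neighbors of the current frontier.
        cur.execute("""
            CREATE TABLE next_frontier AS
            SELECT DISTINCT e.dst AS node FROM frontier f
            JOIN edge e ON e.src = f.node
            WHERE e.dst NOT IN (SELECT node FROM visited)""")
        if cur.execute("SELECT COUNT(*) FROM next_frontier").fetchone()[0] == 0:
            break
        # Merge the new level into the visited set and advance the frontier.
        cur.execute("INSERT INTO visited SELECT node, ? FROM next_frontier", (dist,))
        cur.execute("DELETE FROM frontier")
        cur.execute("INSERT INTO frontier SELECT node FROM next_frontier")
        cur.execute("DROP TABLE next_frontier")

    print(cur.execute("SELECT node, dist FROM visited ORDER BY node").fetchall())
    # [(1, 0), (2, 1), (3, 2), (4, 1), (5, 3)]
    ```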

  • Towards Online Shortest Path Computation

    Page(s): 1012 - 1025

    The online shortest path problem aims at computing the shortest path based on live traffic circumstances. This is very important in modern car navigation systems as it helps drivers to make sensible decisions. To the best of our knowledge, there is no efficient system/solution that can offer affordable costs at both the client and server sides for online shortest path computation. Unfortunately, the conventional client-server architecture scales poorly with the number of clients. A promising approach is to let the server collect live traffic information and then broadcast it over radio or a wireless network. This approach has excellent scalability with the number of clients. Thus, we develop a new framework called the live traffic index (LTI), which enables drivers to quickly and effectively collect the live traffic information on the broadcasting channel. An impressive result is that drivers can compute/update their shortest path results by receiving only a small fraction of the index. Our experimental study shows that LTI is robust to various parameters and offers relatively short tune-in cost (at the client side), fast query response time (at the client side), small broadcast size (at the server side), and light maintenance time (at the server side) for the online shortest path problem.

  • Versatile Size-l Object Summaries for Relational Keyword Search

    Page(s): 1026 - 1038

    The Object Summary (OS) is a recently proposed tree structure, which summarizes all data held in a relational database about a data subject. An OS can potentially be very large in size and therefore unfriendly for users who wish to view synoptic information about the data subject. In this paper, we investigate the effective and efficient retrieval of concise and informative OS snippets (denoted as size-l OSs). We propose and investigate the effectiveness of two types of size-l OSs, namely size-l OS(t)s and size-l OS(a)s, which consist of l tuple nodes and l attribute nodes, respectively. For computing size-l OSs, we propose an optimal dynamic programming algorithm, two greedy algorithms, and preprocessing heuristics. By collecting feedback from real users (e.g., from DBLP authors), we assess the relative usability of the two different types of snippets, the choice of the l parameter, and the effectiveness of the snippets with respect to user expectations. In addition, via thorough evaluation on real databases, we test the speed and effectiveness of our techniques.
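
    One of the greedy directions is easy to picture: grow the snippet from the OS root, repeatedly adding the most important node that stays connected to what has already been picked. The tree encoding and importance scores below are hypothetical, and the paper's optimal algorithm is dynamic programming rather than this greedy sketch.

    ```python
    def greedy_size_l_summary(nodes, importance, l):
        """Greedy sketch of size-l snippet extraction: pick the l most
        important nodes while keeping each picked node connected to the root
        through already-picked nodes. `nodes` maps node -> parent (root
        maps to None); importance scores are assumed to be given."""
        picked = {n for n, p in nodes.items() if p is None}  # always keep the root
        while len(picked) < l:
            frontier = [n for n, p in nodes.items() if n not in picked and p in picked]
            if not frontier:
                break
            picked.add(max(frontier, key=lambda n: importance[n]))
        return picked

    # Usage: a tiny OS tree about an author, with hypothetical importance scores.
    tree = {"author": None, "paper1": "author", "paper2": "author",
            "coauthor": "paper1"}
    scores = {"author": 1.0, "paper1": 0.9, "paper2": 0.4, "coauthor": 0.7}
    print(greedy_size_l_summary(tree, scores, l=3))  # {'author', 'paper1', 'coauthor'}
    ```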


Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University