
Knowledge and Data Engineering, IEEE Transactions on

Issue 1 • Date Jan. 2014

Displaying Results 1 - 23 of 23
  • Editorial [State of the Transactions]

    Page(s): 1 - 2
    PDF (177 KB)
    Freely Available from IEEE
  • A Comparative Study of Implementation Techniques for Query Processing in Multicore Systems

    Page(s): 3 - 15
    PDF (1869 KB)

    Multicore systems and multithreaded processing are now the de facto standards of enterprise and personal computing. If used in an uninformed way, however, multithreaded processing might actually degrade performance. We present the facets of the memory access bottleneck as they manifest in multithreaded processing and show their impact on query evaluation. We present a system design based on partition parallelism, memory pooling, and data structures conducive to multithreaded processing. Based on this design, we present alternative implementations of the most common query processing algorithms, which we experimentally evaluate using multiple scenarios and hardware platforms. Our results show that the design and algorithms are indeed scalable across platforms, but the choice of optimal algorithm largely depends on the problem parameters and underlying hardware. However, our proposals are a good first step toward generic multithreaded parallelism.
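
    The partition-parallelism design described above can be sketched in a few lines: each worker aggregates its own hash partition with no shared state, and the partial results are merged afterward. This is an illustrative sketch under assumed names, not the authors' implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partitioned_group_count(rows, key, n_partitions=4):
    # Hash-partition the input so each worker owns a disjoint slice.
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)

    def aggregate(part):
        # Each worker builds its own private counter -- no shared state,
        # hence no contention on a global hash table.
        counts = Counter()
        for row in part:
            counts[row[key]] += 1
        return counts

    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        for partial in pool.map(aggregate, partitions):
            merged.update(partial)
    return dict(merged)
```

    Because each thread touches only its own partition, workers never contend on a shared hash table, which is the kind of memory-access bottleneck the paper analyzes.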

  • A Rough Hypercuboid Approach for Feature Selection in Approximation Spaces

    Page(s): 16 - 29
    Multimedia
    PDF (3121 KB)

    The selection of relevant and significant features is an important problem, particularly for data sets with a large number of features. In this regard, a new feature selection algorithm is presented based on a rough hypercuboid approach. It selects a set of features from a data set by maximizing the relevance, dependency, and significance of the selected features. By introducing the concept of the hypercuboid equivalence partition matrix, a novel representation of the degree of dependency of sample categories on features is proposed to measure the relevance, dependency, and significance of features in approximation spaces. The equivalence partition matrix also offers an efficient way to calculate many quantitative measures describing the inexactness of approximate classification. Several quantitative indices are introduced based on the rough hypercuboid approach for evaluating the performance of the proposed method. The superiority of the proposed method over other feature selection methods, in terms of computational complexity and classification accuracy, is established extensively on various real-life data sets of different sizes and dimensions.

  • An Empirical Performance Evaluation of Relational Keyword Search Techniques

    Page(s): 30 - 42
    Multimedia
    PDF (1635 KB)

    Extending the keyword search paradigm to relational data has been an active area of research within the database and IR community during the past decade. Many approaches have been proposed, but despite numerous publications, there remains a severe lack of standardization for the evaluation of proposed search techniques. Lack of standardization has resulted in contradictory results from different evaluations, and the numerous discrepancies muddle what advantages are proffered by different approaches. In this paper, we present the most extensive empirical performance evaluation of relational keyword search techniques to appear to date in the literature. Our results indicate that many existing search techniques do not provide acceptable performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques from scaling beyond small data sets with tens of thousands of vertices. We also explore the relationship between execution time and factors varied in previous evaluations; our analysis indicates that most of these factors have relatively little impact on performance. In summary, our work confirms previous claims regarding the unacceptable performance of these search techniques and underscores the need for standardization in evaluations, standardization exemplified by the IR community.

  • Active Learning of Constraints for Semi-Supervised Clustering

    Page(s): 43 - 54
    PDF (1196 KB)

    Semi-supervised clustering aims to improve clustering performance by considering user supervision in the form of pairwise constraints. In this paper, we study the active learning problem of selecting pairwise must-link and cannot-link constraints for semi-supervised clustering. We consider active learning in an iterative manner, where in each iteration queries are selected based on the current clustering solution and the existing constraint set. We apply a general framework that builds on the concept of neighborhood, where neighborhoods contain "labeled examples" of different clusters according to the pairwise constraints. Our active learning method expands the neighborhoods by selecting informative points and querying their relationship with the neighborhoods. Under this framework, we build on the classic uncertainty-based principle and present a novel approach for computing the uncertainty associated with each data point. We further introduce a selection criterion that trades off the amount of uncertainty of each data point with the expected number of queries (the cost) required to resolve this uncertainty. This allows us to select queries that have the highest information rate. We evaluate the proposed method on benchmark data sets, and the results demonstrate consistent and substantial improvements over the current state of the art.
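
    The uncertainty/cost tradeoff the abstract describes can be illustrated as follows. Entropy stands in for the paper's uncertainty measure, and the query-cost model (neighborhoods queried in decreasing-probability order until a must-link answer is found) is a simplifying assumption, so all names and formulas are illustrative only.

```python
import math

def information_rate(probs):
    # Uncertainty: entropy over the point's estimated neighborhood
    # memberships (probs sums to 1).
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    # Expected query cost, assuming neighborhoods are queried in
    # decreasing-probability order until a must-link answer is found.
    ordered = sorted(probs, reverse=True)
    expected_queries = sum((i + 1) * p for i, p in enumerate(ordered))
    return entropy / expected_queries if expected_queries else 0.0

def select_query(candidates):
    # Pick the point whose uncertainty is cheapest to resolve per query.
    return max(candidates, key=lambda c: information_rate(c["probs"]))
```

    A point already certain of its neighborhood has rate 0 and is never queried; an ambiguous point with few plausible neighborhoods gives the most information per query.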

  • Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme

    Page(s): 55 - 68
    PDF (1350 KB)

    Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and computes the shortest distances from each landmark to all nodes as an embedding. To answer a shortest distance query, the precomputed distances from the landmarks to the two query nodes are used to compute an approximate shortest distance based on the triangle inequality. In this paper, we analyze the factors that affect the accuracy of distance estimation in landmark embedding. In particular, we find that a globally selected, query-independent landmark set may introduce a large relative error, especially for nearby query nodes. To address this issue, we propose a query-dependent local landmark scheme, which identifies a local landmark close to both query nodes and provides more accurate distance estimation than the traditional global landmark approach. We propose efficient local landmark indexing and retrieval techniques, which achieve low offline indexing complexity and online query complexity. Two optimization techniques on graph compression and graph online search are also proposed, with the goal of further reducing index size and improving query accuracy. Furthermore, the challenge of immense graphs whose index may not fit in the memory leads us to store the embedding in a relational database, so that a query of the local landmark scheme can be expressed with relational operators. Effective indexing and query optimization mechanisms are designed in this context. Our experimental results on large-scale social networks and road networks demonstrate that the local landmark scheme reduces the shortest distance estimation error significantly when compared with global landmark embedding and the state-of-the-art sketch-based embedding.
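
    A minimal sketch of the landmark-embedding baseline the abstract builds on: precompute BFS distances from each landmark offline, then bound d(u, v) by d(l, u) + d(l, v) via the triangle inequality. The paper's query-dependent scheme goes further by picking a landmark close to both query nodes; this sketch shows only the global estimate, with illustrative names.

```python
from collections import deque

def bfs_distances(adj, src):
    # Unweighted single-source shortest distances from a landmark.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_embedding(adj, landmarks):
    # Offline step: one distance vector per landmark.
    return {l: bfs_distances(adj, l) for l in landmarks}

def estimate_distance(embedding, u, v):
    # Triangle inequality: d(u, v) <= d(l, u) + d(l, v) for every landmark l,
    # so the tightest landmark gives the best upper bound.
    return min(dists[u] + dists[v] for dists in embedding.values())
```

    On a path graph 0-1-2-3, a far-away landmark at node 0 estimates d(1, 3) as 4, while a landmark near both query nodes (node 2) returns the exact distance 2, matching the paper's observation about nearby query nodes.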

  • Automatic Generation of the Domain Module from Electronic Textbooks: Method and Validation

    Page(s): 69 - 82
    PDF (1728 KB)

    Technology-supported learning systems have proved to be helpful in many learning situations. These systems require an appropriate representation of the knowledge to be learned, the Domain Module. The authoring of the Domain Module is cost- and labor-intensive, but its development cost might be reduced by profiting from semiautomatic Domain Module authoring techniques and promoting knowledge reuse. DOM-Sortze is a system that uses natural language processing techniques, heuristic reasoning, and ontologies for the semiautomatic construction of the Domain Module from electronic textbooks. To determine how it might help in the Domain Module authoring process, it has been tested with an electronic textbook, and the gathered knowledge has been compared with the Domain Module that instructional designers developed manually. This paper presents DOM-Sortze and describes the experiment carried out.

  • Consensus-Based Ranking of Multivalued Objects: A Generalized Borda Count Approach

    Page(s): 83 - 96
    Multimedia
    PDF (1968 KB)

    In this paper, we tackle a novel problem of ranking multivalued objects, where an object has multiple instances in a multidimensional space, and the number of instances per object is not fixed. Given an ad hoc scoring function that assigns a score to a multidimensional instance, we want to rank a set of multivalued objects. Different from the existing models of ranking uncertain and probabilistic data, which model an object as a random variable whose instances are assumed mutually exclusive, we have to capture the coexistence of instances here. To tackle the problem, we advocate the semantics of favoring widely preferred objects instead of majority votes, which is widely used in many elections and competitions. Technically, we borrow the idea from Borda Count (BC), a well-recognized method in consensus-based voting systems. However, Borda Count cannot handle multivalued objects of inconsistent cardinality, and is costly to evaluate top-k queries on large multidimensional data sets. To address the challenges, we extend and generalize Borda Count to quantile-based Borda Count, and develop efficient computational methods with comprehensive cost analysis. We present case studies on real data sets to demonstrate the effectiveness of the generalized Borda Count ranking, and use synthetic and real data sets to verify the efficiency of our computational method.
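
    The flavor of a quantile-based Borda Count can be sketched as follows, where each quantile level acts as one "voter" over objects with differing numbers of instances. This is a simplified reading of the abstract, not the paper's actual definition; names and the quantile convention are assumptions.

```python
def quantile(values, phi):
    # Lower phi-quantile of an instance-score list (any cardinality).
    vals = sorted(values)
    return vals[min(int(phi * len(vals)), len(vals) - 1)]

def quantile_borda_rank(objects, levels=(0.25, 0.5, 0.75)):
    # objects: {name: [instance scores]}. Each quantile level acts as one
    # "voter": it ranks objects by that quantile of their scores (higher is
    # better) and awards Borda points, n-1 for first place down to 0.
    n = len(objects)
    points = dict.fromkeys(objects, 0)
    for phi in levels:
        ranked = sorted(objects, key=lambda name: quantile(objects[name], phi),
                        reverse=True)
        for place, name in enumerate(ranked):
            points[name] += n - 1 - place
    return sorted(points, key=points.get, reverse=True)
```

    Because each object is reduced to quantiles before ranking, objects with different instance counts become comparable, which is the inconsistency plain Borda Count cannot handle.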

  • Data mining with big data

    Page(s): 97 - 107
    PDF (424 KB)

    Big Data refers to large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.

  • Decision Trees for Mining Data Streams Based on the Gaussian Approximation

    Page(s): 108 - 119
    PDF (971 KB)

    Since the Hoeffding tree algorithm was proposed in the literature, decision trees have become one of the most popular tools for mining data streams. The key point of constructing the decision tree is to determine the best attribute to split the considered node. Several methods to solve this problem have been presented so far. However, they are either incorrectly justified mathematically (e.g., in the Hoeffding tree algorithm) or time-consuming (e.g., in the McDiarmid tree algorithm). In this paper, we propose a new method which significantly outperforms the McDiarmid tree algorithm and has a solid mathematical basis. Our method ensures, with a high probability set by the user, that the best attribute chosen in the considered node using a finite data sample is the same as it would be in the case of the whole data stream.
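
    To illustrate the kind of statistical guarantee involved, here is a sketch contrasting the classic Hoeffding confidence radius with a Gaussian-approximation radius around the estimated gain gap. The exact bound derived in the paper differs, so treat the formulas and names as illustrative.

```python
import math

def hoeffding_radius(value_range, delta, n):
    # Classic Hoeffding confidence radius used by the original algorithm.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def gaussian_radius(std_estimate, delta, n):
    # Gaussian-approximation radius: the (1 - delta) normal quantile times
    # the standard error of the estimated gain gap. The quantile is found by
    # bisecting the normal CDF expressed via math.erf.
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < 1.0 - delta:
            lo = mid
        else:
            hi = mid
    return lo * std_estimate / math.sqrt(n)

def should_split(gain_best, gain_second, radius):
    # Split once the observed gap exceeds the confidence radius.
    return (gain_best - gain_second) > radius
```

    Both radii shrink as the number of observed examples n grows, so with enough stream data the learner can commit to the best split attribute with probability at least 1 - delta.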

  • Discovering Emerging Topics in Social Streams via Link-Anomaly Detection

    Page(s): 120 - 130
    PDF (2853 KB)

    Detection of emerging topics is now receiving renewed interest motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged in social-network posts includes not only text but also images, URLs, and videos. We focus on the emergence of topics signaled by social aspects of these networks. Specifically, we focus on mentions of users, links between users that are generated dynamically (intentionally or unintentionally) through replies, mentions, and retweets. We propose a probability model of the mentioning behavior of a social network user, and propose to detect the emergence of a new topic from the anomalies measured through the model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social-network posts. We demonstrate our technique on several real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches, and in some cases much earlier when the topic is poorly identified by the textual contents in posts.
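
    The aggregation step can be illustrated with a toy binomial stand-in for the paper's mentioning model: score each user's observed mention count by its negative log-likelihood, then sum the scores over users within a window. The binomial model and all names here are assumptions for illustration only.

```python
import math

def mention_anomaly(count, rate, n_posts):
    # Negative log-likelihood of observing `count` mentions in `n_posts`
    # posts under a toy binomial mentioning model with per-post rate `rate`.
    likelihood = (math.comb(n_posts, count)
                  * rate ** count * (1.0 - rate) ** (n_posts - count))
    return -math.log(likelihood) if likelihood > 0 else float("inf")

def aggregate_scores(user_scores, threshold):
    # Sum per-user anomaly scores within a time window and flag a potential
    # emerging topic when the aggregate crosses the threshold.
    total = sum(user_scores)
    return total, total > threshold
```

    A user who suddenly mentions far more peers than the model predicts contributes a large score, and many such users at once push the aggregate over the threshold.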

  • Ensembles of α-Trees for Imbalanced Classification Problems

    Page(s): 131 - 143
    PDF (1687 KB)

    This paper introduces two kinds of decision tree ensembles for imbalanced classification problems, extensively utilizing properties of α-divergence. First, a novel splitting criterion based on α-divergence is shown to generalize several well-known splitting criteria such as those used in C4.5 and CART. When the α-divergence splitting criterion is applied to imbalanced data, one can obtain decision trees that tend to be less correlated (α-diversification) by varying the value of α. This increased diversity in an ensemble of such trees improves AUROC values across a range of minority class priors. The second ensemble uses the same α-trees as base classifiers, but uses a lift-aware stopping criterion during tree growth. The resultant ensemble produces a set of interpretable rules that provide higher lift values for a given coverage, a property that is highly desirable in applications such as direct marketing. Experimental results across many class-imbalanced data sets, including the BRFSS and MIMIC data sets from the medical community and several sets from UCI and KEEL, are provided to highlight the effectiveness of the proposed ensembles over a wide range of data distributions and degrees of class imbalance.
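
    The way an α-parameterized impurity can subsume familiar splitting criteria is easy to see with Tsallis-style generalized entropy, which recovers Shannon entropy as α approaches 1 and the Gini index at α = 2. The paper's α-divergence criterion is related but not identical, so this is only a sketch with assumed names.

```python
import math

def generalized_entropy(probs, alpha):
    # Tsallis-style entropy: alpha -> 1 recovers Shannon entropy (in nats)
    # and alpha = 2 recovers the Gini index.
    if abs(alpha - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)

def split_gain(parent, children, alpha):
    # Impurity reduction of a candidate split; `children` pairs each child's
    # class distribution with its fraction of the parent's examples.
    weighted = sum(w * generalized_entropy(dist, alpha) for dist, w in children)
    return generalized_entropy(parent, alpha) - weighted
```

    Training each ensemble member with a different α yields trees that disagree on borderline splits, which is the α-diversification effect the abstract credits for the AUROC gains.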

  • Event Characterization and Prediction Based on Temporal Patterns in Dynamic Data System

    Page(s): 144 - 156
    PDF (1613 KB)

    The new method proposed in this paper applies a multivariate reconstructed phase space (MRPS) for identifying multivariate temporal patterns that are characteristic and predictive of anomalies or events in a dynamic data system. The new method extends the original univariate reconstructed phase space framework, which is based on a fuzzy unsupervised clustering method, by incorporating a new mechanism of data categorization based on the definition of events. In addition to modeling temporal dynamics in a multivariate phase space, a Bayesian approach is applied to model the first-order Markov behavior in the multidimensional data sequences. The method utilizes an exponential loss objective function to optimize a hybrid classifier which consists of a radial basis kernel function and a log-odds ratio component. We performed experimental evaluation on three data sets to demonstrate the feasibility and effectiveness of the proposed approach.

  • Linkable Ring Signature with Unconditional Anonymity

    Page(s): 157 - 165
    PDF (558 KB)

    In this paper, we construct a linkable ring signature scheme with unconditional anonymity. The construction of an unconditionally anonymous linkable ring signature scheme has been regarded as an open problem since it was posed in [22] in 2004. We are the first to solve this open problem by giving a concrete instantiation, which is proven secure in the random oracle model. Our construction is even more efficient than other schemes that can only provide computational anonymity. Simultaneously, our scheme acts as a counterexample showing that [19, Theorem 1], which stated that a linkable ring signature scheme cannot provide strong anonymity, is not always true. Yet we prove that our scheme can achieve strong anonymity (under one of the interpretations).

  • Mining Weakly Labeled Web Facial Images for Search-Based Face Annotation

    Page(s): 166 - 179
    Multimedia
    PDF (1690 KB)

    This paper investigates a framework of search-based face annotation (SBFA) by mining weakly labeled facial images that are freely available on the World Wide Web (WWW). One challenging problem for the search-based face annotation scheme is how to effectively perform annotation by exploiting the list of most similar facial images and their weak labels, which are often noisy and incomplete. To tackle this problem, we propose an effective unsupervised label refinement (ULR) approach for refining the labels of web facial images using machine learning techniques. We formulate the learning problem as a convex optimization and develop effective optimization algorithms to solve the large-scale learning task efficiently. To further speed up the proposed scheme, we also propose a clustering-based approximation algorithm which can improve the scalability considerably. We have conducted an extensive set of empirical studies on a large-scale web facial image testbed, and the encouraging results show that the proposed ULR algorithms can significantly boost the performance of the promising SBFA scheme.

  • Privacy-Preserving Enhanced Collaborative Tagging

    Page(s): 180 - 193
    Multimedia
    PDF (1726 KB)

    Collaborative tagging is one of the most popular services available online, and it allows end users to loosely classify either online or offline resources based on their feedback, expressed in the form of free-text labels (i.e., tags). Although tags may not be per se sensitive information, the wide use of collaborative tagging services increases the risk of cross referencing, thereby seriously compromising user privacy. In this paper, we make a first contribution toward the development of a privacy-preserving collaborative tagging service, by showing how a specific privacy-enhancing technology, namely tag suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the effectiveness of a policy-based collaborative tagging system that supports enhanced web access functionalities, like content filtering and discovery, based on preferences specified by end users.

  • Rough Sets, Kernel Set, and Spatiotemporal Outlier Detection

    Page(s): 194 - 207
    Multimedia
    PDF (2023 KB)

    Nowadays, the high availability of data gathered from wireless sensor networks and telecommunication systems has drawn the attention of researchers to the problem of extracting knowledge from spatiotemporal data. Detecting outliers that are grossly different from or inconsistent with the remaining spatiotemporal data set is a major challenge in real-world knowledge discovery and data mining applications. In this paper, we deal with the outlier detection problem in spatiotemporal data and describe a rough set approach that finds the top outliers in an unlabeled spatiotemporal data set. The proposed method, called Rough Outlier Set Extraction (ROSE), relies on a rough set theoretic representation of the outlier set using the rough set approximations, i.e., lower and upper approximations. We have also introduced a new set, named the Kernel Set, a subset of the original data set that is able to describe the original data set both in terms of data structure and of obtained results. Experimental results on real-world data sets demonstrate the superiority of ROSE, both in terms of some quantitative indices and outliers detected, over those obtained by various rough fuzzy clustering algorithms and by the state-of-the-art outlier detection methods. It is also demonstrated that the Kernel Set is able to detect the same outliers in less computational time.

  • Semisupervised Wrapper Choice and Generation for Print-Oriented Documents

    Page(s): 208 - 220
    PDF (555 KB)

    Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists, and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of normal system operation: new wrappers are generated online based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging data set composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, and patents. We also perform an extensive analysis of the crucial tradeoff between accuracy and automation level.

  • Spatially Aware Term Selection for Geotagging

    Page(s): 221 - 234
    PDF (1875 KB)

    The task of assigning geographic coordinates to textual resources plays an increasingly central role in geographic information retrieval. The ability to select those terms from a given collection that are most indicative of geographic location is of key importance in successfully addressing this task. However, this process of selecting spatially relevant terms is at present not well understood, and the majority of current systems are based on standard term selection techniques, such as χ² or information gain, and thus fail to exploit the spatial nature of the domain. In this paper, we propose two classes of term selection techniques based on standard geostatistical methods. First, to implement the idea of spatial smoothing of term occurrences, we investigate the use of kernel density estimation (KDE) to model each term as a two-dimensional probability distribution over the surface of the Earth. The second class of term selection methods we consider is based on Ripley's K statistic, which measures the deviation of a point set from spatial homogeneity. We provide experimental results which compare these classes of methods against existing baseline techniques on the tasks of assigning coordinates to Flickr photos and to Wikipedia articles, revealing marked improvements in cases where only a relatively small number of terms can be selected.
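
    The KDE idea can be sketched with an isotropic Gaussian kernel over coordinates. A real geotagger would work on the sphere and tune the bandwidth, so this flat-plane version with assumed names is purely illustrative.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    # Build a density function over (x, y) from a term's occurrence
    # coordinates using an isotropic Gaussian kernel.
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2.0 * bandwidth ** 2))
        return norm * total

    return density
```

    A spatially indicative term yields a sharply peaked density around one location, while a common word spreads its mass thinly; comparing the two shapes is the intuition behind KDE-based term scores.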

  • Structural Diversity for Resisting Community Identification in Published Social Networks

    Page(s): 235 - 252
    PDF (2368 KB)

    As increasing amounts of social networking data are published and shared for commercial and research purposes, privacy issues about the individuals in social networks have become serious concerns. Vertex identification, which identifies a particular user from a network based on background knowledge such as vertex degree, is one of the most important problems that have been addressed. In reality, however, each individual in a social network is inclined to be associated with not only a vertex identity but also a community identity, which can represent personal information sensitive to the public, such as political party affiliation. This paper first addresses the new privacy issue, referred to as community identification, by showing that the community identity of a victim can still be inferred even though the social network is protected by existing anonymity schemes. For this problem, we then propose the concept of structural diversity to provide anonymity of the community identities. The k-Structural Diversity Anonymization (k-SDA) scheme ensures sufficient vertices with the same vertex degree in at least k communities in a social network. We propose an Integer Programming formulation to find optimal solutions to k-SDA and also devise scalable heuristics to solve large-scale instances of k-SDA from different perspectives. The performance studies on real data sets from various perspectives demonstrate the practical utility of the proposed privacy scheme and our anonymization approaches.

  • 2013 Reviewers list

    Page(s): 253 - 259
    PDF (130 KB)
    Freely Available from IEEE
  • IEEE Open Access Publishing

    Page(s): 260
    PDF (418 KB)
    Freely Available from IEEE
  • 2013 Annual Index

    Page(s): not in print
    PDF (1086 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University