Knowledge and Data Engineering, IEEE Transactions on

Issue 6 • June 2013

Displaying Results 1 - 19 of 19
  • A Survival Modeling Approach to Biomedical Search Result Diversification Using Wikipedia

    Page(s): 1201 - 1212

    In this paper, we propose a survival modeling approach to promoting ranking diversity for biomedical information retrieval. The proposed approach is concerned with finding relevant documents that cover more distinct aspects of a query. First, two probabilistic models derived from survival analysis theory are proposed for measuring aspect novelty. Second, a new method using Wikipedia to detect aspects covered by retrieved documents is presented. Third, an aspect filter based on a two-stage model is introduced; it ranks the detected aspects in decreasing order of the probability that an aspect is generated by the query. Finally, the relevance and the novelty of retrieved documents are combined at the aspect level for reranking. Experiments conducted on the TREC 2006 and 2007 Genomics collections demonstrate the effectiveness of the proposed approach in promoting ranking diversity for biomedical information retrieval. We further evaluate the approach in a Web retrieval setting; results on the ClueWeb09-T09B collection show that it achieves promising performance improvements.

  • Centroid-Based Actionable 3D Subspace Clustering

    Page(s): 1213 - 1226

    Actionable 3D subspace clustering of real-world continuous-valued 3D (i.e., object-attribute-context) data promises tangible benefits such as the discovery of biologically significant protein residues and profitable stocks, but existing algorithms are inadequate for this clustering problem: most are not actionable (i.e., cannot suggest profitable or beneficial actions to users), do not allow the incorporation of domain knowledge, and are parameter sensitive, meaning that a wrong threshold setting reduces cluster quality. Moreover, the 3D structure of the data further complicates the problem. We propose a centroid-based actionable 3D subspace clustering framework, named CATSeeker, which allows the incorporation of domain knowledge and achieves parameter insensitivity and excellent performance through a unique combination of singular value decomposition, numerical optimization, and 3D frequent itemset mining. Experimental results on synthetic, protein structural, and financial data show that CATSeeker significantly outperforms all competing methods in terms of efficiency, parameter insensitivity, and cluster usefulness.

  • Constrained Text Coclustering with Supervised and Unsupervised Constraints

    Page(s): 1227 - 1239

    In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve the first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints, and then use an alternating expectation maximization (EM) algorithm to optimize the model. We also propose two novel methods to automatically construct and incorporate document and word constraints for unsupervised constrained clustering: 1) document constraints are built from overlapping named entities (NEs) extracted by an NE extractor; 2) word constraints are built from the semantic distance between words inferred from WordNet. Evaluation on two benchmark data sets demonstrates that our approaches outperform a number of existing methods.
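
    One ingredient above is the semantic distance between words inferred from WordNet. As a minimal, hedged sketch of that ingredient only (not of the HMRF coclustering model itself), the code below uses NLTK's WordNet interface to score word pairs and keeps those above a similarity threshold as candidate word constraints; the use of path similarity and the 0.25 threshold are illustrative assumptions.

```python
# Hedged sketch: derive word must-link constraints from WordNet similarity.
# Assumptions: NLTK with the WordNet corpus installed (nltk.download('wordnet')),
# path similarity as the distance proxy, and an illustrative threshold of 0.25.
from itertools import combinations
from nltk.corpus import wordnet as wn

def max_path_similarity(w1, w2):
    """Best path similarity over all synset pairs of the two words (0.0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def word_constraints(vocabulary, threshold=0.25):
    """Return word pairs whose WordNet similarity exceeds the (assumed) threshold."""
    return [(w1, w2) for w1, w2 in combinations(vocabulary, 2)
            if max_path_similarity(w1, w2) >= threshold]

print(word_constraints(["car", "automobile", "banana"]))
```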

  • Crowdsourced Trace Similarity with Smartphones

    Page(s): 1240 - 1253

    Smartphones are nowadays equipped with a number of sensors, such as WiFi, GPS, and accelerometers. This capability allows smartphone users to easily engage in crowdsourced computing services, which contribute to the solution of complex problems in a distributed manner. In this work, we leverage this computing paradigm to efficiently solve the following problem: comparing a query trace Q against a crowd of traces generated and stored on distributed smartphones. Our proposed framework, coined SmartTrace+, provides an effective solution without disclosing any part of the crowd traces to the query processor. SmartTrace+ relies on an in-situ data storage model and intelligent top-K query processing algorithms that exploit distributed trajectory similarity measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to Q. We evaluate our algorithms on both synthetic and real workloads, and describe our prototype system developed on the Android OS and deployed over our own SmartLab testbed of 25 smartphones. Our study reveals that computations over SmartTrace+ result in substantial energy savings and that results can be computed faster than with competing approaches.

  • Customized Policies for Handling Partial Information in Relational Databases

    Page(s): 1254 - 1271

    Most real-world databases have at least some missing data. Today, users of such databases are largely on their own in managing this incompleteness. In this paper, we propose the general concept of a partial information policy (PIP) operator to handle incompleteness in relational databases. PIP operators build upon preference frameworks for incomplete information, but accommodate different types of incomplete data (e.g., a value exists but is not known; a value does not exist; a value may or may not exist). Different users handle incompleteness in different ways; PIP operators allow them to specify a policy that matches their attitude to risk, their knowledge of the application, and how the data was collected. We propose index structures for efficiently evaluating PIP operators and experimentally assess their effectiveness on a real-world airline data set. We also study how relational algebra operators and PIP operators interact with one another.

  • Decision Trees for Mining Data Streams Based on the McDiarmid's Bound

    Page(s): 1272 - 1279

    In mining data streams, the most popular tool is the Hoeffding tree algorithm. It uses Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature, the same Hoeffding bound has been used for any evaluation function (heuristic measure), e.g., information gain or the Gini index. In this paper, it is shown that Hoeffding's inequality is not appropriate for this problem. We prove two theorems presenting McDiarmid's bound both for information gain, used in the ID3 algorithm, and for the Gini index, used in the Classification and Regression Trees (CART) algorithm. The results guarantee that a decision tree learning system applied to data streams and based on McDiarmid's bound produces output nearly identical to that of a conventional learner. These results have a significant impact on the state of the art of mining data streams, and various methods and algorithms developed so far should be reconsidered.
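
    For context, the Hoeffding bound referenced above states that after n independent observations of a random variable with range R, the sample mean is within ε = sqrt(R^2 ln(1/δ) / (2n)) of the true mean with probability at least 1 - δ. The sketch below shows how a Hoeffding-tree-style learner typically applies this bound to decide whether it has seen enough examples to choose a split attribute; it illustrates the standard Hoeffding-based test that the paper critiques, not the McDiarmid-based replacement derived in the paper.

```python
# Hedged sketch: the standard Hoeffding-bound split test used by Hoeffding trees.
# The paper argues this bound is not the right tool for heuristics such as
# information gain; the McDiarmid-based correction itself is not reproduced here.
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the sample mean
    with probability at least 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_best_gain, n_examples, n_classes, delta=1e-7):
    """Split when the observed gain advantage exceeds the bound."""
    value_range = math.log2(n_classes)      # range of information gain
    eps = hoeffding_bound(value_range, delta, n_examples)
    return (best_gain - second_best_gain) > eps

print(should_split(0.30, 0.21, n_examples=5000, n_classes=2))
```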

  • Discovering Characterizations of the Behavior of Anomalous Subpopulations

    Page(s): 1280 - 1292

    We consider the problem of discovering attributes, or properties, accounting for the a priori stated abnormality of a group of anomalous individuals (the outliers) with respect to an overall given population (the inliers). To this aim, we introduce the notion of an exceptional property and define the concept of an exceptionality score, which measures the significance of a property. In particular, in order to single out exceptional properties, we resort to a form of minimum distance estimation for evaluating the badness of fit of the values assumed by the outliers compared with the probability distribution associated with the values assumed by the inliers. Suitable exceptionality scores are introduced for both numeric and categorical attributes. These scores are, from both the analytical and the empirical point of view, designed to be effective for small samples, as is the case for outliers. We present an algorithm, called EXPREX, for efficiently discovering exceptional properties. The algorithm reduces the required computational effort by not exploring many irrelevant numerical intervals and by exploiting suitable pruning rules. The experimental results confirm that our technique is able to characterize outliers in a natural manner.

  • FoCUS: Learning to Crawl Web Forums

    Page(s): 1293 - 1306

    In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads contain the information content that is the target of forum crawlers. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths connected by specific URL types that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL-type recognition problem and show how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page-type classifiers. Robust page-type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying FoCUS to more than 100 community question-and-answer sites and blog sites demonstrate that the concept of implicit navigation paths applies to other social media sites as well.
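
    The URL types mentioned above can be made concrete with a small sketch that classifies forum URLs by regular expressions. The patterns below are hand-written, hypothetical examples in a phpBB-like URL style; FoCUS learns such patterns automatically from training sets produced by weak page-type classifiers rather than relying on hand-written rules.

```python
# Hedged sketch: classifying forum URLs by type with regular expressions.
# These patterns are hypothetical examples for a phpBB-like URL layout; FoCUS
# would learn such patterns automatically rather than use hand-written ones.
import re

URL_TYPE_PATTERNS = {
    "index":  re.compile(r"viewforum\.php\?f=\d+$"),           # board/index pages
    "thread": re.compile(r"viewtopic\.php\?t=\d+$"),           # thread pages
    "paging": re.compile(r"viewtopic\.php\?t=\d+&start=\d+$"), # page-flipping links
}

def classify_url(url):
    """Return the first matching URL type, or 'other'."""
    for url_type, pattern in URL_TYPE_PATTERNS.items():
        if pattern.search(url):
            return url_type
    return "other"

print(classify_url("http://forum.example.com/viewtopic.php?t=42"))
```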

  • Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy

    Page(s): 1307 - 1322

    Pointwise mutual information (PMI) is a widely used word similarity measure, but a clear explanation of why it works as a similarity measure has been lacking. We explore how PMI differs from distributional similarity and introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that PMImax outperforms traditional PMI in automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMImax achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating data set.
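
    As background for the comparison above, plain PMI between words x and y is log(p(x, y) / (p(x) p(y))) estimated from corpus counts. The sketch below computes this baseline from toy co-occurrence counts; the polysemy-aware coefficients of PMImax are fitted empirically in the paper and are not reproduced here.

```python
# Hedged sketch: baseline pointwise mutual information from corpus counts.
# A single normalizer is used for simplicity; PMImax's sense-count correction
# and fitted coefficients are not shown.
import math
from collections import Counter

def pmi(pair_counts, word_counts, total, x, y):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from counts."""
    p_xy = pair_counts[(x, y)] / total
    p_x = word_counts[x] / total
    p_y = word_counts[y] / total
    if p_xy == 0.0:
        return float("-inf")
    return math.log(p_xy / (p_x * p_y))

# Toy counts purely for illustration.
pairs = Counter({("doctor", "nurse"): 40, ("doctor", "banana"): 1})
words = Counter({"doctor": 200, "nurse": 120, "banana": 90})
print(pmi(pairs, words, total=10000, x="doctor", y="nurse"))
```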

  • Incentive Compatible Privacy-Preserving Data Analysis

    Page(s): 1323 - 1335

    In many cases, competing parties who have private data may collaboratively conduct privacy-preserving distributed data analysis (PPDA) tasks to learn beneficial data models or analysis results. Most often, the competing parties have different incentives. Although certain PPDA techniques guarantee that nothing other than the final analysis result is revealed, it is impossible to verify whether participating parties are truthful about their private input data. Unless proper incentives are set, current PPDA techniques cannot prevent participating parties from modifying their private inputs. This raises the question of how to design incentive compatible privacy-preserving data analysis techniques that motivate participating parties to provide truthful inputs. In this paper, we first develop key theorems and then, based on these theorems, analyze certain important privacy-preserving data analysis tasks that can be conducted in such a way that telling the truth is the best choice for any participating party.

  • Nonnegative Matrix Factorization: A Comprehensive Review

    Page(s): 1336 - 1353

    Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains a parts-based representation while enhancing the interpretability of the result. This survey mainly focuses on the theoretical research into NMF over the last five years, systematically summarizing the principles, basic models, properties, and algorithms of NMF along with its various modifications, extensions, and generalizations. Existing NMF algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF), Structured NMF (SNMF), and Generalized NMF (GNMF), for which the design principles, characteristics, problems, relationships, and evolution are presented and analyzed comprehensively. Related work that is not based on NMF but has connections with it, or from which NMF can borrow ideas, is also covered, and open issues that remain to be solved are discussed. Several relevant application areas of NMF are briefly described as well. This survey aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
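
    For readers new to the area, the Basic NMF category above corresponds to the classical factorization V ≈ WH with W, H ≥ 0. The sketch below implements the standard multiplicative update rules for the Frobenius-norm objective (the textbook Lee-Seung baseline); it is not any of the constrained, structured, or generalized variants surveyed in the paper.

```python
# Hedged sketch: Basic NMF via the classical multiplicative updates
# for || V - W H ||_F^2 (Lee & Seung). Baseline only, not the survey's variants.
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, nonnegativity preserved
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, nonnegativity preserved
    return W, H

V = np.abs(np.random.default_rng(1).random((20, 12)))
W, H = nmf(V, rank=4)
print(np.linalg.norm(V - W @ H))               # reconstruction error
```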

  • On Identifying Critical Nuggets of Information during Classification Tasks

    Page(s): 1354 - 1367

    In large databases, there may exist critical nuggets: small collections of records or instances that contain domain-specific important information. This information can be used for future decision making, such as labeling critical, unlabeled data records and improving classification results by reducing false positive and false negative errors. This work introduces the idea of critical nuggets, proposes an innovative domain-independent method to measure criticality, suggests a heuristic to reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from several real-world data sets. It turns out that only a few subsets qualify as critical nuggets, underscoring the importance of finding them; the proposed methodology can detect them. This work also identifies certain properties of critical nuggets and provides experimental validation of these properties. Experimental results further confirm that critical nuggets can help improve classification accuracy on real-world data sets.

  • Radio Database Compression for Accurate Energy-Efficient Localization in Fingerprinting Systems

    Page(s): 1368 - 1379

    Location fingerprinting is a positioning method that exploits already existing infrastructure such as cellular networks or WLANs. In view of the recent demand for energy-efficient networks and the emergence of issues such as green networking, we propose a clustering technique to compress the radio database in cellular fingerprinting systems. The aim of the proposed technique is to reduce the computation cost and transmission load of mobile-based implementations. The presented method, called the Block-based Weighted Clustering (BWC) technique, is applied in a concatenated location-radio signal space and attributes different weight factors to the location and radio components. Computer simulations and real experiments have been conducted to evaluate the performance of the proposed technique in the context of a GSM network. The obtained results confirm the efficiency of the BWC technique and show that it improves on standard k-means and hierarchical clustering methods.
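
    The core idea above, clustering in a concatenated location-radio space with different weights on the two components, can be sketched by building weighted feature vectors and handing them to ordinary k-means. The weight values below are illustrative assumptions, and the block-based construction that distinguishes BWC is not reproduced.

```python
# Hedged sketch: clustering fingerprints in a weighted, concatenated
# location + radio-signal space. Weights are illustrative assumptions;
# the block-based construction of BWC itself is not shown.
import numpy as np
from sklearn.cluster import KMeans

def weighted_fingerprint_clusters(locations, rss, w_loc=1.0, w_rss=0.2, k=8):
    """locations: (n, 2) coordinates; rss: (n, d) received signal strengths."""
    X = np.hstack([w_loc * locations, w_rss * rss])   # concatenated weighted space
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return model.labels_, model.cluster_centers_

rng = np.random.default_rng(0)
labels, centers = weighted_fingerprint_clusters(rng.random((200, 2)),
                                                rng.normal(-80, 6, (200, 5)))
print(len(set(labels)))
```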

  • Semi-Supervised Nonlinear Hashing Using Bootstrap Sequential Projection Learning

    Page(s): 1380 - 1393

    In this paper, we study an effective semi-supervised hashing method under the framework of regularized learning-based hashing. A nonlinear hash function is introduced to capture the underlying relationships among data points. As a result, the dimensionality of the matrix used for computation is not only independent of the dimensionality of the original data space but also much smaller than with a linear hash function. To effectively deal with the error accumulated when converting real-valued embeddings into binary codes after relaxation, we propose a semi-supervised nonlinear hashing algorithm using bootstrap sequential projection learning, which corrects the errors by holistically taking into account all previously learned bits without incurring extra computational overhead. Experimental results on six benchmark data sets demonstrate that the presented method outperforms state-of-the-art hashing algorithms by a large margin.
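
    To make the nonlinear hash function idea concrete, the sketch below forms hash bits from RBF kernel similarities to a small set of anchor points, a projection, and a sign function. It is a generic kernelized-hashing illustration under assumed parameters, not the paper's bootstrap sequential projection learning, which additionally corrects binarization errors using all previously learned bits.

```python
# Hedged sketch: generic nonlinear (kernelized) hashing via anchor-point
# RBF features, a random projection, and sign binarization. This is NOT the
# paper's bootstrap sequential projection learning; parameters are assumptions.
import numpy as np

def rbf_features(X, anchors, gamma=0.5):
    """Map each row of X to kernel similarities against the anchor points."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def hash_bits(X, anchors, projection, gamma=0.5):
    """Binary codes: sign of a linear projection of the kernel features."""
    K = rbf_features(X, anchors, gamma)
    K -= K.mean(axis=0)                     # center before taking signs
    return (K @ projection > 0).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.random((100, 16))
anchors = X[rng.choice(100, size=10, replace=False)]
P = rng.normal(size=(10, 32))               # 32-bit codes
print(hash_bits(X, anchors, P).shape)
```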

  • Spatial Approximate String Search

    Page(s): 1394 - 1409

    This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists with reference-node-based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
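
    The pruning argument above rests on q-grams and min-wise signatures. As a hedged illustration of those two building blocks only (not of the MHR-tree or RSASSOL), the sketch below extracts q-grams from strings and compares MinHash signatures to estimate their set resemblance.

```python
# Hedged sketch: q-gram sets and min-wise (MinHash) signatures for estimating
# string set resemblance; the MHR-tree node signatures themselves are not shown.
import hashlib

def qgrams(s, q=2):
    """Set of all q-length substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minhash_signature(items, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{seed}:{it}".encode()).hexdigest(), 16)
                       for it in items))
    return sig

def estimated_resemblance(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a, b = qgrams("theatre"), qgrams("theater")
print(estimated_resemblance(minhash_signature(a), minhash_signature(b)))
```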

  • SVStream: A Support Vector-Based Algorithm for Clustering Data Streams

    Page(s): 1410 - 1424

    In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which is based on support vector domain description and support vector clustering. In the proposed algorithm, the data elements of a stream are mapped into a kernel space, and the support vectors are used as the summary information of the historical elements to construct cluster boundaries of arbitrary shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained, each describing the corresponding data domain presented in the data stream. By allowing for bounded support vectors (BSVs), the proposed SVStream algorithm is capable of identifying overlapping clusters. A BSV decaying mechanism is designed to automatically detect and remove outliers (noise). We perform experiments over synthetic and real data streams, with the overlapping, evolving, and noise situations taken into consideration. Comparison results with state-of-the-art data stream clustering methods demonstrate the effectiveness and efficiency of the proposed method.
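
    Support vector domain description, the summary structure used above, keeps only the support vectors that describe the data of a window in kernel space. The sketch below uses scikit-learn's OneClassSVM, a closely related kernel model, to extract such a summary for a single window of elements; the multi-sphere maintenance and BSV decaying mechanism of SVStream are not reproduced.

```python
# Hedged sketch: summarizing one window of stream elements by the support
# vectors of a kernelized domain description (here via OneClassSVM, a closely
# related model). SVStream's multi-sphere maintenance and BSV decay are omitted.
import numpy as np
from sklearn.svm import OneClassSVM

def window_summary(window, nu=0.1, gamma=0.5):
    """Fit a domain description and return its support vectors as the summary."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(window)
    return model.support_vectors_, model

rng = np.random.default_rng(0)
window = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(5, 1, (80, 2))])
summary, model = window_summary(window)
print(summary.shape, (model.predict(window) == -1).sum())  # outliers flagged as -1
```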

  • The Move-Split-Merge Metric for Time Series

    Page(s): 1425 - 1438

    A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric uses three fundamental operations as building blocks: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series. A Move operation changes the value of a single element, a Split operation converts a single element into two consecutive elements, and a Merge operation merges two consecutive elements into one. Each operation has an associated cost, and the MSM distance between two time series is defined as the cost of the cheapest sequence of operations that transforms the first time series into the second. An efficient, quadratic-time algorithm is provided for computing the MSM distance. MSM has the desirable properties of being a metric, in contrast to the Dynamic Time Warping (DTW) distance, and of being invariant to the choice of origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments with public time series data sets demonstrate that MSM is a meaningful distance measure that often leads to lower nearest neighbor classification error rates than DTW and ERP.
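
    The quadratic-time computation mentioned above is a dynamic program over the three operations. The sketch below follows the commonly published MSM recurrence with a single cost constant c for Split and Merge operations; the constant and the boundary handling are parameters of the sketch and may differ in minor details from the paper's exact formulation.

```python
# Hedged sketch: quadratic-time dynamic program for the Move-Split-Merge
# distance, following the commonly published recurrence; c is the Split/Merge
# cost constant and is a parameter, not a value prescribed here.
def _transform_cost(new, x, y, c):
    """Cost of introducing value `new` next to x when the other series is at y."""
    if min(x, y) <= new <= max(x, y):
        return c
    return c + min(abs(new - x), abs(new - y))

def msm_distance(a, b, c=1.0):
    m, n = len(a), len(b)
    D = [[0.0] * n for _ in range(m)]
    D[0][0] = abs(a[0] - b[0])
    for i in range(1, m):
        D[i][0] = D[i - 1][0] + _transform_cost(a[i], a[i - 1], b[0], c)
    for j in range(1, n):
        D[0][j] = D[0][j - 1] + _transform_cost(b[j], a[0], b[j - 1], c)
    for i in range(1, m):
        for j in range(1, n):
            D[i][j] = min(D[i - 1][j - 1] + abs(a[i] - b[j]),                    # Move
                          D[i - 1][j] + _transform_cost(a[i], a[i - 1], b[j], c),  # Split/Merge in a
                          D[i][j - 1] + _transform_cost(b[j], a[i], b[j - 1], c))  # Split/Merge in b
    return D[m - 1][n - 1]

print(msm_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], c=0.5))
```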

  • A User-Friendly Patent Search Paradigm

    Page(s): 1439 - 1443

    As an important operation for finding existing relevant patents and validating a new patent application, patent search has attracted considerable attention recently. However, many users have limited knowledge about the underlying patents and have to use a try-and-see approach, repeatedly issuing different queries and checking the answers, which is a tedious process. To address this problem, we propose a new user-friendly patent search paradigm that helps users find relevant patents more easily and improves the user search experience. We propose three effective techniques, error correction, topic-based query suggestion, and query expansion, to improve the usability of patent search. We also study how to efficiently find relevant answers in a large collection of patents. We first partition patents into small partitions based on their topics and classes. Then, given a query, we find highly relevant partitions and answer the query within each of them. Finally, we combine the answers from each partition and generate the top-k answers to the patent search query.
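
    The partition-then-merge step described above can be sketched as: route the query to the most relevant partitions, score candidates inside each, and merge the per-partition results into a global top-k. The scoring and routing functions below are generic placeholders, not the paper's error correction, query suggestion, or expansion techniques.

```python
# Hedged sketch: partition-based top-k retrieval. The relevance and routing
# functions are generic placeholders, not the paper's specific techniques.
import heapq

def search_partitions(query, partitions, score, route, k=10, max_partitions=3):
    """partitions: {partition_id: [documents]}; score(q, doc) and route(q, pid)
    are caller-supplied relevance functions."""
    chosen = heapq.nlargest(max_partitions, partitions, key=lambda p: route(query, p))
    candidates = ((score(query, doc), doc)
                  for p in chosen for doc in partitions[p])
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])

# Toy usage with word-overlap scoring, purely for illustration.
parts = {"engines": ["turbine blade cooling", "jet engine nozzle"],
         "displays": ["flexible oled panel", "touch screen sensor"]}
overlap = lambda q, text: len(set(q.split()) & set(text.split()))
top = search_partitions("engine cooling", parts, overlap,
                        route=lambda q, p: overlap(q, " ".join(parts[p])), k=2)
print(top)
```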

  • IEEE Open Access Publishing

    Page(s): 1444
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University