Knowledge and Data Engineering, IEEE Transactions on

Issue 6 • June 2012

  • [Front cover]

    Page(s): c1
    PDF (133 KB)
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (203 KB)
    Freely Available from IEEE
  • A Knowledge-Driven Approach to Activity Recognition in Smart Homes

    Page(s): 961 - 974
    PDF (1305 KB) | HTML

    This paper introduces a knowledge-driven approach to real-time, continuous activity recognition based on multisensor data streams in smart homes. The approach goes beyond the traditional data-centric methods for activity recognition in three ways. First, it makes extensive use of domain knowledge in the life cycle of activity recognition. Second, it uses ontologies for explicit context and activity modeling and representation. Third and finally, it exploits semantic reasoning and classification for activity inferencing, thus enabling both coarse-grained and fine-grained activity recognition. In this paper, we analyze the characteristics of smart homes and Activities of Daily Living (ADL), upon which we build both context and ADL ontologies. We present a generic system architecture for the proposed knowledge-driven approach and describe the underlying ontology-based recognition process. Special emphasis is placed on semantic subsumption reasoning algorithms for activity recognition. The proposed approach has been implemented in a function-rich software system, which was deployed in a smart home research laboratory. We evaluated the proposed approach and the developed system through extensive experiments involving a number of ADL use scenarios. An average activity recognition rate of 94.44 percent was achieved, and the average runtime per recognition operation was measured as 2.5 seconds.

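    The recognition step can be pictured as ontological subsumption. The toy Python sketch below, with hypothetical ADL definitions standing in for the paper's OWL ontologies, shows how accumulating sensor observations first narrow a coarse-grained candidate set and then entail a fine-grained activity:

      # Toy stand-in for ontology-based activity recognition. Each activity is
      # defined by a set of context properties; observations accumulate as a
      # set. These definitions are illustrative, not the paper's ontology.
      ADL_ONTOLOGY = {
          "MakeTea":    {"in:kitchen", "use:kettle", "use:teabag"},
          "MakeCoffee": {"in:kitchen", "use:kettle", "use:coffee"},
          "WatchTV":    {"in:lounge",  "use:tv_remote"},
      }
      def candidates(observed):
          # coarse-grained: activities still consistent with the observations
          return [a for a, props in ADL_ONTOLOGY.items() if observed <= props]
      def recognized(observed):
          # fine-grained: activities whose full definition is now entailed
          return [a for a, props in ADL_ONTOLOGY.items() if props <= observed]
      print(candidates({"in:kitchen", "use:kettle"}))                # two candidates
      print(recognized({"in:kitchen", "use:kettle", "use:teabag"}))  # ['MakeTea']
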
  • A Unified Probabilistic Framework for Name Disambiguation in Digital Library

    Page(s): 975 - 987
    PDF (2085 KB) | HTML

    Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of people K in the disambiguation process. In this paper, we formalize the problem in a unified probabilistic framework that incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm. We also investigate a dynamic approach for estimating the number of people K. Experiments show that our proposed framework significantly outperforms four clustering-based baseline methods and two other existing methods. Experiments also indicate that the number K automatically found by our method is close to the actual number.

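    The paper estimates K dynamically inside its probabilistic framework. As a generic stand-in (not the paper's method), the sketch below picks K by scanning candidate values and scoring each clustering with the silhouette coefficient, assuming scikit-learn is available:

      # Generic baseline for choosing the number of clusters K: try each
      # candidate and keep the K with the best silhouette score.
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score
      def estimate_k(X, k_min=2, k_max=10):
          best_k, best_score = k_min, -1.0
          for k in range(k_min, min(k_max, len(X) - 1) + 1):
              labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
              score = silhouette_score(X, labels)
              if score > best_score:
                  best_k, best_score = k, score
          return best_k
      X = np.vstack([np.random.randn(30, 5) + c for c in (0, 6, 12)])  # 3 groups
      print(estimate_k(X))  # likely 3
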
  • Clustering with Multiviewpoint-Based Similarity Measure

    Page(s): 988 - 1001
    PDF (1918 KB) | HTML

    All clustering methods have to assume some cluster relationship among the data objects to which they are applied. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, the origin, while the latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

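    A minimal numpy sketch of the multiviewpoint idea: rather than measuring cosine similarity from the origin only, average the similarity of the two objects as seen from many reference points assumed to lie outside their cluster (a simplified reading of the paper's measure; the data are made up):

      import numpy as np
      def cos(u, v):
          return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
      def multiviewpoint_sim(a, b, viewpoints):
          # average similarity of a and b as seen from each viewpoint v
          return float(np.mean([cos(a - v, b - v) for v in viewpoints]))
      a, b = np.array([1.0, 0.2]), np.array([0.9, 0.4])
      views = np.random.randn(50, 2) + 5.0    # reference points outside the cluster
      print(multiviewpoint_sim(a, b, views))  # near 1: a and b agree from afar
      print(cos(a, b))                        # the single, origin-based viewpoint
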
  • Document Clustering in Correlation Similarity Measure Space

    Page(s): 1002 - 1013
    PDF (3244 KB) | HTML

    This paper presents a new spectral clustering method called correlation preserving indexing (CPI), which is performed in the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in which the correlations between the documents in the local patches are maximized while the correlations between the documents outside these patches are simultaneously minimized. Since the intrinsic geometrical structure of the document space is often embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic geometrical structure of the document space than Euclidean distance. Consequently, the proposed CPI method can effectively discover the intrinsic structures embedded in high-dimensional document space. The effectiveness of the new method is demonstrated by extensive experiments conducted on various data sets and by comparison with existing document clustering methods.

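    The correlation similarity at the heart of CPI can be illustrated with Pearson correlation, which, unlike Euclidean distance, is invariant to shifting and scaling of term vectors; the example data below are made up:

      import numpy as np
      def corr_sim(x, y):
          # Pearson correlation between two term vectors
          xc, yc = x - x.mean(), y - y.mean()
          return float(xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc) + 1e-12)
      d1 = np.array([3.0, 0.0, 1.0, 2.0])  # term counts of document 1
      d2 = 2.0 * d1 + 5.0                  # same profile, different scale/offset
      print(corr_sim(d1, d2))              # 1.0: maximal correlation
      print(np.linalg.norm(d1 - d2))       # yet a large Euclidean distance
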
  • Efficient Extended Boolean Retrieval

    Page(s): 1014 - 1024
    PDF (976 KB) | HTML

    Extended Boolean retrieval (EBR) models were proposed nearly three decades ago, but have had little practical impact, despite their significant advantages compared to either ranked keyword or pure Boolean retrieval. In particular, EBR models produce meaningful rankings; their query model allows the representation of complex concepts in an and-or format; and they are scrutable, in that the score assigned to a document depends solely on the content of that document, unaffected by any collection statistics or other external factors. These characteristics make EBR models attractive in domains typified by medical and legal searching, where the emphasis is on iterative development of reproducible complex queries of dozens or even hundreds of terms. However, EBR is much more computationally expensive than the alternatives. We consider the implementation of the p-norm approach to EBR, and demonstrate that ideas used in the max-score and wand exact optimization techniques for ranked keyword retrieval can be adapted to allow selective bypass of documents via a low-cost screening process for this and similar retrieval models. We also propose term-independent bounds that are able to further reduce the number of score calculations for short, simple queries under the extended Boolean retrieval model. Together, these methods yield an overall saving from 50 to 80 percent of the evaluation cost on test queries drawn from biomedical search.

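    The p-norm scoring functions themselves are standard (Salton et al.); the sketch below implements the unweighted soft AND/OR operators, where p = 1 reduces to simple averaging and large p approaches strict Boolean semantics. The query and term weights are made-up placeholders:

      def or_p(xs, p):
          # soft OR over per-term weights in [0, 1]
          return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)
      def and_p(xs, p):
          # soft AND: dual of or_p on the complements
          return 1.0 - (sum((1.0 - x) ** p for x in xs) / len(xs)) ** (1.0 / p)
      # Query (heart AND attack) OR treatment, with hypothetical term weights:
      doc = {"heart": 0.8, "attack": 0.6, "treatment": 0.0}
      p = 2.0
      score = or_p([and_p([doc["heart"], doc["attack"]], p), doc["treatment"]], p)
      print(round(score, 3))
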
  • Locally Discriminative Coclustering

    Page(s): 1025 - 1035
    PDF (1067 KB) | HTML

    Unlike traditional one-sided clustering techniques, coclustering makes use of the duality between samples and features to partition them simultaneously. Most existing coclustering algorithms focus on modeling the relationship between samples and features, whereas the intersample and interfeature relationships are ignored. In this paper, we propose a novel coclustering algorithm named Locally Discriminative Coclustering (LDCC) to explore the relationship between samples and features as well as the intersample and interfeature relationships. Specifically, the sample-feature relationship is modeled by a bipartite graph between samples and features, and we apply local linear regression to discover the intrinsic discriminative structures of both the sample space and the feature space. For each local patch in the sample and feature spaces, a local linear function is estimated to predict the labels of the points in that patch. The intersample and interfeature relationships are thus captured by minimizing the fitting errors of all the local linear functions. In this way, LDCC groups strongly associated samples and features together, while respecting the local structures of both sample and feature spaces. Our experimental results on several benchmark data sets demonstrate the effectiveness of the proposed method.

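    The local-learning ingredient can be sketched as follows: for each patch (a point plus its nearest neighbors), fit a ridge-regularized linear predictor of the points' labels and accumulate the fitting error; labelings that are locally linearly predictable incur a small total error. This is a simplified stand-in for LDCC's formulation, with made-up data, not the algorithm itself:

      import numpy as np
      def local_fit_error(X, y, k=5, ridge=1e-3):
          total = 0.0
          for i in range(len(X)):
              idx = np.argsort(((X - X[i]) ** 2).sum(axis=1))[:k]  # patch = kNN
              A = np.hstack([X[idx], np.ones((k, 1))])             # affine model
              w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y[idx])
              total += float(((A @ w - y[idx]) ** 2).sum())        # fitting error
          return total
      X = np.random.randn(60, 3)
      y_smooth = X[:, 0]                 # locally predictable labels
      y_noise = np.random.randn(60)      # arbitrary labels
      print(local_fit_error(X, y_smooth) < local_fit_error(X, y_noise))  # True
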
  • Low-Rank Kernel Matrix Factorization for Large-Scale Evolutionary Clustering

    Page(s): 1036 - 1050
    PDF (1671 KB) | HTML

    Traditional clustering techniques are inapplicable to problems where the relationships between data points evolve over time. Not only is it important for the clustering algorithm to adapt to recent changes in the evolving data, but it also needs to take the historical relationships between the data points into consideration. In this paper, we propose ECKF, a general framework for evolutionary clustering of large-scale data based on low-rank kernel matrix factorization. To the best of our knowledge, this is the first work that clusters large evolutionary data sets through the amalgamation of low-rank matrix approximation methods and matrix factorization-based clustering. Since the low-rank approximation provides a compact representation of the original matrix, and, in particular, the near-optimal low-rank approximation can preserve the sparsity of the original data, ECKF gains computational efficiency and hence is applicable to large evolutionary data sets. Moreover, matrix factorization-based methods have been shown to effectively cluster high-dimensional data in text mining and multimedia data analysis. From a theoretical standpoint, we mathematically prove the convergence and correctness of ECKF and provide a detailed analysis of its computational efficiency (both time and space). Through extensive experiments performed on synthetic and real data sets, we show that ECKF outperforms existing methods in evolutionary clustering.

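    A minimal sketch of the low-rank kernel idea ECKF builds on, using the standard Nyström approximation: sample m landmark points and reconstruct the full kernel from an n-by-m slice, so the full n-by-n matrix never needs to be factorized directly. ECKF itself couples such an approximation with factorization-based clustering over time; this sketch shows only the compression step, with made-up data:

      import numpy as np
      def rbf(A, B, gamma=0.5):
          # RBF kernel between the rows of A and the rows of B
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-gamma * d2)
      X = np.random.randn(200, 4)
      m = 20
      landmarks = X[np.random.choice(len(X), m, replace=False)]
      C = rbf(X, landmarks)                       # n x m slice
      W = rbf(landmarks, landmarks)               # m x m core
      K_approx = C @ np.linalg.pinv(W) @ C.T      # rank-m reconstruction
      print(np.abs(rbf(X, X) - K_approx).mean())  # small average error
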
  • Mining Web Graphs for Recommendations

    Page(s): 1051 - 1064
    PDF (1935 KB) | HTML

    With the exponential explosion of content generated on the Web, recommendation techniques have become increasingly indispensable. Innumerable kinds of recommendations are made on the Web every day, including recommendations of movies, music, images, and books, as well as query suggestions and tag recommendations. No matter what types of data sources are used, these sources can essentially be modeled as various types of graphs. In this paper, aiming to provide a general framework for mining Web graphs for recommendations, (1) we first propose a novel diffusion method that propagates similarities between different nodes and generates recommendations; (2) we then illustrate how to generalize different recommendation problems into our graph diffusion framework. The proposed framework can be utilized in many recommendation tasks on the World Wide Web, including query suggestion, tag recommendation, expert finding, image recommendation, and image annotation. Experimental analysis on large data sets shows the promise of our approach.

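    A generic similarity-diffusion sketch in the spirit of the framework; the tiny graph, restart weight, and update rule below are illustrative assumptions (essentially a random walk with restart, not the paper's exact operator):

      import numpy as np
      A = np.array([[0, 1, 1, 0],           # hypothetical item/query graph
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
      P = A / A.sum(axis=0, keepdims=True)  # column-stochastic transitions
      def diffuse(P, source, alpha=0.85, iters=50):
          s0 = np.zeros(P.shape[0]); s0[source] = 1.0
          s = s0.copy()
          for _ in range(iters):
              # propagate mass along edges, with a restart at the source
              s = alpha * P @ s + (1 - alpha) * s0
          return s
      scores = diffuse(P, source=0)
      print(np.argsort(-scores))            # nodes ranked for recommendation
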
  • Query Planning for Continuous Aggregation Queries over a Network of Data Aggregators

    Page(s): 1065 - 1079
    PDF (862 KB) | HTML | Multimedia

    Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. Typically, a user desires the value of some aggregation function over distributed data items, for example, the value of a client's portfolio, or the average of temperatures sensed by a set of sensors. In these queries, the client specifies a coherency requirement as part of the query. We present a low-cost, scalable technique to answer continuous aggregation queries using a network of aggregators of dynamic data items. In such a network, each data aggregator serves a set of data items at specific coherencies. Just as various fragments of a dynamic webpage are served by one or more nodes of a content distribution network, our technique involves decomposing a client query into subqueries and executing the subqueries on judiciously chosen data aggregators, each with its own subquery incoherency bound. We provide a technique for finding the optimal set of subqueries, with their incoherency bounds, that satisfies the client query's coherency requirement with the least number of refresh messages sent from aggregators to the client. To estimate the number of refresh messages, we build a query cost model that can be used to estimate the number of messages required to satisfy the client-specified incoherency bound. Performance results using real-world traces show that our cost-based query planning leads to queries being executed using less than one third the number of messages required by existing schemes.

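    The planning step can be miniaturized as a bound-allocation problem. Assuming a toy cost model in which subquery i needs about k_i / b_i refresh messages under incoherency bound b_i (with k_i reflecting how dynamic its data is), minimizing total messages subject to the bounds summing to the client's bound B gives b_i proportional to sqrt(k_i). This cost model is an illustrative assumption, not the paper's:

      import math
      def allocate_bounds(dynamics, B):
          # Lagrange solution of: minimize sum(k_i / b_i) s.t. sum(b_i) = B
          roots = [math.sqrt(k) for k in dynamics]
          return [B * r / sum(roots) for r in roots]
      ks = [4.0, 1.0, 9.0]                 # hypothetical per-aggregator dynamics
      bounds = allocate_bounds(ks, B=6.0)
      print(bounds)                                  # [2.0, 1.0, 3.0]: sums to B
      print(sum(k / b for k, b in zip(ks, bounds)))  # total estimated refresh cost
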
  • Scalable Learning of Collective Behavior

    Page(s): 1080 - 1091
    PDF (1894 KB) | HTML | Multimedia

    The study of collective behavior seeks to understand how individuals behave in a social networking environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present opportunities and challenges for studying collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections present in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks demands scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while achieving prediction performance comparable to nonscalable methods.

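    The edge-centric idea in miniature: cluster edges rather than nodes, then represent each node by the (sparse) set of edge clusters it participates in. The toy below hand-assigns edge clusters in place of the paper's k-means step to keep the sketch short:

      from collections import defaultdict
      edges = [(0, 1), (1, 2), (0, 2),   # one hypothetical friendship triangle
               (3, 4), (4, 5), (3, 5),   # a second triangle
               (2, 3)]                   # a bridge between the two communities
      # Stand-in for clustering the edges (the paper clusters them at scale):
      edge_cluster = {e: 0 if max(e) <= 2 else 1 for e in edges}
      dims = defaultdict(set)            # node -> sparse set of social dimensions
      for e, c in edge_cluster.items():
          for node in e:
              dims[node].add(c)
      print(dict(dims))  # node 2 sits in both dimensions; all others in just one
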
  • Scalable Scheduling of Updates in Streaming Data Warehouses

    Page(s): 1092 - 1105
    PDF (883 KB) | HTML

    We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of interarrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. We model the streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables, and the objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r, its staleness is t minus r). We then propose a scheduling framework that handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, the inability to preempt updates, heterogeneity of update jobs caused by different interarrival times and data volumes among different sources, and transient overload. A novel feature of our framework is that scheduling decisions do not depend on properties of update jobs (such as deadlines), but rather on the effect of update jobs on data staleness. Finally, we present a suite of update scheduling algorithms and extensive simulation experiments to map out the factors that affect their performance.

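    The staleness objective is concrete: at time t, a table whose loaded data covers up to time r has staleness t - r. The sketch below runs a naive greedy scheduler that always executes the non-preemptible update promising the largest staleness reduction per unit of processing time; it illustrates the objective, not the paper's algorithms, and the tables are made up:

      def greedy_schedule(tables, horizon):
          # tables: name -> {'r': freshness time, 'cost': load duration}
          t, log = 0.0, []
          while t < horizon:
              # run the update with the best staleness reduction per unit cost
              name = max(tables, key=lambda n: (t - tables[n]['r']) / tables[n]['cost'])
              t += tables[name]['cost']      # updates cannot be preempted
              tables[name]['r'] = t          # table is now fresh up to time t
              log.append((round(t, 2), name))
          return log
      print(greedy_schedule({'orders': {'r': 0.0, 'cost': 2.0},
                             'clicks': {'r': 0.0, 'cost': 0.5}}, horizon=6.0))
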
  • Using Rule Ontology in Repeated Rule Acquisition from Similar Web Sites

    Page(s): 1106 - 1119
    PDF (2736 KB) | HTML

    Inferential rules are as essential to Semantic Web applications as ontologies. Rule acquisition is therefore an important issue, and Web sites that imply inferential rules can be a major source for it. We expect that it is easier to acquire rules from a site by reusing similar rules from other sites in the same domain than to start from scratch. We propose an automatic rule acquisition procedure that uses a rule ontology, RuleToOnto, representing information about rule components and their structures. The procedure consists of a rule component identification step and a rule composition step. We develop an A* algorithm for the rule composition and present experiments demonstrating that our ontology-based rule acquisition approach works in a real-world application.

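    The composition step uses an A* search; below is a generic A* skeleton of the kind involved, with hypothetical states, costs, and heuristic (the paper's actual state space encodes rule components identified from the site):

      import heapq
      def a_star(start, is_goal, successors, h):
          # frontier entries: (g + h, g, state, path so far)
          frontier = [(h(start), 0.0, start, [start])]
          seen = {}
          while frontier:
              f, g, state, path = heapq.heappop(frontier)
              if is_goal(state):
                  return path
              if seen.get(state, float('inf')) <= g:
                  continue                      # already expanded more cheaply
              seen[state] = g
              for nxt, step_cost in successors(state):
                  g2 = g + step_cost
                  heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
          return None
      # Toy run: compose "components" 0..3 one at a time at unit cost.
      print(a_star(0, lambda s: s == 3, lambda s: [(s + 1, 1.0)], lambda s: 3 - s))
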
  • Visual Role Mining: A Picture Is Worth a Thousand Roles

    Page(s): 1120 - 1133
    PDF (1678 KB) | HTML

    This paper offers a new role engineering approach to Role-Based Access Control (RBAC), referred to as visual role mining. The key idea is to graphically represent user-permission assignments to enable quick analysis and elicitation of meaningful roles. First, we formally define the problem by introducing a metric for the quality of the visualization. Then, we prove that finding the best representation according to the defined metric is an NP-hard problem. In turn, we propose two algorithms: ADVISER and EXTRACT. The former is a heuristic used to best represent the user-permission assignments of a given set of roles. The latter is a fast probabilistic algorithm that, when used in conjunction with ADVISER, allows for a visual elicitation of roles even in the absence of predefined roles. Besides being rooted in sound theory, our proposal is supported by extensive simulations run over real data. The results confirm the quality of the proposal and demonstrate its viability in supporting role engineering decisions.

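    The visualization idea in miniature: reorder the rows (users) and columns (permissions) of the binary user-permission matrix so that identical assignment patterns become contiguous blocks the eye reads as candidate roles. The lexicographic sort below is a crude stand-in for the ADVISER heuristic, over a made-up matrix:

      import numpy as np
      UPA = np.array([[1, 0, 1, 0],       # hypothetical user-permission matrix
                      [0, 1, 0, 1],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 0, 1, 1]])
      row_order = np.lexsort(UPA.T[::-1])  # sort users by permission pattern
      col_order = np.lexsort(UPA[::-1])    # sort permissions by user pattern
      print(UPA[row_order][:, col_order])  # contiguous blocks of 1s suggest roles
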
  • Weakly Supervised Joint Sentiment-Topic Detection from Text

    Page(s): 1134 - 1145
    PDF (1414 KB) | HTML

    Sentiment analysis, or opinion mining, aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called the joint sentiment-topic (JST) model, based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. Moreover, unlike supervised approaches to sentiment classification, which often fail to produce satisfactory performance when shifting to other domains, the weakly supervised nature of JST makes it highly portable to other domains. This is verified by experimental results on data sets from five different domains, where the JST model even outperforms existing semi-supervised approaches on some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiments detected by JST are indeed coherent and informative. We hypothesize that the JST model can readily meet the demand of large-scale sentiment analysis from the Web in an open-ended fashion.

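    The JST generative process is easy to simulate; the numpy sketch below draws, for each word, a sentiment label from the document's sentiment mixture, then a topic conditioned on that sentiment, and finally the word. All dimensions and priors are made-up placeholders, and inference (Gibbs sampling in the paper) is omitted:

      import numpy as np
      rng = np.random.default_rng(0)
      S, T, V, N = 2, 3, 100, 20                # sentiments, topics, vocab, length
      pi = rng.dirichlet(np.ones(S))            # document's sentiment mixture
      theta = rng.dirichlet(np.ones(T), size=S)       # topic mix per sentiment
      phi = rng.dirichlet(np.ones(V), size=(S, T))    # word dist per (sent, topic)
      doc = []
      for _ in range(N):
          l = rng.choice(S, p=pi)               # sentiment label first ...
          z = rng.choice(T, p=theta[l])         # ... then topic given sentiment
          w = rng.choice(V, p=phi[l, z])        # ... then the word itself
          doc.append((l, z, w))
      print(doc[:5])
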
  • Software Fault Prediction Using Quad Tree-Based K-Means Clustering Algorithm

    Page(s): 1146 - 1150
    PDF (1038 KB)

    Unsupervised techniques like clustering may be used for fault prediction in software modules, especially in cases where fault labels are not available. In this paper, a Quad Tree-based K-Means algorithm is applied to predicting faults in program modules. The aims of this paper are twofold. First, Quad Trees are applied to finding the initial cluster centers to be input to the K-Means algorithm. An input threshold parameter δ governs the number of initial cluster centers, and by varying δ the user can generate the desired number of initial cluster centers. The concept of clustering gain is used to determine the quality of clusters and to evaluate the Quad Tree-based initialization against other initialization techniques. The clusters obtained by the Quad Tree-based algorithm were found to have maximum gain values. Second, the Quad Tree-based algorithm is applied to predicting faults in program modules. The overall error rates of this prediction approach are compared to those of other existing algorithms and are found to be better in most cases.

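    A simplified sketch of the initialization idea: recursively subdivide the bounding box quadtree-style, keep centroids of sufficiently dense cells, and hand them to K-Means as initial centers. The density threshold min_pts loosely plays the role of the paper's parameter δ; the details are assumptions, and scikit-learn supplies the K-Means step:

      import numpy as np
      from sklearn.cluster import KMeans
      def quadtree_centers(X, lo, hi, min_pts=10, depth=3):
          inside = X[np.all((X >= lo) & (X < hi), axis=1)]
          if len(inside) < min_pts:
              return []                         # cell too sparse: no center
          if depth == 0:
              return [inside.mean(axis=0)]
          mid = (lo + hi) / 2.0
          centers = []
          for dx in (0, 1):                     # recurse into the 4 quadrants
              for dy in (0, 1):
                  l = np.array([lo[0] if dx == 0 else mid[0], lo[1] if dy == 0 else mid[1]])
                  h = np.array([mid[0] if dx == 0 else hi[0], mid[1] if dy == 0 else hi[1]])
                  centers += quadtree_centers(X, l, h, min_pts, depth - 1)
          return centers or [inside.mean(axis=0)]
      X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 8])
      C = np.array(quadtree_centers(X, X.min(0), X.max(0) + 1e-9))
      km = KMeans(n_clusters=len(C), init=C, n_init=1).fit(X)
      print(len(C), km.inertia_)
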
  • What's new in Transactions [advertisement]

    Page(s): 1151
    PDF (348 KB)
    Freely Available from IEEE
  • Stay Connected with the IEEE Computer Society [advertisement]

    Page(s): 1152
    PDF (300 KB)
    Freely Available from IEEE
  • [Inside back cover]

    Page(s): c3
    PDF (203 KB)
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (133 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University