
IEEE Transactions on Knowledge and Data Engineering

Issue 3 • March 2009

Contents (16 items)
  • [Front cover]

    Page(s): c1
    PDF (158 KB)
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (125 KB)
    Freely Available from IEEE
  • Improving Personalization Solutions through Optimal Segmentation of Customer Bases

    Page(s): 305 - 320
    PDF (2205 KB) | HTML

    On the Web, where search costs are low and the competition is just a mouse click away, it is crucial to segment customers intelligently in order to offer them more targeted and personalized products and services. Traditionally, customer segmentation is achieved using statistics-based methods that compute a set of statistics from the customer data and group customers into segments by applying distance-based clustering algorithms in the space of these statistics. In this paper, we present a direct grouping-based approach to computing customer segments that groups customers not on the basis of computed statistics, but by optimally combining the transactional data of several customers to build a data mining model of customer behavior for each group. Building customer segments then becomes a combinatorial optimization problem of finding the best partitioning of the customer base into disjoint groups. This paper shows that finding an optimal customer partition is NP-hard, proposes several suboptimal direct grouping segmentation methods, and empirically compares them with one another, with traditional statistics-based hierarchical and affinity propagation-based segmentation, and with one-to-one methods across multiple experimental conditions. It is shown that the best direct grouping method significantly dominates the statistics-based and one-to-one approaches across most of the experimental conditions, while still being computationally tractable. It is also shown that the distribution of the sizes of customer segments generated by the best direct grouping method follows a power law and that microsegmentation provides the best approach to personalization.

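    To make the direct grouping idea concrete, the following is a minimal Python sketch of a greedy variant: each customer starts in its own group, a model is (re)built from the pooled transactions of a group, and two groups are merged only while the merge improves an overall quality score. The callables build_model and score, and the greedy merging heuristic itself, are hypothetical placeholders, not the paper's actual methods.

        # Illustrative sketch of a direct grouping loop (not the paper's exact algorithm).
        from itertools import combinations

        def direct_grouping(customers, transactions, build_model, score):
            # customers: list of customer ids
            # transactions: dict customer id -> list of transactions
            # build_model: hypothetical callable, pooled transactions -> model
            # score: hypothetical callable, (model, pooled transactions) -> quality (higher is better)
            groups = [{c} for c in customers]          # start one-to-one

            def group_score(group):
                pooled = [t for c in group for t in transactions[c]]
                return score(build_model(pooled), pooled)

            improved = True
            while improved:
                improved = False
                best_gain, best_pair = 0.0, None
                for i, j in combinations(range(len(groups)), 2):
                    merged = groups[i] | groups[j]
                    gain = group_score(merged) - group_score(groups[i]) - group_score(groups[j])
                    if gain > best_gain:
                        best_gain, best_pair = gain, (i, j)
                if best_pair is not None:
                    i, j = best_pair
                    groups[i] |= groups[j]
                    del groups[j]
                    improved = True
            return groups
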
  • Effective and Efficient Query Processing for Video Subsequence Identification

    Page(s): 321 - 334
    PDF (1711 KB) | HTML

    With the growing demand for visual information of rich content, effective and efficient manipulation of large video databases is increasingly desired. Many investigations have been made into content-based video retrieval. However, despite its importance, video subsequence identification, which is to find content similar to a short query clip within a long video sequence, has not been well addressed. This paper presents a graph transformation and matching approach to this problem, with an extension to identify occurrences with potentially different ordering or length due to content editing. With a novel batch query algorithm to retrieve similar frames, the mapping relationship between the query and database video is first represented by a bipartite graph. The densely matched parts along the long sequence are then extracted, followed by a filter-and-refine search strategy to prune irrelevant subsequences. During the filtering stage, maximum size matching is deployed for each subgraph constructed from the query and a candidate subsequence to obtain a smaller set of candidates. During the refinement stage, sub-maximum similarity matching is devised to identify the subsequence with the highest aggregate score from all candidates, according to a robust video similarity model that incorporates visual content, temporal order, and frame alignment information. Performance studies conducted on a 50-hour video recording validate that our approach is promising in terms of both search accuracy and speed.

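    As an illustration of the filtering stage, the sketch below connects a query frame to a candidate frame whenever their visual distance is below a threshold and then computes a maximum bipartite matching; a candidate subsequence is pruned when too few query frames can be matched. The frame_distance function and both thresholds are placeholders, not the paper's similarity model.

        # Sketch of a maximum-matching filter for candidate subsequences.
        def maximum_matching_size(query_frames, cand_frames, frame_distance, eps):
            # connect query frame i to candidate frame j when the two frames are similar
            adj = [[j for j, g in enumerate(cand_frames) if frame_distance(q, g) <= eps]
                   for q in query_frames]
            match_of_cand = [-1] * len(cand_frames)      # candidate frame -> matched query frame

            def try_augment(i, visited):                 # Kuhn's augmenting-path step
                for j in adj[i]:
                    if j in visited:
                        continue
                    visited.add(j)
                    if match_of_cand[j] == -1 or try_augment(match_of_cand[j], visited):
                        match_of_cand[j] = i
                        return True
                return False

            return sum(try_augment(i, set()) for i in range(len(query_frames)))

        def passes_filter(query_frames, cand_frames, frame_distance, eps=0.3, min_cover=0.6):
            # prune the candidate when too few query frames can be covered by the matching
            size = maximum_matching_size(query_frames, cand_frames, frame_distance, eps)
            return size >= min_cover * len(query_frames)
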
  • Automatically Determining the Number of Clusters in Unlabeled Data Sets

    Page(s): 335 - 350
    PDF (5263 KB) | HTML

    Clustering is a popular tool for exploratory data analysis. One of the major problems in cluster analysis is determining the number of clusters in unlabeled data, which is a basic input for most clustering algorithms. In this paper, we investigate a new method called DBE (dark block extraction) for automatically estimating the number of clusters in unlabeled data sets. DBE builds on an existing algorithm for visual assessment of cluster tendency (VAT) of a data set and uses several common image and signal processing techniques. Its basic steps include: 1) generating a VAT image of an input dissimilarity matrix; 2) performing image segmentation on the VAT image to obtain a binary image, followed by directional morphological filtering; 3) applying a distance transform to the filtered binary image and projecting the pixel values onto the main diagonal axis of the image to form a projection signal; 4) smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to decide the number of clusters. Our new DBE method is nearly "automatic", depending on just one easy-to-set parameter. Several numerical and real-world examples are presented to illustrate the effectiveness of DBE.

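    The sketch below is a rough rendering of steps 2-4, starting from an already VAT-reordered dissimilarity image (the VAT reordering and the directional morphological filtering are omitted). The mean-based threshold, the smoothing window, and the simple peak count are stand-ins for the paper's more careful choices.

        # Rough sketch of the DBE pipeline from a VAT-reordered dissimilarity matrix.
        import numpy as np
        from scipy.ndimage import distance_transform_edt

        def estimate_number_of_clusters(vat_image, window=5):
            # step 2 (simplified): binarize so that dark blocks (small dissimilarities)
            # become foreground; directional morphological filtering is omitted
            binary = vat_image < vat_image.mean()
            # step 3: distance transform, then project onto the main diagonal axis
            # by summing along anti-diagonals
            dist = distance_transform_edt(binary)
            n = dist.shape[0]
            flipped = np.fliplr(dist)
            projection = np.array([flipped.diagonal(k).sum() for k in range(n - 1, -n, -1)])
            # step 4: smooth, differentiate, and count major peaks (one per dark block)
            smooth = np.convolve(projection, np.ones(window) / window, mode="same")
            deriv = np.diff(smooth)
            peaks = np.sum((deriv[:-1] > 0) & (deriv[1:] <= 0))
            return int(peaks)
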
  • Efficient Processing of Metric Skyline Queries

    Page(s): 351 - 365
    PDF (2942 KB) | HTML

    The skyline query is of great importance in many applications, such as multi-criteria decision making and business planning. In particular, a skyline point is a data object in the database whose attribute vector is not dominated by that of any other object. Previous methods to retrieve skyline points usually assume static data objects in the database (i.e., their attribute vectors are fixed), whereas several recent works focus on skyline queries with dynamic attributes. In this paper, we propose a novel variant of skyline queries, namely the metric skyline, whose dynamic attributes are defined in a metric space (i.e., not limited to the Euclidean space). We illustrate an efficient and effective pruning mechanism to answer metric skyline queries through a metric index. Most importantly, we formalize the query performance of the metric skyline query in terms of its pruning power by a cost model, in light of which we construct an optimized metric index aiming to maximize the pruning power of metric skyline queries. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed pruning techniques as well as the constructed index in answering metric skyline queries.

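    For orientation, a metric skyline can be stated as follows: the dynamic attribute vector of an object is its vector of metric distances to the query objects, and an object is in the skyline if no other object is at least as close to every query object and strictly closer to at least one. The quadratic-time sketch below only illustrates this dominance test; the index-based pruning that is the paper's contribution is not reproduced.

        # Naive metric skyline: attribute vector of o is (d(o, q1), ..., d(o, qm)).
        import math

        def dominates(u, v):
            # u dominates v if u is no worse in every dimension and strictly better in one
            return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

        def metric_skyline(objects, queries, dist):
            vecs = [tuple(dist(o, q) for q in queries) for o in objects]
            return [objects[i] for i in range(len(objects))
                    if not any(dominates(vecs[j], vecs[i]) for j in range(len(objects)) if j != i)]

        # toy example with Euclidean distance as the metric
        print(metric_skyline([(1, 1), (2, 5), (5, 2), (4, 4)], [(0, 0), (6, 0)], math.dist))
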
  • On the Effect of Location Uncertainty in Spatial Querying

    Page(s): 366 - 383
    Multimedia
    PDF (3373 KB) | HTML

    An emerging topic in the field of spatial data management is the handling of location uncertainty of spatial objects, mainly due to inaccurate measurements. The literature on location uncertainty so far has focused on modifying traditional spatial search algorithms in order to handle the impact of objects' location uncertainty on query results. In this paper, we present the first, to the best of our knowledge, theoretical analysis that estimates the average number of false hits introduced in the results of rectangular range queries in the case of data points uniformly distributed in 2D space. Then, we relax the original distribution assumptions, showing how to deal with arbitrarily distributed data points and more realistic location uncertainty distributions. The accuracy of the results of our analytical approach is demonstrated through an extensive experimental study using various synthetic and real datasets. Our proposal can be directly employed in spatial database systems in order to provide users with the accuracy of spatial query results based only on known dataset and query parameters.

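    The quantity being analyzed can be illustrated with a small Monte Carlo experiment: true locations are uniform in the unit square, each reported location is perturbed by a bounded error, and a false hit (or miss) occurs when the reported and true locations disagree about containment in a rectangular query. The error model and all parameters below are arbitrary illustrative choices, not the paper's assumptions or its analytical formula.

        # Monte Carlo illustration of false hits caused by location uncertainty.
        import random

        def count_false_hits(n=100_000, eps=0.01, query=(0.2, 0.6, 0.3, 0.7), seed=0):
            qx1, qx2, qy1, qy2 = query
            inside = lambda x, y: qx1 <= x <= qx2 and qy1 <= y <= qy2
            rng = random.Random(seed)
            count = 0
            for _ in range(n):
                x, y = rng.random(), rng.random()          # true location, uniform in unit square
                rx, ry = x + rng.uniform(-eps, eps), y + rng.uniform(-eps, eps)   # reported location
                if inside(x, y) != inside(rx, ry):         # query answers disagree: false hit or miss
                    count += 1
            return count

        # For small eps the count grows roughly in proportion to eps and to the query perimeter.
        print(count_false_hits())
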
  • Distributed Skyline Retrieval with Low Bandwidth Consumption

    Page(s): 384 - 400
    PDF (3530 KB) | HTML

    We consider skyline computation when the underlying data set is horizontally partitioned onto geographically distant servers that are connected to the Internet. The existing solutions are not suitable for our problem, because they have at least one of the following drawbacks: (1) applicable only to distributed systems adopting vertical partitioning or restricted horizontal partitioning, (2) effective only when each server has limited computing and communication abilities, and (3) optimized only for skyline search in subspaces but inefficient in the full space. This paper proposes an algorithm, called feedback-based distributed skyline (FDS), to support arbitrary horizontal partitioning. FDS aims at minimizing the network bandwidth, measured as the number of tuples transmitted over the network. The core of FDS is a novel feedback-driven mechanism, where the coordinator iteratively transmits certain feedback to each participant. Participants can leverage such information to prune a large amount of local data, which otherwise would need to be sent to the coordinator. Extensive experimentation confirms that FDS significantly outperforms alternative approaches in both effectiveness and progressiveness.

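    The feedback idea can be simulated in a few lines: participants first ship their local skylines, the coordinator merges them and broadcasts its current skyline as feedback, and participants then send only local tuples not dominated by that feedback. This single-round toy is a simplification; the actual FDS protocol is iterative and considerably more refined.

        # Toy single-process simulation of feedback-driven pruning for distributed skylines.
        def dominates(u, v):
            return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

        def skyline(points):
            return [p for i, p in enumerate(points)
                    if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

        def distributed_skyline(partitions):
            # round 1: every participant ships its local skyline to the coordinator
            local_skylines = [skyline(part) for part in partitions]
            feedback = skyline([p for sky in local_skylines for p in sky])
            # round 2: the coordinator broadcasts the feedback; participants send only
            # tuples that are not dominated by any feedback point
            survivors = [p for part in partitions for p in part
                         if not any(dominates(f, p) for f in feedback)]
            return skyline(survivors)

        print(distributed_skyline([[(1, 9), (4, 4), (7, 7)], [(2, 6), (8, 1), (9, 9)]]))
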
  • semQA: SPARQL with Idempotent Disjunction

    Page(s): 401 - 414
    PDF (2039 KB) | HTML

    The SPARQL LeftJoin abstract operator is not distributive over Union; this limits the algebraic manipulation of graph patterns, which in turn restricts the ability to create query plans for distributed processing or query optimization. In this paper, we present semQA, an algebraic extension for the SPARQL query language for RDF, which overcomes this issue by transforming graph patterns through the use of an idempotent disjunction operator Or as a substitute for Union. This permits the application of a set of equivalences that transform a query into distinct forms. We further present an algorithm to derive the solution set of the original query from the solution set of a query where Union has been substituted by Or. We also analyze the combined complexity of SPARQL, proving it to be NP-complete. It is also shown that the SPARQL query language is not, in the general case, fixed-parameter tractable. Experimental results are presented to validate the query evaluation methodology presented in this paper against the SPARQL standard, to corroborate the complexity analysis, and to illustrate the gains in processing cost reduction that can be obtained through the application of semQA.

  • Detecting, Assessing and Monitoring Relevant Topics in Virtual Information Environments

    Page(s): 415 - 427
    PDF (1341 KB) | HTML

    The ability to assess the relevance of topics and related sources in information-rich environments is key to success when scanning business environments. This paper introduces a hybrid system to support managerial information gathering. The system is made up of three components: 1) a hierarchical hyperbolic SOM for structuring the information environment and visualizing the intensity of news activity with respect to identified topics; 2) a spreading activation network for selecting the most relevant information sources with respect to an already existing knowledge infrastructure; and 3) measures of interestingness for association rules, together with statistical testing, for monitoring already identified topics. Embedding the system in a framework describing three modes of human information seeking behavior supports an active organization, exploration, and selection of information that matches the needs of decision makers in all stages of the information gathering process. By applying our system in the domain of the hotel industry, we demonstrate how typical information gathering tasks are supported. Moreover, we present an empirical study investigating the effectiveness and efficiency of the visualization framework of our system.

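    The second component, spreading activation, can be sketched generically: seed nodes representing the existing knowledge infrastructure receive initial activation, activation flows along weighted edges with a decay factor, and the highest-activated sources are selected. The graph, weights, decay, and iteration count below are illustrative placeholders, not the paper's configuration.

        # Minimal spreading-activation pass over a weighted graph of sources/topics.
        def spread_activation(edges, seeds, decay=0.8, iterations=3):
            # edges: dict node -> list of (neighbor, weight); seeds: dict node -> initial activation
            activation = dict(seeds)
            for _ in range(iterations):
                updated = dict(activation)
                for node, level in activation.items():
                    for neighbor, weight in edges.get(node, []):
                        updated[neighbor] = updated.get(neighbor, 0.0) + decay * weight * level
                activation = updated
            return sorted(activation.items(), key=lambda item: -item[1])

        graph = {"hotel": [("booking site", 0.9), ("review portal", 0.6)],
                 "booking site": [("price watch", 0.7)]}
        print(spread_activation(graph, {"hotel": 1.0}))
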
  • Distributional Features for Text Categorization

    Page(s): 428 - 442
    PDF (5172 KB) | HTML

    Text categorization is the task of assigning predefined categories to natural language text. With the widely used 'bag of words' representation, previous research usually assigns a word a value indicating whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features; they include the compactness of the appearances of the word and the position of the word's first appearance. The proposed distributional features are exploited by a tf-idf style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.

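    A minimal sketch of the two distributional features named above: the position of a word's first appearance and the compactness of its appearances, here taken as the standard deviation of its occurrence positions, both normalized by document length. The exact formulas and the tf-idf style combination used in the paper may differ.

        # Sketch of distributional features for one word in one document.
        from statistics import pstdev

        def distributional_features(tokens, word):
            positions = [i for i, tok in enumerate(tokens) if tok == word]
            if not positions:
                return None
            n = len(tokens)
            return {
                "first_appearance": positions[0] / n,     # earlier appearance -> smaller value
                "compactness": pstdev(positions) / n,     # smaller spread -> more compact
                "term_frequency": len(positions) / n,     # traditional tf, for comparison
            }

        doc = "the hotel was clean and the hotel restaurant was good but the pool was small".split()
        print(distributional_features(doc, "hotel"))
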
  • The Development of Fuzzy Rough Sets with the Use of Structures and Algebras of Axiomatic Fuzzy Sets

    Page(s): 443 - 462
    PDF (1677 KB) | HTML

    The notion of a rough set, originally proposed by Pawlak, has undergone a number of extensions and generalizations. Dubois and Prade (1990) introduced fuzzy rough sets, which bring rough sets and fuzzy sets together within a single framework. Radzikowska and Kerre (2002) proposed a broad family of fuzzy rough sets, referred to as (phi, t)-fuzzy rough sets, which are determined by some implication operator (implicator) phi and a certain t-norm. In order to describe linguistically represented concepts coming from data available in some information system, the concept of fuzzy rough sets is redefined and further studied in the setting of the axiomatic fuzzy set (AFS) theory. Compared with the (phi, t)-fuzzy rough sets, the advantages of AFS fuzzy rough sets are twofold. First, they can be applied directly to data analysis in any information system without resorting to the details concerning the choice of the implication phi, the t-norm, and a similarity relation S. Second, such rough approximations of fuzzy concepts come with well-defined semantics and therefore offer a sound interpretation. Some examples are included to illustrate the effectiveness of the proposed construct. It is shown that the AFS fuzzy rough sets provide far higher flexibility and effectiveness in comparison with rough sets and some of their generalizations.

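    For comparison, the (phi, t)-fuzzy rough approximations of Radzikowska and Kerre can be computed directly: the lower approximation is inf_y phi(R(x, y), A(y)) and the upper approximation is sup_y t(R(x, y), A(y)). The sketch below uses the Lukasiewicz implicator and the minimum t-norm as example choices; the paper's AFS-based construction is not reproduced.

        # Reference computation of (phi, t)-fuzzy rough approximations.
        import numpy as np

        def lukasiewicz_implicator(a, b):
            return np.minimum(1.0, 1.0 - a + b)

        def fuzzy_rough_approximations(R, A):
            # R: n x n fuzzy similarity relation, A: length-n fuzzy set membership vector
            lower = np.min(lukasiewicz_implicator(R, A[np.newaxis, :]), axis=1)  # inf_y I(R(x,y), A(y))
            upper = np.max(np.minimum(R, A[np.newaxis, :]), axis=1)              # sup_y T(R(x,y), A(y))
            return lower, upper

        R = np.array([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.2],
                      [0.1, 0.2, 1.0]])
        A = np.array([1.0, 0.6, 0.0])
        print(fuzzy_rough_approximations(R, A))
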
  • IEEE Computer Society Career Center

    Page(s): 463
    PDF (308 KB)
    Freely Available from IEEE
  • Join the IEEE Computer Society [advertisement]

    Page(s): 464
    PDF (45 KB)
    Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    PDF (125 KB)
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (158 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University