
IEEE Transactions on Knowledge and Data Engineering

Issue 9 • September 2008

  • [Front cover]

    Page(s): c1
    PDF (104 KB) | Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (77 KB) | Freely Available from IEEE
  • A General Model for Sequential Pattern Mining with a Progressive Database

    Page(s): 1153 - 1167
    PDF (3271 KB) | HTML

    Although there have been many recent studies on mining sequential patterns in a static database and in a database with increasing data, these works generally do not fully explore the effect of deleting old data from the sequences in the database. When sequential patterns are generated, newly arriving patterns may not be identified as frequent sequential patterns because of the presence of old data and sequences. Even worse, obsolete sequential patterns that are no longer frequent may remain in the reported results. In practice, users are usually more interested in recent data than in old data. To capture the dynamic nature of data addition and deletion, we propose a general model of sequential pattern mining with a progressive database, in which the data may be static, inserted, or deleted. In addition, we present a progressive algorithm, Pisa (progressive mining of sequential patterns), to progressively discover sequential patterns within a defined period of interest (POI). The POI is a sliding window that continuously advances as time goes by. Pisa utilizes a progressive sequential tree to efficiently maintain the latest data sequences, discover the complete set of up-to-date sequential patterns, and delete obsolete data and patterns accordingly. The height of the proposed sequential pattern tree is bounded by the length of the POI, so the memory space required by Pisa is significantly smaller than that needed by the alternative method, direct appending (DirApp). Note that sequential pattern mining with a static database and with an incremental database are special cases of progressive sequential pattern mining: by changing the start and end times of the POI, Pisa can easily handle a static or an incremental database as well. The complexity of the proposed algorithms is analyzed. Experimental results show that Pisa not only outperforms prior methods in execution time by orders of magnitude but also exhibits graceful scalability.
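
    As a rough illustration of the POI idea (not of the Pisa tree itself), the Python sketch below keeps only the events whose timestamps fall inside the current period of interest and recounts short sequential patterns from that window; the function name, event format, and toy data are assumptions made for this example.

        from collections import defaultdict
        from itertools import combinations

        def frequent_patterns_in_poi(events, poi_start, poi_end, min_support):
            """events: iterable of (sequence_id, timestamp, item); returns the
            frequent (a, b) patterns where a occurs before b in a sequence."""
            # Keep only the events inside the POI, grouped per sequence in time order.
            window = defaultdict(list)
            for seq_id, ts, item in sorted(events, key=lambda e: e[1]):
                if poi_start <= ts <= poi_end:      # old and future data are ignored
                    window[seq_id].append(item)
            # Count each ordered pair at most once per sequence.
            support = defaultdict(set)
            for seq_id, items in window.items():
                for a, b in combinations(items, 2): # combinations preserve time order
                    support[(a, b)].add(seq_id)
            return {p: len(s) for p, s in support.items() if len(s) >= min_support}

        # Advancing the POI and re-querying makes obsolete events drop out of the window.
        events = [(1, 1, "a"), (1, 2, "b"), (2, 2, "a"), (2, 3, "b"), (1, 9, "c")]
        print(frequent_patterns_in_poi(events, 1, 5, min_support=2))  # {('a', 'b'): 2}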

  • Affect Analysis of Web Forums and Blogs Using Correlation Ensembles

    Page(s): 1168 - 1180
    PDF (4892 KB) | HTML

    Analysis of affective intensities in computer-mediated communication is important in order to allow a better understanding of online users' emotions and preferences. Despite considerable research on textual affect classification, it is unclear which features and techniques are most effective. In this study, we compared several feature representations for affect analysis, including learned n-grams and various automatically and manually crafted affect lexicons. We also proposed the support vector regression correlation ensemble (SVRCE) method for enhanced classification of affect intensities. SVRCE uses an ensemble of classifiers, each trained using a feature subset tailored toward classifying a single affect class. The ensemble is combined with affect correlation information to enable better prediction of emotive intensities. Experiments were conducted on four test beds encompassing web forums, blogs, and online stories. The results revealed that learned n-grams were more effective than lexicon-based affect representations. The findings also indicated that SVRCE outperformed comparison techniques, including Pace regression, semantic orientation, and WordNet models. Ablation testing showed that the improved performance of SVRCE was attributable to its use of feature ensembles as well as affect correlation information. A brief case study was conducted to illustrate the utility of the features and techniques for affect analysis of large archives of online discourse.
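
    A hedged sketch of the ensemble idea follows: one support vector regressor per affect class, each trained on its own feature subset, with the per-class predictions then blended through the affect correlation matrix. The feature subsets, the blending rule, and the parameter alpha are illustrative assumptions, not the paper's exact SVRCE formulation.

        import numpy as np
        from sklearn.svm import SVR

        def train_svrce(X, Y, feature_subsets):
            """X: (n_docs, n_feats); Y: (n_docs, n_affects) intensity labels;
            feature_subsets[j]: column indices used for affect class j."""
            models = [SVR().fit(X[:, cols], Y[:, j])
                      for j, cols in enumerate(feature_subsets)]
            corr = np.corrcoef(Y, rowvar=False)       # affect-affect correlations
            return models, corr

        def predict_svrce(models, corr, feature_subsets, X, alpha=0.3):
            base = np.column_stack([m.predict(X[:, cols])
                                    for m, cols in zip(models, feature_subsets)])
            # Blend each prediction with the correlation-weighted average of the others.
            weights = corr / np.abs(corr).sum(axis=1, keepdims=True)
            return (1 - alpha) * base + alpha * base @ weights.T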

  • Anonymization by Local Recoding in Data with Attribute Hierarchical Taxonomies

    Page(s): 1181 - 1194
    PDF (2355 KB) | HTML

    Individual privacy will be at risk if a published data set is not properly de-identified. k-anonymity is a major technique to de-identify a data set. Among a number of k-anonymization schemes, local recoding methods are promising for minimizing the distortion of a k-anonymity view. This paper addresses two major issues in local recoding k-anonymization in attribute hierarchical taxonomies. First, we define a proper distance metric to achieve local recoding generalization with small distortion. Second, we propose a means to control the inconsistency of attribute domains in a generalized view by local recoding. We show experimentally that our proposed local recoding method based on the proposed distance metric produces higher-quality k-anonymity tables in three quality measures than a global recoding anonymization method, Incognito, and a multidimensional recoding anonymization method, Multi. The proposed inconsistency handling method is able to balance distortion and consistency of a generalized view.
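
    The toy sketch below illustrates local recoding only in its simplest form: records whose quasi-identifier group is smaller than k are generalized one level up a value taxonomy until every group reaches size k, while records in groups that already satisfy k are left untouched. The taxonomy, data, and fixed round limit are assumptions for illustration; the paper's distance metric and inconsistency handling are not reproduced.

        from collections import Counter

        # One-attribute taxonomy: value -> parent value, with "*" as the root.
        taxonomy = {"engineer": "professional", "lawyer": "professional",
                    "nurse": "medical", "doctor": "medical",
                    "professional": "*", "medical": "*", "*": "*"}

        def local_recoding(records, k, max_rounds=10):
            records = list(records)                  # each record: tuple of QI values
            for _ in range(max_rounds):
                groups = Counter(records)
                if all(c >= k for c in groups.values()):
                    return records                   # k-anonymity reached
                # Generalize only records in undersized groups (local recoding).
                records = [tuple(taxonomy.get(v, v) for v in r) if groups[r] < k else r
                           for r in records]
            return records

        data = [("engineer",), ("engineer",), ("lawyer",), ("nurse",)]
        print(local_recoding(data, k=2))
        # [('engineer',), ('engineer',), ('*',), ('*',)] -- the two engineers already
        # form a group of size 2 and stay untouched; the singletons are generalized up.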

  • Automatic Website Summarization by Image Content: A Case Study with Logo and Trademark Images

    Page(s): 1195 - 1204
    PDF (3550 KB) | HTML

    Image-based abstraction (or summarization) of a Web site is the process of extracting the most characteristic (or important) images from it. The criteria for measuring the importance of images in Web sites are based on their frequency of occurrence, characteristics of their content, and Web link information. As a case study, this work focuses on logo and trademark images. These are important characteristic signs of corporate Web sites or of products presented there. The proposed method incorporates machine learning for distinguishing logos and trademarks from images of other categories (e.g., landscapes, faces). Because the same logo or trademark may appear many times in various forms within the same Web site, duplicates are detected and only unique logo and trademark images are extracted. These images are then ranked by importance, taking frequency of occurrence, image content, and Web link information into account. The most important logos and trademarks are finally selected to form the image-based summary of a Web site. Evaluation results of the method on real Web sites are also presented. The method has been implemented and integrated into a fully automated image-based summarization system that is accessible on the Web (www.intelligence.tuc.gr/websummarization).
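
    The ranking step alone could look roughly like the sketch below, where each de-duplicated candidate logo or trademark image receives a weighted mix of frequency of occurrence, a content score, and a link-based score; the field names and weights are invented for illustration rather than taken from the paper's trained model.

        def rank_images(images, w_freq=0.5, w_content=0.3, w_link=0.2, top_n=3):
            """images: dicts with 'url', 'freq', 'content_score', 'link_score';
            all scores are assumed to be normalized to [0, 1]."""
            scored = [(w_freq * im["freq"] + w_content * im["content_score"]
                       + w_link * im["link_score"], im["url"]) for im in images]
            # Highest combined score first; the top_n images form the summary.
            return [url for _, url in sorted(scored, reverse=True)[:top_n]]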

  • Continuous k-Means Monitoring over Moving Objects

    Page(s): 1205 - 1216
    PDF (2519 KB) | HTML

    Given a data set P, a k-means query returns k points in space (called centers), such that the average squared distance between each point in P and its nearest center is minimized. Since this problem is NP-hard, several approximate algorithms have been proposed and used in practice. In this paper, we study continuous k-means computation at a server that monitors a set of moving objects. Reevaluating k-means every time there is an object update imposes a heavy burden on the server (for computing the centers from scratch) and the clients (for continuously sending location updates). We overcome these problems with a novel approach that significantly reduces the computation and communication costs, while guaranteeing that the quality of the solution, with respect to the reevaluation approach, is bounded by a user-defined tolerance. The proposed method assigns each moving object a threshold (i.e., range) such that the object sends a location update only when it crosses the range boundary. First, we develop an efficient technique for maintaining the k-means. Then, we present mathematical formulas and algorithms for deriving the individual thresholds. Finally, we justify our performance claims with extensive experiments.
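
    In miniature, the threshold mechanism can be sketched as below: each object is assigned a radius around its last reported position and communicates only when it moves outside that radius, while the server recomputes centers from the reported positions (here with a plain Lloyd-style k-means rather than the paper's incremental maintenance). The class, radii, and random initialization are illustrative assumptions.

        import numpy as np

        class MovingObject:
            def __init__(self, pos, radius):
                self.pos = np.asarray(pos, dtype=float)
                self.reported = self.pos.copy()      # last position the server knows
                self.radius = radius                 # assigned threshold (range)

            def move(self, new_pos):
                self.pos = np.asarray(new_pos, dtype=float)
                if np.linalg.norm(self.pos - self.reported) > self.radius:
                    self.reported = self.pos.copy()  # crossed the range: send an update
                    return True                      # location update sent to the server
                return False                         # no communication needed

        def kmeans(points, k, iters=20, seed=0):
            rng = np.random.default_rng(seed)
            centers = points[rng.choice(len(points), k, replace=False)]
            for _ in range(iters):
                d = np.linalg.norm(points[:, None] - centers, axis=2)
                labels = d.argmin(axis=1)
                centers = np.array([points[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            return centers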

  • Efficient Phrase-Based Document Similarity for Clustering

    Page(s): 1217 - 1229
    PDF (2126 KB) | HTML

    In this paper, we propose a phrase-based document similarity for computing the pairwise similarities of documents based on the suffix tree document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the vector space document (VSD) model, the phrase-based document similarity naturally inherits the tf-idf term weighting scheme when computing document similarity with phrases. We apply the phrase-based document similarity to the group-average hierarchical agglomerative clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective in clustering the documents of two standard benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses that of the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially on large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.
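
    As a rough stand-in for the suffix-tree construction, the sketch below lets word n-grams (up to length 3) play the role of phrase terms, weights them with tf-idf, and feeds the resulting cosine distances to group-average (average-link) hierarchical clustering; it approximates the idea rather than reproducing the STD model.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from scipy.cluster.hierarchy import linkage, fcluster
        from scipy.spatial.distance import pdist

        def phrase_hac(docs, n_clusters):
            X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)  # phrase terms
            dist = pdist(X.toarray(), metric="cosine")   # pairwise cosine distances
            Z = linkage(dist, method="average")          # group-average HAC
            return fcluster(Z, t=n_clusters, criterion="maxclust")

        docs = ["mining of sequential patterns", "mining sequential patterns quickly",
                "affect analysis of web forums", "web forums and blogs affect analysis"]
        print(phrase_hac(docs, n_clusters=2))            # e.g. [1 1 2 2]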

  • IDD: A Supervised Interval Distance-Based Method for Discretization

    Page(s): 1230 - 1238
    PDF (1845 KB) | HTML

    This article introduces a new method for supervised discretization based on interval distances, using a novel concept of neighbourhood in the target's space. The proposed method takes into consideration the order of the class attribute, when one exists, so that it can be used with ordinal discrete classes as well as with continuous classes in regression problems. The method has proved to be very efficient in terms of accuracy and faster than the most commonly used supervised discretization methods in the literature. It is illustrated through several examples, and a comparison with other standard discretization methods is performed on three public data sets using two different learning tasks: a decision tree algorithm and SVM for regression.

  • On Combining Neuro-Fuzzy Architectures with the Rough Set Theory to Solve Classification Problems with Incomplete Data

    Page(s): 1239 - 1253
    PDF (6486 KB) | HTML

    This paper presents a new approach to fuzzy classification in the case of missing features. Rough set theory is incorporated into neuro-fuzzy structures, and a rough-neuro-fuzzy classifier is derived. The architecture of the classifier is determined by the modified indexed center of gravity (MICOG) defuzzification method. The structure of the classifier is presented in a general form that includes both the Mamdani approach and the logical approach based on genuine fuzzy implications. A theorem that allows the determination of the structures of rough-neuro-fuzzy classifiers based on MICOG defuzzification is given and proven. Specific rough-neuro-fuzzy structures based on the Larsen rule and on the Reichenbach and Kleene-Dienes implications are given in detail. Experiments show that the classifier with the Dubois-Prade fuzzy implication achieves the best performance in the case of missing features.

  • Recommendation Method for Improving Customer Lifetime Value

    Page(s): 1254 - 1263
    PDF (2090 KB) | HTML

    It is important for online stores to improve customer lifetime value (LTV) if they are to increase their profits. Conventional recommendation methods suggest items that best coincide with users' interests to maximize the purchase probability, which does not necessarily help improve LTV. We present a novel recommendation method that maximizes the probability of the LTV being improved and that applies to both measured and subscription services. Our method finds frequent purchase patterns among high-LTV users and recommends items to a new user so as to simulate the found patterns. Using survival analysis techniques, we efficiently find the patterns from log data. Furthermore, we infer a user's interests from the purchase history using maximum entropy models and use these interests to improve recommendation. Since a higher LTV is the result of greater user satisfaction, our method benefits users as well as online stores. We evaluate our method using two sets of real log data for measured and subscription services.

  • Text Document Preprocessing with the Bayes Formula for Classification Using the Support Vector Machine

    Page(s): 1264 - 1272
    PDF (1917 KB) | HTML

    This work implements an enhanced hybrid classification method that combines the naive Bayes approach and the support vector machine (SVM). The Bayes formula is used to vectorize (rather than classify) a document according to a probability distribution over the categories the document may belong to. That is, the Bayes formula gives a range of probabilities with which the document can be assigned to a predetermined set of topics (categories), such as those found in the "20 Newsgroups" data set. Using this probability distribution as the vector representing the document, the SVM can then be used to classify documents in this multidimensional space. The inadvertent dimensionality reduction caused by classifying with only the highest probability of the naive Bayes classifier is overcome by having the SVM employ all the probability values associated with every category for each document. This method can be used with any data set and shows a significant reduction in training time compared to the Lsquare method, as well as a significant improvement in classification accuracy compared to pure naive Bayes systems and to TF-IDF/SVM hybrids.
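
    The hybrid described above is easy to sketch with off-the-shelf components: a naive Bayes model turns each document into its vector of class-membership probabilities, and an SVM is then trained on those probability vectors. The data handling and default parameters below are illustrative, not the paper's exact setup.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.svm import SVC

        def fit_nb_svm(train_docs, train_labels):
            vec = CountVectorizer()
            X = vec.fit_transform(train_docs)
            nb = MultinomialNB().fit(X, train_labels)
            P = nb.predict_proba(X)              # probabilities over all categories,
            svm = SVC().fit(P, train_labels)     # not just the argmax class
            return vec, nb, svm

        def predict_nb_svm(model, docs):
            vec, nb, svm = model
            return svm.predict(nb.predict_proba(vec.transform(docs)))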

  • Cooperative Media Data Streaming with Scalable Video Coding

    Page(s): 1273 - 1281
    PDF (1644 KB) | HTML

    Peer-to-peer (P2P) streaming has become a promising approach for disseminating streaming media content from a server to a large number of interested clients. It still faces many challenges, however, such as the high churn rate of peer clients, the uplink bandwidth constraints of participating peers, and the heterogeneity of client capacities. To resolve these issues, this paper presents a new P2P streaming framework that combines the advantages of mesh-based multisource overlay networks and scalable video coding techniques, specifically multiple description coding (MDC), to improve the streaming quality of participating clients. An optimized allocation policy (OAP) algorithm is proposed for allocating the multiple descriptions. Extensive simulations show that the proposed system achieves higher quality of service through peer-assisted cooperative streaming and MDC coding. In addition, we investigate an efficient cooperative caching mechanism for the streaming service, whose goal is to provide low-latency, high-quality service through peer collaboration; the storage and replacement of cached content follow a segment-based strategy. We demonstrate the effectiveness of the proposed scheme and compare it with the traditional LRUF scheme through extensive experiments over large Internet-like topologies. The results show that the system outperforms previous schemes in resource utilization and is more robust and resilient to node departure, which demonstrates that it is well suited for quality-adaptive cooperative streaming applications.

  • GossipTrust for Fast Reputation Aggregation in Peer-to-Peer Networks

    Page(s): 1282 - 1295
    PDF (2509 KB) | HTML

    In peer-to-peer (P2P) networks, reputation aggregation and ranking are the most time-consuming and space-demanding operations. This paper proposes a new gossip protocol for fast score aggregation. We developed a Bloom filter architecture for efficient score ranking. These techniques do not require any secure hashing or fast lookup mechanism and are thus applicable to both unstructured and structured P2P networks. We report the design principles and performance results of a simulated GossipTrust reputation system. Randomized gossiping with effective use of power nodes enables lightweight aggregation and fast dissemination of global scores in O(log2 n) time steps, where n is the P2P network size. The gossip-based protocol is designed to tolerate dynamic peer joining and departure, as well as to avoid possible peer collusions. The scheme has a considerably low gossiping message overhead, i.e., O(n log2 n) messages for n nodes. Bloom filters demand at most 512 KB memory per node for a 10,000-node network. We evaluate the performance of GossipTrust with distributed P2P file-sharing and parameter-sweeping applications. The simulation results demonstrate that GossipTrust has small aggregation time, low memory demand, and high ranking accuracy. These results suggest promising advantages of using the GossipTrust system for trusted P2P applications.
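
    Gossip-style aggregation itself can be illustrated with a push-sum averaging toy: each round every node sends half of its (value, weight) pair to one random peer, and all local estimates converge to the global average in roughly O(log n) rounds. This shows only the gossip mechanism, not GossipTrust's trust weighting, power-node usage, Bloom-filter ranking, or collusion handling.

        import random

        def push_sum(local_scores, rounds):
            n = len(local_scores)
            values, weights = list(local_scores), [1.0] * n
            for _ in range(rounds):
                out_v, out_w = [0.0] * n, [0.0] * n
                for i in range(n):
                    j = random.randrange(n)          # random gossip partner
                    hv, hw = values[i] / 2, weights[i] / 2
                    out_v[i] += hv; out_w[i] += hw   # keep one half locally
                    out_v[j] += hv; out_w[j] += hw   # push the other half to peer j
                values, weights = out_v, out_w
            return [v / w for v, w in zip(values, weights)]  # per-node estimate of the mean

        random.seed(1)
        print(push_sum([0.9, 0.1, 0.5, 0.5], rounds=20))     # all estimates approach 0.5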

  • Join the IEEE Computer Society today! [advertisement]

    PDF (63 KB) | Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    PDF (77 KB) | Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (104 KB) | Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University