
IEEE Transactions on Knowledge and Data Engineering

Issue 1 • Jan. 2004

  • An efficient and scalable algorithm for clustering XML documents by structure

    Publication Year: 2004, Page(s): 82 - 96
    Cited by: Papers (47) | Patents (1)

    With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields a clustering algorithm that is efficient and effective compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
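
The abstract does not spell out the s-graph distance, so the following is only a minimal sketch, assuming an s-graph is the set of parent-to-child tag edges in a document and that the distance is the complement of the edge-set overlap; the function names and edge representation are illustrative, not the paper's definitions.

```python
# Hedged sketch: an overlap-based distance between structure graphs (s-graphs),
# each represented here as a set of (parent_tag, child_tag) edges extracted
# from an XML document. The exact metric in the paper may differ.
import xml.etree.ElementTree as ET

def s_graph(xml_text):
    """Collect the set of parent->child tag edges of an XML document."""
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges

def s_graph_distance(sg1, sg2):
    """Distance in [0, 1]: 0 when the edge sets coincide, 1 when disjoint."""
    if not sg1 and not sg2:
        return 0.0
    common = len(sg1 & sg2)
    return 1.0 - common / max(len(sg1), len(sg2))

doc_a = "<paper><title>t</title><authors><author>a</author></authors></paper>"
doc_b = "<paper><title>t</title><abstract>x</abstract></paper>"
print(s_graph_distance(s_graph(doc_a), s_graph(doc_b)))
```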

  • Performance evaluation of an optimal cache replacement policy for wireless data dissemination

    Publication Year: 2004, Page(s): 125 - 139
    Cited by: Papers (30)

    Data caching at mobile clients is an important technique for improving the performance of wireless data dissemination systems. However, variable data sizes, data updates, limited client resources, and frequent client disconnections make cache management a challenge. We propose a gain-based cache replacement policy, Min-SAUD, for wireless data dissemination when cache consistency must be enforced before a cached item is used. Min-SAUD considers several factors that affect cache performance, namely, access probability, update frequency, data size, retrieval delay, and cache validation cost. The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes. We prove that Min-SAUD achieves optimal stretch under some standard assumptions. Moreover, a series of simulation experiments has been conducted to thoroughly evaluate the performance of Min-SAUD under various system configurations. The simulation results show that, in most cases, the Min-SAUD replacement policy substantially outperforms two existing policies, namely, LRU and SAIU.
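
Min-SAUD's exact gain function is given in the paper, not in the abstract; the sketch below only illustrates how the listed factors (access probability, update frequency, size, retrieval delay, validation cost) could be folded into a gain-per-byte eviction score. The formula and field names are assumptions.

```python
# Hedged sketch: a gain-based eviction rule combining the factors the abstract
# lists. The actual Min-SAUD gain function is defined in the paper; the formula
# below is only an illustrative stand-in.
from dataclasses import dataclass

@dataclass
class Item:
    key: str
    size: int               # bytes
    access_rate: float      # estimated accesses per second
    update_rate: float      # estimated updates per second
    delay: float            # seconds to fetch the item from the server/broadcast
    validation_cost: float  # seconds to validate a cached copy before use

def gain_per_byte(it: Item) -> float:
    # Caching saves the retrieval delay on each access, but a cached copy must
    # still be validated, and updates make some cached copies useless.
    useful_fraction = it.access_rate / (it.access_rate + it.update_rate)
    saved_per_access = it.delay - it.validation_cost
    return it.access_rate * useful_fraction * saved_per_access / it.size

def choose_victim(cache: list) -> Item:
    """Evict the item whose retention yields the least gain per byte."""
    return min(cache, key=gain_per_byte)
```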

  • An enhanced concurrency control scheme for multidimensional index structures

    Publication Year: 2004, Page(s): 97 - 111
    Cited by: Papers (7)

    We propose an enhanced concurrency control algorithm that maximizes the concurrency of multidimensional index structures. The factors that deteriorate the concurrency of such index structures are node splits and minimum bounding region (MBR) updates. The properties of our concurrency control algorithm are as follows: First, to increase concurrency by avoiding lock coupling during MBR updates, we propose the PLC (partial lock coupling) technique. Second, a new MBR update method is proposed. It allows searchers to access nodes where MBR updates are being performed. Finally, our algorithm holds exclusive latches not for the whole split time but only during the physical node split, which occupies a small part of the whole split process. For performance evaluation, we implement the proposed concurrency control algorithm and one of the existing link technique-based algorithms on MIDAS-III, the storage system of the BADA-IV DBMS. We show through various experiments that our proposed algorithm outperforms the existing algorithm in terms of throughput and response time. Also, we propose a recovery protocol for our proposed concurrency control algorithm. The recovery protocol is designed to assure high concurrency and fast recovery.
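
As background for the latching discussion, here is a minimal sketch of classic lock (latch) coupling during a top-down index descent, i.e. the baseline behaviour that PLC relaxes for MBR updates; it is not the paper's PLC protocol, and the Node structure is illustrative.

```python
# Hedged sketch of classic latch (lock) coupling: during a top-down descent the
# parent latch is held only until the child latch is acquired. The paper's PLC
# technique avoids this coupling for MBR updates; this code only illustrates
# the baseline it improves on.
import threading

class Node:
    def __init__(self, entries=None, children=None):
        self.latch = threading.Lock()
        self.entries = entries or []     # data entries (leaf) or MBRs (internal)
        self.children = children or []   # child nodes; empty for a leaf

def descend_with_lock_coupling(root, choose_child):
    """Walk from the root to a leaf, never holding more than two latches."""
    current = root
    current.latch.acquire()
    while current.children:
        child = choose_child(current)
        child.latch.acquire()     # latch the child first ...
        current.latch.release()   # ... then release the parent (coupling)
        current = child
    return current                # the returned leaf is still latched by the caller
```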

  • Domain-specific Web search with keyword spices

    Publication Year: 2004, Page(s): 17 - 27
    Cited by: Papers (26)

    Domain-specific Web search engines are effective tools for reducing the difficulty experienced when acquiring information from the Web. Existing methods for building domain-specific Web search engines require human expertise or specific facilities. However, we can build a domain-specific search engine simply by adding domain-specific keywords, called "keyword spices," to the user's input query and forwarding it to a general-purpose Web search engine. Keyword spices can be effectively discovered from Web documents using machine learning technologies. The paper describes domain-specific Web search engines that use keyword spices for locating recipes, restaurants, and used cars.
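
Because the core idea is stated directly in the abstract, a sketch is straightforward: append a learned Boolean "keyword spice" to the user query before forwarding it to a general-purpose engine. The spice string and the search URL below are made-up placeholders, not a learned expression or a real engine's API.

```python
# Hedged sketch: augment a user query with a domain-specific "keyword spice"
# and forward it to a general-purpose search engine. The spice below is a
# made-up example; the paper learns such Boolean expressions from Web pages.
from urllib.parse import urlencode

RECIPE_SPICE = '(ingredients OR "preheat the oven") -review'   # hypothetical spice

def spiced_query(user_query: str, spice: str = RECIPE_SPICE) -> str:
    return f"{user_query} {spice}"

def search_url(user_query: str) -> str:
    # Any general-purpose engine's query URL would do; this one is illustrative.
    return "https://www.example-search.com/search?" + urlencode(
        {"q": spiced_query(user_query)}
    )

print(search_url("chicken with lemon"))
```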

  • PEBL: Web page classification without negative examples

    Publication Year: 2004, Page(s): 70 - 81
    Cited by: Papers (30) | Patents (1)

    Web page classification is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. The paper presents a framework, called positive example based learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called mapping-convergence (M-C), to achieve classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and the convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) which maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as the traditional SVMs.
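
A rough sketch of a mapping-convergence style loop, using scikit-learn's SVC as the internal classifier and a crude feature-frequency rule as the weak classifier; the weak rule, thresholds, and data layout (binary feature matrices) are assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: a weak classifier first carves "strong negatives" out of the
# unlabeled pool, then an SVM is retrained iteratively, moving newly predicted
# negatives from the unlabeled pool into the negative set until it stops growing.
import numpy as np
from sklearn.svm import SVC

def mapping_stage(P, U, min_support=0.3):
    """Mark unlabeled rows that share none of the frequent positive features."""
    frequent = (P.mean(axis=0) >= min_support)        # features common in P
    overlap = (U[:, frequent] > 0).sum(axis=1)
    return overlap == 0                               # "strong" negatives

def mapping_convergence(P, U, max_iters=10):
    # Assumes the mapping stage finds at least one strong negative.
    strong_neg = mapping_stage(P, U)
    N, U_rest = U[strong_neg], U[~strong_neg]
    clf = None
    for _ in range(max_iters):
        X = np.vstack([P, N])
        y = np.concatenate([np.ones(len(P)), np.zeros(len(N))])
        clf = SVC(kernel="linear").fit(X, y)
        if len(U_rest) == 0:
            break
        pred = clf.predict(U_rest)
        newly_negative = (pred == 0)
        if not newly_negative.any():
            break
        N = np.vstack([N, U_rest[newly_negative]])
        U_rest = U_rest[~newly_negative]
    return clf
```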

  • Demand-driven caching in multiuser environment

    Publication Year: 2004, Page(s): 112 - 124
    Cited by: Papers (3)

    We propose a novel demand-driven caching framework, called cache-on-demand (CoD). In CoD, intermediate/final answers of existing running queries are viewed as virtual caches that can be materialized if they are beneficial to incoming queries. Such an approach is essentially nonspeculative: the exact cost of investment and the return on investment are known, and the cache is certain to be reused. We address several issues for CoD to be realized. We also propose three optimizing strategies: Conform-CoD, Scramble-CoD, and Integrated-CoD. Conform-CoD and Scramble-CoD are based on a two-phase optimization framework, while Integrated-CoD operates in a single-phase framework. We conducted an extensive performance study to evaluate the effectiveness of these algorithms. Our results show that all the CoD-based schemes can provide substantial performance improvement when compared with a predictive scheme and a no-caching scheme.
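
The abstract's key point is that the materialization decision is nonspeculative because both its cost and its benefit are known; a tiny illustrative check, with all cost figures assumed to come from the optimizer's estimates:

```python
# Hedged sketch of the nonspeculative materialization decision: an intermediate
# result of a running query is materialized only if the known saving for the
# incoming query exceeds the known cost of writing the result out.
def should_materialize(materialization_cost: float,
                       incoming_cost_without_cache: float,
                       incoming_cost_with_cache: float) -> bool:
    saving = incoming_cost_without_cache - incoming_cost_with_cache
    return saving > materialization_cost
```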

  • Personalized Web search for improving retrieval effectiveness

    Publication Year: 2004, Page(s): 28 - 40
    Cited by: Papers (65) | Patents (20)

    Current Web search engines are built to serve all users, independent of the special needs of any individual user. Personalization of Web search carries out retrieval for each user incorporating his/her interests. We propose a novel technique to learn user profiles from users' search histories. The user profiles are then used to improve retrieval effectiveness in Web search. A user profile and a general profile are learned from the user's search history and a category hierarchy, respectively. These two profiles are combined to map a user query into a set of categories, which represent the user's search intention and serve as a context to disambiguate the words in the user's query. Web search is conducted based on both the user query and the set of categories. Several profile-learning and category-mapping algorithms and a fusion algorithm are provided and evaluated. Experimental results indicate that our technique to personalize Web search is both effective and efficient.
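
A minimal sketch of the query-to-category mapping idea, assuming both profiles are kept as per-category term-weight vectors and combined with a fixed cosine-similarity mix; the weights and representations are illustrative, not the paper's learned profiles or fusion algorithm.

```python
# Hedged sketch: score each category by combining the query's similarity to the
# user profile and to the general profile, then keep the top-k categories as
# context. The 0.7/0.3 mix and dictionary layout are illustrative assumptions.
import math

def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_categories(query_terms, user_profile, general_profile, k=3, alpha=0.7):
    q = {t: 1.0 for t in query_terms}
    scores = {
        cat: alpha * cosine(q, user_profile.get(cat, {}))
        + (1 - alpha) * cosine(q, general_profile.get(cat, {}))
        for cat in general_profile
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```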

  • Mining Web informative structures and contents based on entropy analysis

    Publication Year: 2004, Page(s): 41 - 55
    Cited by: Papers (27) | Patents (1)

    We study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by these TOC pages. Based on the Hyperlink Induced Topics Search (HITS) algorithm, we propose an entropy-based analysis mechanism (LAMIS) for analyzing the entropy of anchor texts and links to eliminate the redundancy of the hyperlinked structure so that the complex structure of a Web site can be distilled. However, to increase the value and the accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, copyright announcements, etc. To further eliminate such redundancy, we propose another mechanism, called InfoDiscoverer, which applies the distilled structure to identify sets of article pages. InfoDiscoverer also employs the entropy information to analyze the information measures of article sets and to extract informative content blocks from these sets. Our result is useful for search engines, information agents, and crawlers to index, extract, and navigate significant information from a Web site. Experiments on several real news Web sites show that the precision and the recall of our approaches are much superior to those obtained by conventional methods in mining the informative structures of news Web sites. On average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 122 to 257 percent when the desired recall falls between 0.5 and 1. In comparison with manual heuristics, the precision and the recall of InfoDiscoverer are greater than 0.956.
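
One way to read the entropy idea is sketched below: a term spread evenly over every page of a site (navigation panels, footers) gets high normalized entropy and can be treated as redundant, while terms concentrated in a few article pages score low. The feature definition and weighting are assumptions; InfoDiscoverer's block extraction is not reproduced here.

```python
# Hedged sketch: entropy of a term's occurrence distribution across the pages
# of one site, normalized to [0, 1]. Evenly spread terms approach 1 (redundant);
# terms concentrated in few pages approach 0 (potentially informative).
import math
from collections import Counter

def term_entropy(term: str, pages: list) -> float:
    if len(pages) < 2:
        return 0.0
    counts = [Counter(p.lower().split())[term] for p in pages]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c]
    h = -sum(p * math.log(p, 2) for p in probs)
    return h / math.log(len(pages), 2)

pages = ["home news sports contact",
         "news budget vote contact",
         "news match result contact"]
print(term_entropy("contact", pages), term_entropy("budget", pages))  # ~1.0 vs 0.0
```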

  • A unified probabilistic framework for Web page scoring systems

    Publication Year: 2004, Page(s): 4 - 16
    Cited by: Papers (20) | Patents (13)

    The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad-topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can only afford to visit the first positions of the returned list, which do not necessarily refer to the most appropriate answers. Some successful approaches to page ranking in a hyperlinked environment, like the Web, are based on link analysis. We propose a general probabilistic framework for Web page scoring systems (WPSS), which incorporates and extends many of the relevant models proposed in the literature. In particular, we introduce scoring systems for both generic (horizontal) and focused (vertical) search engines. Whereas horizontal scoring algorithms are based only on the topology of the Web graph, vertical ranking also takes the page contents into account and is the basis for focused and user-adapted search interfaces. Experimental results are reported to show the properties of some of the proposed scoring systems, with special emphasis on vertical search.
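
As an example of the purely topological ("horizontal") scoring such a framework subsumes, here is a conventional PageRank-style random-walk iteration; the damping factor and tolerance are standard defaults, not values from the paper, and the probabilistic WPSS machinery itself is not shown.

```python
# Hedged sketch: a topology-only page score computed as the stationary
# distribution of a random walk with uniform teleportation (PageRank-style).
# Assumes every page appears as a key in `links`.
def random_walk_scores(links: dict, damping=0.85, tol=1e-8, max_iters=100):
    pages = list(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(max_iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            targets = outs or pages            # dangling page: spread evenly
            share = damping * score[p] / len(targets)
            for q in targets:
                new[q] += share
        if sum(abs(new[p] - score[p]) for p in pages) < tol:
            return new
        score = new
    return score

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(random_walk_scores(web))
```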

  • Probabilistic memory-based collaborative filtering

    Publication Year: 2004, Page(s): 56 - 69
    Cited by: Papers (24) | Patents (7)

    Memory-based collaborative filtering (CF) has been studied extensively in the literature and has proven to be successful in various types of personalized recommender systems. In this paper, we develop a probabilistic framework for memory-based CF (PMCF). While this framework has clear links with classical memory-based CF, it allows us to find principled solutions to known problems of CF-based recommender systems. In particular, we show that a probabilistic active learning method can be used to actively query the user, thereby solving the "new user problem." Furthermore, the probabilistic framework allows us to reduce the computational cost of memory-based CF by working on a carefully selected subset of user profiles, while retaining high accuracy. We report experimental results based on two real-world data sets, which demonstrate that our proposed PMCF framework allows an accurate and efficient prediction of user preferences.
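
For orientation, a sketch of the classical memory-based prediction that PMCF builds on: a similarity-weighted average of ratings from a (possibly sampled) subset of stored profiles. The Pearson weighting below is the textbook version; the paper's probabilistic model and active-learning component are not shown.

```python
# Hedged sketch of classical memory-based CF: predict an active user's rating
# of an item from the ratings of similar stored profiles. Profiles are plain
# dicts mapping item -> rating.
def pearson_sim(u: dict, v: dict) -> float:
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    mu_u = sum(u[i] for i in common) / len(common)
    mu_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu_u) * (v[i] - mu_v) for i in common)
    du = sum((u[i] - mu_u) ** 2 for i in common) ** 0.5
    dv = sum((v[i] - mu_v) ** 2 for i in common) ** 0.5
    return num / (du * dv) if du and dv else 0.0

def predict(active: dict, item: str, profiles: list) -> float:
    """Similarity-weighted deviation from each neighbour's mean rating."""
    mu_a = sum(active.values()) / len(active)
    num = den = 0.0
    for p in profiles:
        if item not in p:
            continue
        s = pearson_sim(active, p)
        mu_p = sum(p.values()) / len(p)
        num += s * (p[item] - mu_p)
        den += abs(s)
    return mu_a + num / den if den else mu_a
```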

  • Guest editors' introduction: special section on mining and searching the web

    Publication Year: 2004, Page(s): 2 - 3
  • IEEE Transactions on Knowledge and Data Engineering - Table of Contents

    Publication Year: 2004, Page(s): 0_1
  • IEEE Computer Society's Staff List

    Publication Year: 2004, Page(s): 0_2
  • Editorial: state of the transactions

    Publication Year: 2004, Page(s): 1
    Cited by: Papers (40)
  • 2003 Reviewers list

    Publication Year: 2004, Page(s): 140 - 143
  • TKDE: Information for authors

    Publication Year: 2004, Page(s): 145
  • IEEE Computer Society's Information

    Publication Year: 2004, Page(s): 146

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University

Associate Editor-in-Chief
Xuemin Lin
University of New South Wales

Associate Editor-in-Chief
Lei Chen
Hong Kong University of Science and Technology