
IEEE Transactions on Knowledge and Data Engineering

Issue 8 • August 2007

  • [Front cover]

    Page(s): c1
    PDF (108 KB)
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (83 KB)
    Freely Available from IEEE
  • Toward Exploratory Test-Instance-Centered Diagnosis in High-Dimensional Classification

    Page(s): 1001 - 1015
    PDF (5306 KB) | HTML

    High-dimensional data is a difficult case for most subspace-based classification methods because of the large number of combinations of dimensions that have discriminatory power. This is because there are an exponential number of combinations of dimensions that could decide the correct class of an instance, and this combination could vary with data locality and the test instance. Therefore, most summarized models such as decision trees and rule-based systems only aim to provide a global summary of the data, which is used for classification. Because of this incompleteness, a particular classification model may be more or less suited to individual test instances. Furthermore, it may not provide sufficient insight into the most representative characteristics of a particular test instance. This is undesirable for many classification applications in which the diagnostic reasoning behind the classification of a test instance is as important as the classification process itself. In an interactive application, a user may find it more valuable to develop a diagnostic decision support method that can reveal significant classification behaviors of exemplar records. Such an approach has the additional advantage of being able to optimize the decision process for the individual record in order to design more effective classification methods. In this paper, we propose the subspace decision path (SD-Path) method, which provides the user with the ability to interactively explore a small number of nodes of a hierarchical decision process so that the most significant classification characteristics for a given test instance are revealed. In addition, the SD-Path method can provide considerable interpretability by constructing views of the data in which the different classes are clearly separated. Even in difficult cases where the classification behavior of the test instance is ambiguous, the SD-Path method provides a diagnostic understanding of the characteristics that result in this ambiguity. Therefore, this method combines the abilities of the human and the computer in creating an effective diagnostic tool for instance-centered high-dimensional classification.

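    As a rough, hedged illustration of the instance-centered idea, the sketch below scores candidate two-dimensional views by how unambiguous the class structure is among the test instance's nearest neighbours in that view. The function name, the purity score, and the exhaustive pairwise enumeration are assumptions for illustration only; this is not the SD-Path construction.

        import numpy as np
        from itertools import combinations

        def best_local_views(X, y, test, k=15, top=3):
            # X (n x d), y (n,) and test (d,) are NumPy arrays.
            # Rank 2-D views (pairs of dimensions) by how pure the class labels are
            # among the test instance's k nearest neighbours in that view.
            scores = {}
            for i, j in combinations(range(X.shape[1]), 2):   # O(d^2) pairs: fine for a sketch
                view = X[:, [i, j]]
                dist = np.linalg.norm(view - test[[i, j]], axis=1)
                neighbours = y[np.argsort(dist)[:k]]
                _, counts = np.unique(neighbours, return_counts=True)
                scores[(i, j)] = counts.max() / k             # 1.0 = locally unambiguous view
            return sorted(scores, key=scores.get, reverse=True)[:top]
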
  • Hot Topic Extraction Based on Timeline Analysis and Multidimensional Sentence Modeling

    Page(s): 1016 - 1025
    PDF (2756 KB) | HTML

    With the vast amount of digitized textual materials now available on the Internet, it is almost impossible for people to absorb all pertinent information in a timely manner. To alleviate the problem, we present a novel approach for extracting hot topics from disparate sets of textual documents published in a given time period. Our technique consists of two steps. First, hot terms are extracted by mapping their distribution over time. Second, based on the extracted hot terms, key sentences are identified and then grouped into clusters that represent hot topics by using multidimensional sentence vectors. The results of our empirical tests show that this approach is more effective in identifying hot topics than existing methods.

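    As a loose sketch of the first step (hot-term extraction from the temporal distribution), the code below scores each term by how concentrated its occurrences are in a single time period. The burstiness score and the frequency cut-off are assumptions rather than the paper's formulation, and the sentence-modeling and clustering step is omitted.

        from collections import Counter

        def hot_terms(docs_by_period, top_k=20, min_count=5):
            # docs_by_period: {period_label: [document strings]}
            overall, per_period = Counter(), {}
            for period, docs in docs_by_period.items():
                c = Counter(tok for doc in docs for tok in doc.lower().split())
                per_period[period] = c
                overall.update(c)
            scores = {}
            for term, total in overall.items():
                if total < min_count:          # ignore very rare terms
                    continue
                peak = max(c[term] for c in per_period.values())
                scores[term] = peak / total    # 1.0 means every occurrence falls in one period
            return sorted(scores, key=scores.get, reverse=True)[:top_k]
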
  • An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data

    Page(s): 1026 - 1041
    PDF (4555 KB) | HTML

    This paper presents a new k-means type algorithm for clustering high-dimensional objects in subspaces. In high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. For example, in text clustering, clusters of documents of different topics are categorized by different subsets of terms or keywords. The keywords for one cluster may not occur in the documents of other clusters. This is a data sparsity problem faced in clustering high-dimensional data. In the new algorithm, we extend the k-means clustering process to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. The experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. The new algorithm is also scalable to large data sets.

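    A minimal sketch of the entropy-weighting idea, assuming the commonly used softmax-style weight update: each cluster's dimension weights are recomputed from the (negated) within-cluster dispersions after every k-means iteration. The initialisation, the gamma parameter, and the stopping rule are illustrative choices, not necessarily the paper's exact formulation.

        import numpy as np

        def ewkm(X, k, gamma=1.0, n_iter=50, seed=0):
            # k-means with an extra per-cluster, per-dimension weight learned each iteration.
            X = np.asarray(X, dtype=float)
            rng = np.random.default_rng(seed)
            n, d = X.shape
            centers = X[rng.choice(n, k, replace=False)]
            weights = np.full((k, d), 1.0 / d)
            for _ in range(n_iter):
                # assign each point to the center with the smallest weighted squared distance
                dist = np.stack([((X - centers[l]) ** 2 * weights[l]).sum(axis=1) for l in range(k)])
                labels = dist.argmin(axis=0)
                for l in range(k):
                    members = X[labels == l]
                    if len(members) == 0:
                        continue
                    centers[l] = members.mean(axis=0)
                    D = ((members - centers[l]) ** 2).sum(axis=0)   # per-dimension dispersion in cluster l
                    e = np.exp(-(D - D.min()) / gamma)              # entropy-regularised weight update
                    weights[l] = e / e.sum()                        # small dispersion -> large weight
            return labels, centers, weights
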
  • Frequent Closed Sequence Mining without Candidate Maintenance

    Page(s): 1042 - 1056
    PDF (3646 KB) | HTML

    Previous studies have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent patterns but only the closed ones because the latter leads to not only a more compact yet complete result set but also better efficiency. However, most of the previously developed closed pattern mining algorithms work under the candidate maintenance-and-test paradigm, which is inherently costly in both runtime and space usage when the support threshold is low or the patterns become long. In this paper, we present BIDE, an efficient algorithm for mining frequent closed sequences without candidate maintenance. It adopts a novel sequence closure checking scheme called BI-Directional Extension and prunes the search space more deeply compared to the previous algorithms by using the BackScan pruning method. A thorough performance study with both sparse and dense, real, and synthetic data sets has demonstrated that BIDE significantly outperforms the previous algorithms: It consumes an order(s) of magnitude less memory and can be more than an order of magnitude faster. It is also linearly scalable in terms of database size.

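    To make the notion of a frequent closed sequence concrete, here is a brute-force baseline (emphatically not BIDE, and without its BI-Directional Extension check or BackScan pruning): grow patterns item by item, then keep only the patterns with no super-sequence of equal support. The input format, a list of item sequences, is an assumption.

        def is_subseq(sub, seq):
            it = iter(seq)
            return all(item in it for item in sub)   # consumes `it` left to right

        def support(pattern, db):
            return sum(is_subseq(pattern, s) for s in db)

        def frequent_closed_sequences(db, minsup):
            items = sorted({x for s in db for x in s})
            frequent, frontier = {}, [()]
            while frontier:
                nxt = []
                for p in frontier:
                    for x in items:                  # extend every frequent prefix by one item
                        q = p + (x,)
                        sup = support(q, db)
                        if sup >= minsup:
                            frequent[q] = sup
                            nxt.append(q)
                frontier = nxt
            # closed = no frequent super-sequence carries the same support
            return {p: s for p, s in frequent.items()
                    if not any(s == s2 and p != q and is_subseq(p, q)
                               for q, s2 in frequent.items())}

    On the toy database [("a", "b", "c"), ("a", "c"), ("a", "b", "c")] with minsup=2 this returns {("a", "c"): 3, ("a", "b", "c"): 2}; every other frequent sequence has a super-sequence with the same support.
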
  • Maintaining Strong Cache Consistency for the Domain Name System

    Page(s): 1057 - 1071
    PDF (3127 KB) | HTML

    Effective caching in the domain name system (DNS) is critical to its performance and scalability. Existing DNS only supports weak cache consistency by using the time-to-live (TTL) mechanism, which functions reasonably well in normal situations. However, maintaining strong cache consistency in DNS as an indispensable exception-handling mechanism has become more and more demanding for three important objectives: 1) to quickly respond to and handle exceptions such as sudden and dramatic Internet failures caused by natural and human disasters, 2) to adapt to increasingly frequent changes of Internet Protocol (IP) addresses due to the introduction of dynamic DNS techniques for various stationed and mobile devices on the Internet, and 3) to provide fine-grain controls for content delivery services to balance server load distributions in a timely manner. With agile adaptation to various exceptional Internet dynamics, strong DNS cache consistency improves the availability and reliability of Internet services. In this paper, we first conduct extensive Internet measurements to quantitatively characterize DNS dynamics. Then, we propose a proactive DNS cache update protocol (DNScup), running as middleware in DNS name servers, to provide strong cache consistency for DNS. The core of DNScup is an optimal lease scheme, called dynamic lease, to keep track of the local DNS name servers. We compare dynamic lease with other existing lease schemes through theoretical analysis and trace-driven simulations. Based on the DNS dynamic update protocol, we build a DNScup prototype with minor modifications to the current DNS implementation. Our system prototype demonstrates the effectiveness of DNScup and its easy and incremental deployment on the Internet.

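    For readers unfamiliar with lease-based consistency, the toy bookkeeping below shows the general idea the paper builds on: the authoritative server remembers which resolvers hold an unexpired lease on a record and proactively notifies only those resolvers when the record changes. The class and method names are assumptions; DNScup's dynamic-lease policy, message formats, and analysis are not reproduced here.

        import time

        class LeaseTable:
            def __init__(self):
                self.leases = {}                         # record name -> {resolver: lease expiry}

            def grant(self, name, resolver, duration):
                # resolver caches the record and receives a lease valid for `duration` seconds
                self.leases.setdefault(name, {})[resolver] = time.time() + duration

            def record_changed(self, name, notify):
                # push an update only to resolvers whose lease is still live;
                # expired leases fall back to ordinary TTL behaviour and need no message
                now = time.time()
                holders = self.leases.get(name, {})
                for resolver, expiry in list(holders.items()):
                    if expiry > now:
                        notify(resolver, name)
                    else:
                        del holders[resolver]

        # usage sketch:
        table = LeaseTable()
        table.grant("www.example.com", "resolver-1", duration=300)
        table.record_changed("www.example.com", notify=lambda r, n: print("push update for", n, "to", r))
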
  • Efficient Skyline and Top-k Retrieval in Subspaces

    Page(s): 1072 - 1088
    PDF (5388 KB) | HTML

    Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding different subsets of the dimensions. The existing algorithms are inadequate for subspace skyline/top-k search because they have at least one of the following defects: 1) they require scanning the entire database at least once, 2) they are optimized for one subspace but incur significant overhead for other subspaces, or 3) they demand expensive maintenance costs or space consumption. In this paper, we propose SUBSKY, a technique that settles both types of queries by using purely relational technologies. The core of SUBSKY is a transformation that converts multidimensional data to one-dimensional (1D) values. These values are indexed by a simple B-tree, which allows us to answer subspace queries by accessing a fraction of the database. SUBSKY entails low maintenance overhead, which equals the cost of updating a traditional B-tree. Extensive experiments with real data confirm that our technique outperforms alternative solutions significantly in both efficiency and scalability.

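    As a point of reference for what SUBSKY improves on, a naive subspace skyline simply keeps every point that no other point dominates on the chosen dimensions (here, smaller is assumed better). The quadratic scan below is that baseline; SUBSKY's 1D transformation and B-tree indexing are not shown.

        def subspace_skyline(points, dims):
            def dominates(p, q):
                # p dominates q in the subspace: no worse anywhere, strictly better somewhere
                return all(p[i] <= q[i] for i in dims) and any(p[i] < q[i] for i in dims)
            return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

        # e.g. hotels as (price, distance, noise); the skyline over (price, distance) only:
        hotels = [(120, 3.0, 40), (90, 5.5, 55), (150, 1.0, 30), (100, 6.0, 20)]
        print(subspace_skyline(hotels, dims=(0, 1)))   # keeps the first three hotels
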
  • A Method for Estimating the Precision of Placename Matching

    Page(s): 1089 - 1101
    PDF (1605 KB) | HTML

    Information in digital libraries and information systems frequently refers to locations or objects in geographic space. Digital gazetteers are commonly employed to match the referred placenames with actual locations in information integration and data cleaning procedures. This process may fail due to missing information in the gazetteer, multiple matches, or false positive matches. We have analyzed the cases of success and the reasons for failure of the mapping process to a gazetteer. Based on these, we present a statistical model that permits estimating 1) the completeness of a gazetteer with respect to the specific target area and application, 2) the expected precision and recall of one-to-one mappings of source placenames to the gazetteer, 3) the semantic inconsistency that remains in one-to-one mappings, and 4) the degree to which the precision and recall are improved given knowledge of the identity of higher levels in a hierarchy of places. The presented model is based on a statistical analysis of the mapping process itself, applied to a large set of placenames, and does not require any other background data. The statistical model assumes that a gazetteer is populated by a stochastic process. The paper discusses how future work could take deviations from this assumption into account. The method has been applied to a real case.

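    The quantities being estimated can be made concrete with a small tally of matching outcomes. One caveat: the paper's model estimates completeness, precision, and recall from the matching process alone, whereas the sketch below assumes a small labelled sample is available for the precision estimate; the gazetteer-lookup interface and all names are assumptions.

        def match_statistics(placenames, gazetteer_lookup, truth=None):
            # gazetteer_lookup(name) returns the list of candidate gazetteer entries for `name`
            counts = {"unique": 0, "ambiguous": 0, "missing": 0}
            correct = checked = 0
            for name in placenames:
                candidates = gazetteer_lookup(name)
                if not candidates:
                    counts["missing"] += 1
                elif len(candidates) == 1:
                    counts["unique"] += 1
                    if truth is not None and name in truth:   # labelled sample, if we have one
                        checked += 1
                        correct += candidates[0] == truth[name]
                else:
                    counts["ambiguous"] += 1
            completeness = 1 - counts["missing"] / max(len(placenames), 1)
            precision_of_unique = correct / checked if checked else None
            return counts, completeness, precision_of_unique
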
  • Ontology-Based Service Representation and Selection

    Page(s): 1102 - 1115
    PDF (3316 KB)

    Selecting the right parties to interact with is a fundamental problem in open and dynamic environments. The problem is amplified when the number of interacting parties is high and the parties' reasons for selecting others vary. We examine the problem of service selection in an e-commerce setting where consumer agents cooperate to identify service providers that would satisfy their service needs the most. Previous approaches to service selection are usually based on capturing and exchanging the ratings of consumers to providers. Rating-based approaches have two major weaknesses. (1) Ratings are given in a particular context. Even though the context is crucial for interpreting the ratings correctly, rating-based approaches do not provide the means to represent the context explicitly. (2) The satisfaction criteria of the rater are unknown. Without knowing the expectations of the rater, it is almost impossible to make sense of a rating. We deal with these two weaknesses in two steps. First, we extend a classical rating-based approach by adding a representation of context. This addition improves the accuracy of selected service providers only when two consumers with the same service request are assumed to be satisfied with the same service. Next, we replace ratings with detailed experiences of consumers. The experiences are represented with an ontology that can capture the requested service and the received service in detail. When a service consumer decides to share her experiences with a second service consumer, the receiving consumer evaluates the experience by using her own context and satisfaction criteria. By sharing experiences rather than ratings, the service consumers can model service providers more accurately and, thus, can select service providers that are better suited to their needs.

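    The contrast between shared ratings and shared experiences fits in a few lines: a received experience is re-scored with the receiving consumer's own satisfaction criteria instead of reusing the sharer's number. This is a toy, dictionary-based illustration; the paper represents requested and received services with an ontology, and the names and attributes here are assumptions.

        from dataclasses import dataclass

        @dataclass
        class Experience:
            provider: str
            requested: dict      # what the sharing consumer asked for
            received: dict       # what was actually delivered

        def personal_score(experience, my_satisfaction):
            # re-evaluate someone else's experience with *my own* satisfaction criteria
            return my_satisfaction(experience.requested, experience.received)

        exp = Experience("provider-A", {"bandwidth": 100, "price": 20}, {"bandwidth": 80, "price": 20})
        bandwidth_focused = lambda req, got: got["bandwidth"] / req["bandwidth"]
        price_focused = lambda req, got: 1.0 if got["price"] <= req["price"] else 0.0
        print(personal_score(exp, bandwidth_focused), personal_score(exp, price_focused))  # 0.8 1.0
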
  • A Family of Directional Relation Models for Extended Objects

    Page(s): 1116 - 1130
    PDF (2861 KB) | HTML

    In this paper, we introduce a family of expressive models for qualitative spatial reasoning with directions. The proposed family is based on the cognitively plausible cone-based model. We formally define the directional relations that can be expressed in each model of the family. Then, we use our formal framework to study two interesting problems: computing the inverse of a directional relation and composing two directional relations. For the composition operator, in particular, we concentrate on two commonly used definitions, namely, consistency-based and existential composition. Our formal framework allows us to prove that our solutions are correct. The presented solutions are handled in a uniform manner and apply to all of the models of the family.

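    A minimal, point-based version of the cone model makes the inverse operation concrete: the plane around a reference point is split into 45-degree cones, and the inverse of a relation is simply the relation seen from the other object. The paper's models additionally handle extended objects and composition; the sector arithmetic and names below are assumptions.

        import math

        DIRS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
        INVERSE = {"E": "W", "NE": "SW", "N": "S", "NW": "SE",
                   "W": "E", "SW": "NE", "S": "N", "SE": "NW", "SAME": "SAME"}

        def cone_direction(reference, target):
            # direction of `target` as seen from `reference`, using 45-degree cones
            dx, dy = target[0] - reference[0], target[1] - reference[1]
            if dx == 0 and dy == 0:
                return "SAME"
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            return DIRS[int(((angle + 22.5) % 360.0) // 45.0)]

        # the inverse relation is just the direction seen from the other end:
        assert INVERSE[cone_direction((0, 0), (3, 4))] == cone_direction((3, 4), (0, 0))
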
  • On Three Types of Covering-Based Rough Sets

    Page(s): 1131 - 1144
    PDF (308 KB) | HTML

    Rough set theory is a useful tool for data mining. It is based on equivalence relations and has been extended to covering-based generalized rough sets. This paper studies three kinds of covering generalized rough sets for dealing with vagueness and granularity in information systems. First, we examine the properties of approximation operations generated by a covering in comparison with those of Pawlak's rough sets. Then, we propose concepts and conditions for two coverings to generate an identical lower approximation operation and an identical upper approximation operation. After discussing the interdependency of covering lower and upper approximation operations, we address the axiomatization issue of covering lower and upper approximation operations. In addition, we study the relationships between the covering lower approximation and the interior operator, and also the relationships between the covering upper approximation and the closure operator. Finally, this paper explores the relationships among these three types of covering rough sets.

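    To fix intuition, the sketch below implements one common pair of covering-based operators: the lower approximation is the union of covering blocks contained in the target set, and the upper approximation is the union of blocks that meet it. The paper studies three types whose upper approximations differ, so treat this as one illustrative choice; the list-of-sets representation is an assumption.

        def lower_approx(cover, X):
            # union of covering blocks entirely contained in X
            return set().union(*(K for K in cover if K <= X))

        def upper_approx(cover, X):
            # union of covering blocks that intersect X (one of several definitions in the literature)
            return set().union(*(K for K in cover if K & X))

        cover = [{1, 2}, {2, 3}, {4}, {3, 4, 5}]        # a covering of U = {1, ..., 5}
        X = {1, 2, 3}
        print(lower_approx(cover, X), upper_approx(cover, X))   # {1, 2, 3} {1, 2, 3, 4, 5}
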
  • Localized Outlying and Boundary Data Detection in Sensor Networks

    Page(s): 1145 - 1157
    PDF (2297 KB)

    This paper targets the identification of outlying sensors (that is, sensors reporting outlying readings) and the detection of the reach of events in sensor networks. Typical applications include the detection of the transportation front line of some vegetation or animalcule's growth over a certain geographical region. We propose and analyze two novel algorithms for outlying sensor identification and event boundary detection. These algorithms are purely localized and, thus, scale well to large sensor networks. Their computational overhead is low, since only simple numerical operations are involved. Simulation results indicate that these algorithms can clearly detect the event boundary and can identify outlying sensors with a high accuracy and a low false alarm rate when as many as 20 percent of sensors report outlying readings. Our work is exploratory in that the proposed algorithms can accept any kind of scalar values as inputs, a dramatic improvement over existing work, which takes only 0/1 decision predicates. Therefore, our algorithms are generic. They can be applied as long as "events" can be modeled by numerical values. Though designed for sensor networks, our algorithms can be applied to outlier detection and regional data analysis in spatial data mining.

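    The flavour of a purely localized test can be shown with a median-deviation rule: each sensor compares its reading only against those of its geometric neighbours. The radius, threshold, and median statistic below stand in for the paper's actual detection criteria, which are not reproduced, and event-boundary detection is not shown.

        import statistics

        def outlying_sensors(readings, positions, radius, threshold):
            # readings: {sensor_id: value}; positions: {sensor_id: (x, y)}
            flagged = []
            for s, (x, y) in positions.items():
                neighbours = [readings[t] for t, (u, v) in positions.items()
                              if t != s and (u - x) ** 2 + (v - y) ** 2 <= radius ** 2]
                # a sensor whose reading sits far from its neighbourhood median is suspect
                if neighbours and abs(readings[s] - statistics.median(neighbours)) > threshold:
                    flagged.append(s)
            return flagged
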
  • Join the IEEE Computer Society

    Page(s): 1158 - 1160
    PDF (356 KB)
    Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    PDF (83 KB)
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (108 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University