
IEEE Transactions on Knowledge and Data Engineering

Issue 2 • Mar/Apr 2002


Displaying Results 1 - 18 of 18
  • Analyzing outliers cautiously

    Publication Year: 2002, Page(s): 432 - 437
    Cited by:  Papers (11)

    Outliers are difficult to handle because some of them can be measurement errors, while others may represent phenomena of interest, something "significant" from the viewpoint of the application domain. Statistical and computational methods have been proposed to detect outliers, but further analysis of outliers requires much relevant domain knowledge. In our previous work (1994), we suggested a knowledge-based method for distinguishing between the measurement errors and phenomena of interest by modeling "real measurements" - how measurements should be distributed in an application domain. In this paper, we make this distinction by modeling measurement errors instead. This is a cautious approach to outlier analysis, which has been successfully applied to a medical problem and may find interesting applications in other domains such as science, engineering, finance, and economics.
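
    A heavily simplified sketch of the general idea follows (not the authors' knowledge-based method; the function, thresholds, and error patterns are all hypothetical): an outlier is labeled a probable measurement error only when some modeled error can "explain" it, and is otherwise kept as a possible phenomenon of interest.

```python
def classify_value(value, typical_low, typical_high, error_explanations):
    """Illustrative sketch: 'error_explanations' are hypothetical functions that each try
    to undo one modeled kind of measurement error (e.g., a decimal-point slip)."""
    if typical_low <= value <= typical_high:
        return "within normal range"
    for undo in error_explanations:
        corrected = undo(value)
        if corrected is not None and typical_low <= corrected <= typical_high:
            return "outlier: probable measurement error"
    return "outlier: possible phenomenon of interest"

# A body-temperature reading of 370 is explained by a decimal-point slip (37.0),
# while 55 is not, so only the latter is flagged as potentially interesting.
decimal_slip = lambda v: v / 10.0
for reading in (37.2, 370.0, 55.0):
    print(reading, "->", classify_value(reading, 35.0, 42.0, [decimal_slip]))
```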

  • Fast indexing and visualization of metric data sets using slim-trees

    Publication Year: 2002, Page(s): 244 - 260
    Cited by:  Papers (30)  |  Patents (2)

    Many recent database applications need to deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The slim-tree uses the triangle inequality to prune the distance calculations that are needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the slim-tree uses a minimal spanning tree to help with the splitting. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "slim-down" algorithm. This paper also presents a new tool in the slim-tree's arsenal of resources, aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new slim-tree algorithms lead to performance improvements. These results show that the slim-tree outperforms the M-tree by up to 200% for range queries. For insertion and splitting, the minimal-spanning-tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40% in range queries after applying the slim-down algorithm.
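
    The pruning idea at the heart of such metric access methods can be sketched in a few lines (illustrative only: the node layout below is not the slim-tree's page format, and Euclidean distance merely stands in for an arbitrary metric): a subtree is skipped whenever the triangle inequality guarantees that none of its objects can lie inside the query ball.

```python
from math import dist  # Euclidean distance as a stand-in for an arbitrary metric


class Node:
    """Illustrative metric-tree node: a routing object, a covering radius, and children
    (subtrees for internal nodes, plain data objects for leaves)."""
    def __init__(self, router, radius, children, is_leaf=False):
        self.router, self.radius, self.children, self.is_leaf = router, radius, children, is_leaf


def range_query(node, query, r, out):
    """Collect every object within distance r of the query object."""
    if node.is_leaf:
        out.extend(obj for obj in node.children if dist(query, obj) <= r)
        return
    for child in node.children:
        # Triangle inequality: every object o under 'child' satisfies
        # dist(query, o) >= dist(query, child.router) - child.radius, so the whole
        # subtree is pruned when that lower bound already exceeds r.
        if dist(query, child.router) <= r + child.radius:
            range_query(child, query, r, out)
```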

  • Efficient C4.5 [classification algorithm]

    Publication Year: 2002, Page(s): 438 - 444
    Cited by:  Papers (52)  |  Patents (1)

    We present an analytic evaluation of the runtime behavior of the C4.5 algorithm which highlights some efficiency improvements. Based on the analytic evaluation, we have implemented a more efficient version of the algorithm, called EC4.5. It improves on C4.5 by adopting the best among three strategies for computing the information gain of continuous attributes. All the strategies adopt a binary search of the threshold in the whole training set starting from the local threshold computed at a node. The first strategy computes the local threshold using the algorithm of C4.5, which, in particular, sorts cases by means of the quicksort method. The second strategy also uses the algorithm of C4.5, but adopts a counting sort method. The third strategy calculates the local threshold using a main-memory version of the RainForest algorithm, which does not need sorting. Our implementation computes the same decision trees as C4.5 with a performance gain of up to five times.
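
    For reference, the quantity that all three strategies compute is the usual information gain of a binary split on a continuous attribute. The naive sorted-scan version is sketched below (function names are illustrative); EC4.5's contribution is producing the same values faster, e.g. via counting sort or a RainForest-like main-memory structure, not changing the quantity itself.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0


def best_threshold(values, labels):
    """Information gain of the best binary split 'value <= t' on one continuous attribute."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (float("-inf"), None)                       # (gain, threshold)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, (gain, t))
    return best
```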

  • Solving constraint optimization problems from CLP-style specifications using heuristic search techniques

    Publication Year: 2002, Page(s): 353 - 368
    Cited by:  Papers (4)

    This paper presents a framework for efficiently solving logic formulations of combinatorial optimization problems using heuristic search techniques. In order to integrate cost, lower-bound and upper-bound specifications with conventional logic programming languages, we augment a constraint logic programming (CLP) language with embedded constructs for specifying the cost function and with a few higher-order predicates for specifying the lower and upper bound functions. We illustrate how this simple extension vastly enhances the ease with which optimization problems involving combinations of Min and Max can be specified in the extended language CLP* and we show that CSLDNF (Constraint SLD resolution with Negation as Failure) resolution schemes are not efficient for solving optimization problems specified in this language. Therefore, we describe how any problem specified using CLP* can be converted into an implicit AND/OR graph, and present an algorithm called GenSolve, which performs branch-and-bound search using upper and lower bound estimates, thus exploiting the full pruning power of heuristic search techniques. A technical analysis of GenSolve is provided. We also provide experimental results comparing various control strategies for solving CLP* programs.
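
    As an illustration of the search scheme, the sketch below is a generic best-first branch-and-bound, not GenSolve itself; every parameter is a hypothetical callback, with upper_bound(node) assumed to return the cost of some feasible completion of the partial solution at that node.

```python
import heapq
from itertools import count


def branch_and_bound(root, expand, lower_bound, upper_bound, is_complete, cost):
    """Generic sketch: prune any node whose lower bound cannot beat the tightest
    upper bound (incumbent) seen so far."""
    tick = count()                          # tie-breaker so the heap never compares nodes
    incumbent, incumbent_cost = None, float("inf")
    frontier = [(lower_bound(root), next(tick), root)]
    while frontier:
        lb, _, node = heapq.heappop(frontier)
        if lb > incumbent_cost:
            continue                                      # cannot improve the best solution
        if is_complete(node):
            if cost(node) < incumbent_cost:
                incumbent, incumbent_cost = node, cost(node)
            continue
        incumbent_cost = min(incumbent_cost, upper_bound(node))   # estimate tightens pruning
        for child in expand(node):
            lb_child = lower_bound(child)
            if lb_child <= incumbent_cost:
                heapq.heappush(frontier, (lb_child, next(tick), child))
    return incumbent, incumbent_cost
```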

  • Redefining clustering for high-dimensional applications

    Publication Year: 2002, Page(s): 210 - 225
    Cited by:  Papers (31)

    Clustering problems are well-known in the database literature for their use in numerous applications, such as customer segmentation, classification, and trend analysis. High-dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that, in high-dimensional data, even the concept of proximity or clustering may not be meaningful. We introduce a very general concept of projected clustering which is able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than the currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high-dimensional applications by searching for hidden subspaces with clusters which are created by interattribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable and can be traded off against clustering accuracy.
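
    The kind of summary structure involved can be sketched as follows (illustrative, and not the paper's exact definition of extended cluster feature vectors): per-cluster counts, linear sums, and pairwise second moments are enough to recover the cluster's covariance incrementally, and the directions of least spread then give an arbitrarily oriented candidate subspace for that cluster.

```python
import numpy as np


class ClusterSummary:
    """Illustrative summary: count, per-dimension linear sum, and pairwise second moments."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), np.zeros((dim, dim))

    def add(self, point):
        x = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += np.outer(x, x)

    def covariance(self):
        mean = self.ls / self.n
        return self.ss / self.n - np.outer(mean, mean)

    def tight_subspace(self, k):
        """Eigenvectors of the k smallest eigenvalues: the arbitrarily oriented directions
        in which the points summarized so far are packed most tightly."""
        _, eigvecs = np.linalg.eigh(self.covariance())   # eigenvalues in ascending order
        return eigvecs[:, :k]
```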

  • An ElGamal-like cryptosystem for enciphering large messages

    Publication Year: 2002, Page(s): 445 - 446
    Cited by:  Papers (14)

    In practice, we usually require two cryptosystems, an asymmetric one and a symmetric one, to encrypt a large confidential message. The asymmetric cryptosystem is used to deliver the secret key SK, while the symmetric cryptosystem is used to encrypt the large confidential message with the secret key SK. In this article, we propose a simple cryptosystem which allows a large confidential message to be encrypted efficiently. Our scheme is based on the Diffie-Hellman key distribution scheme together with the ElGamal cryptosystem.
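
    The flow can be illustrated with a toy hybrid in the same spirit (not the paper's exact scheme; the parameters below are far too small to be secure and the keystream construction is only a stand-in for a proper symmetric cipher): a single ElGamal-style exponentiation establishes a Diffie-Hellman value, which is then stretched into a mask for a message of arbitrary length.

```python
import hashlib
import secrets

p = 2**32 - 5        # a small prime modulus, toy-sized; real schemes use large groups
g = 5                # illustrative generator


def keygen():
    x = secrets.randbelow(p - 2) + 1          # private key
    return x, pow(g, x, p)                    # (private x, public y = g^x mod p)


def _keystream(shared, n):
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(shared.to_bytes(32, "big") + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


def encrypt(y, message):
    k = secrets.randbelow(p - 2) + 1
    c1 = pow(g, k, p)                         # ephemeral value g^k
    shared = pow(y, k, p)                     # Diffie-Hellman value y^k = g^(xk)
    return c1, bytes(m ^ s for m, s in zip(message, _keystream(shared, len(message))))


def decrypt(x, c1, c2):
    shared = pow(c1, x, p)                    # (g^k)^x equals g^(xk)
    return bytes(c ^ s for c, s in zip(c2, _keystream(shared, len(c2))))


x, y = keygen()
c1, c2 = encrypt(y, b"a confidential message far longer than a single group element")
assert decrypt(x, c1, c2) == b"a confidential message far longer than a single group element"
```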

  • On scheduling atomic and composite continuous media objects

    Publication Year: 2002, Page(s): 447 - 455
    Cited by:  Papers (4)

    In multiuser multimedia information systems (e.g., movie-on-demand, digital editing), scheduling the retrievals of continuous media objects becomes a challenging task. This is because of both intra- and interobject time dependencies. Intraobject time dependency refers to the real-time display requirement of a continuous media object. Interobject time dependency is the temporal relationship defined among multiple continuous media objects. In order to compose tailored multimedia presentations, a user might define complex time dependencies among multiple continuous media objects with various lengths and display bandwidths. Scheduling the retrieval tasks corresponding to the components of such a presentation in order to respect both inter- and intratask time dependencies is the focus of this study. To tackle this task scheduling problem (CRS), we start with a simpler scheduling problem (ARS) where there is no intertask time dependency (e.g., movie-on-demand). Next, we investigate an augmented version of ARS (termed ARS+) where requests reserve displays in advance (e.g., reservation-based movie-on-demand). Finally, we extend our techniques proposed for ARS and ARS+ to address the CRS problem. We also provide formal definitions of these scheduling problems and proofs of their NP-hardness.

  • A comparative study of various nested normal forms

    Publication Year: 2002, Page(s): 369 - 385
    Cited by:  Papers (5)

    As object-relational databases (ORDBs) become popular in the industry, it is important for database designers to produce database schemes with good properties in these new kinds of databases. One distinguishing feature of an ORDB is that its tables may not be in first normal form. Hence, ORDBs may contain nested relations along with other collection types. To help the design process of an ORDB, several normal forms for nested relations have recently been defined, and some of them are called nested normal forms. In this paper, we investigate four nested normal forms, which are NNF [20], NNF [21], NNF [23], and NNF [25], with respect to generalizing 4NF and BCNF, reducing redundant data values, and design flexibility. Another major contribution of this paper is that we provide an improved algorithm that generates nested relation schemes in NNF [20] from an α-acyclic database scheme, which is the most general type of acyclic database scheme. After presenting the algorithm for NNF [20], the algorithms of all four nested normal forms and the nested database schemes that they generate are compared. We discovered that when the given set of MVDs is not conflict-free, NNF [20] is inferior to the other three nested normal forms in reducing redundant data values. However, in all of the other cases considered in this paper, NNF [20] is at least as good as the other three nested normal forms.

  • A content-based authorization model for digital libraries

    Publication Year: 2002, Page(s): 296 - 315
    Cited by:  Papers (23)  |  Patents (1)

    Digital libraries (DLs) introduce several challenging requirements with respect to the formulation, specification and enforcement of adequate data protection policies. Unlike conventional database environments, a DL environment is typically characterized by a dynamic user population, often making accesses from remote locations, and by an extraordinarily large amount of multimedia information, stored in a variety of formats. Moreover, in a DL environment, access policies are often specified based on user qualifications and characteristics, rather than on user identity (e.g. a user can be given access to an R-rated video only if he or she is more than 18 years old). Another crucial requirement is the support for content-dependent authorizations on digital library objects (e.g. all documents containing discussions on how to operate guns must be made available only to users who are 18 or older). Since traditional authorization models do not adequately meet the access control requirements typical of DLs, we propose a content-based authorization model that is suitable for a DL environment. Specifically, the most innovative features of our authorization model are: (1) flexible specification of authorizations based on the qualifications and (positive and negative) characteristics of users, (2) both content-dependent and content-independent access control to digital library objects, and (3) the varying granularity of authorization objects, ranging from sets of library objects to specific portions of objects.
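
    The flavor of such policies can be shown with a tiny, hypothetical check (not the paper's authorization language) that combines a characteristic-based condition on the user with a content-dependent condition on the object:

```python
from dataclasses import dataclass, field


@dataclass
class Credentials:
    """User characteristics and qualifications; identity is deliberately not used."""
    age: int
    qualifications: set = field(default_factory=set)


@dataclass
class LibraryObject:
    content_tags: set            # e.g., produced by content analysis of the object
    rating: str = "G"


def authorized(user, obj):
    """Illustrative policy only: deny by characteristic (age) or by content tag."""
    if obj.rating == "R" and user.age < 18:
        return False                              # characteristic-based denial
    if "firearm-operation" in obj.content_tags and user.age < 18:
        return False                              # content-dependent denial
    return True


# A 17-year-old requesting an R-rated video is denied.
print(authorized(Credentials(age=17), LibraryObject(content_tags=set(), rating="R")))  # False
```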

  • SEAM: A state-entity-activity-model for a well-defined workflow development methodology

    Publication Year: 2002, Page(s): 415 - 431
    Cited by:  Papers (11)

    Current conceptual workflow models use either informally defined conceptual models or several formally defined conceptual models that capture different aspects of the workflow, e.g., the data, process, and organizational aspects of the workflow. To the best of our knowledge, there are no algorithms that can amalgamate these models to yield a single view of reality. A fragmented conceptual view is useful for systems analysis and documentation. However, it fails to realize the potential of conceptual models to provide a convenient interface to automate the design and management of workflows. First, as a step toward accomplishing this objective, we propose SEAM (State-Entity-Activity-Model), a conceptual workflow model defined in terms of set theory. Second, no attempt has been made, to the best of our knowledge, to incorporate time into a conceptual workflow model. SEAM incorporates the temporal aspect of workflows. Third, we apply SEAM to a real-life organizational unit's workflows. In this work, we show a subset of the workflows modeled for this organization using SEAM. We also demonstrate, via a prototype application, how the SEAM schema can be implemented on a relational database management system. We present the lessons we learned about the advantages obtained for the organization and, for developers who choose to use SEAM, we also present potential pitfalls in using the SEAM methodology to build workflow systems on relational platforms. The information contained in this work is sufficient to allow application developers to utilize SEAM as a methodology to analyze, design, and construct workflow applications on current relational database management systems. The definition of SEAM as a context-free grammar, the definition of its semantics, and its mapping to relational platforms should also be sufficient to allow the construction of an automated workflow design and construction tool with SEAM as the user interface.

  • A performance study of robust load sharing strategies for distributed heterogeneous Web server systems

    Publication Year: 2002, Page(s): 398 - 414
    Cited by:  Papers (13)  |  Patents (3)

    Replication of information across multiple servers is becoming a common approach to support popular Web sites. A distributed architecture with some mechanisms to assign client requests to Web servers is more scalable than any centralized or mirrored architecture. In this paper, we consider distributed systems in which the Authoritative Domain Name Server (ADNS) of the Web site takes the request dispatcher role by mapping the URL hostname into the IP address of a visible node, that is, a Web server or a Web cluster interface. This architecture can support local and geographical distribution of the Web servers. However, the ADNS controls only a very small fraction of the requests reaching the Web site because the address mapping is not requested for each client access. Indeed, to reduce Internet traffic, address resolution is cached at various name servers for a time-to-live (TTL) period. This opens an entirely new set of problems that traditional centralized schedulers of parallel/distributed systems do not have to face. The heterogeneity assumption on Web node capacity, which is much more likely in practice, increases the order of complexity of the request assignment problem and severely affects the applicability and performance of the existing load sharing algorithms. We propose new assignment strategies, namely adaptive TTL schemes, which tailor the TTL value for each address mapping instead of using a fixed value for all mapping requests. The adaptive TTL schemes are able to address both the nonuniformity of client requests and the heterogeneous capacity of Web server nodes. Extensive simulations show that the proposed algorithms are very effective in avoiding node overload, even for high levels of heterogeneity and limited ADNS control.
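
    A minimal sketch of the idea (illustrative only, not the exact adaptive TTL formulas evaluated in the paper): the TTL returned with an address mapping grows with the capacity of the chosen server and shrinks for client domains that generate disproportionate traffic, so the ADNS regains control of hot domains sooner.

```python
def adaptive_ttl(base_ttl, server_capacity, max_capacity, domain_rate, mean_domain_rate):
    """Hypothetical rule: scale the base TTL by relative server capacity and by the
    inverse of the requesting domain's relative request rate."""
    capacity_factor = server_capacity / max_capacity            # in (0, 1]
    load_factor = min(mean_domain_rate / max(domain_rate, 1e-9), 1.0)
    return max(1, round(base_ttl * capacity_factor * load_factor))


# A half-capacity server answering a domain four times hotter than average is cached
# for roughly one eighth of the base TTL.
print(adaptive_ttl(base_ttl=240, server_capacity=0.5, max_capacity=1.0,
                   domain_rate=4.0, mean_domain_rate=1.0))      # 30
```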

  • Object-orientated design of digital library platforms for multiagent environments

    Publication Year: 2002, Page(s): 281 - 295
    Cited by:  Papers (1)

    The application of an object-oriented (OO) methodology to the design of a platform for heterogeneous digital libraries with multiagent technologies is demonstrated. Emphasis is placed on maximizing the autonomy of the query processing activity. The Fusion OO paradigm is specifically employed as the basis for the design process due to the significance attributed to the development of object interfaces under static and dynamic conditions. Finally, characteristics of the proposed Domain Index Server (DIS) platform are contrasted with those of an alternative platform (the University of Michigan Digital Library, UMDL) by way of the respective Fusion descriptions. This identifies a different emphasis on the interface design between objects in the two platforms: the DIS system uses more dynamic links while the UMDL system focuses on permanent and constant links. Simulation of the two platforms provides performance data that demonstrates the higher capacity of the DIS scheme.

  • Load balancing of parallelized information filters

    Publication Year: 2002, Page(s): 456 - 461

    We investigate the data-parallel implementation of a set of "information filters" used to rule out uninteresting data from a database or data stream. We develop an analytic model for the costs and advantages of load rebalancing for the parallel filtering processes, as well as a quick heuristic for its desirability. Our model uses binomial models of the filter processes and fits key parameters to the results of extensive simulations. Experiments confirm our model. Rebalancing should pay off whenever processor communications costs are high. Further experiments showed it can also pay off even with low communications costs for 16-64 processes and 1-10 data items per processor; then, imbalances can increase processing time by up to 52 percent in representative cases, and rebalancing can increase it by 78 percent, so our quick predictive model can be valuable. Results also show that our proposed heuristic rebalancing criterion gives close to optimal balancing. We also extend our model to handle variations in filter processing time per data item.
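
    The trade-off being modeled can be sketched as a toy calculation under a binomial filtering model (this is not the paper's fitted model or its heuristic; all parameters are illustrative): rebalancing is worthwhile when the expected reduction in the slowest processor's remaining work outweighs the cost of moving data.

```python
from statistics import mean


def should_rebalance(items_per_proc, pass_prob, time_per_item, comm_cost):
    """Each item survives the filter independently with probability pass_prob, so the
    expected surviving load on a processor with n items is n * pass_prob."""
    expected_load = [n * pass_prob for n in items_per_proc]
    current_makespan = max(expected_load) * time_per_item      # slowest processor dominates
    balanced_makespan = mean(expected_load) * time_per_item    # ideal after perfect rebalancing
    return (current_makespan - balanced_makespan) > comm_cost


# A badly skewed assignment is worth rebalancing despite a nontrivial communication cost.
print(should_rebalance([10, 2, 1, 1], pass_prob=0.5, time_per_item=1.0, comm_cost=1.5))  # True
```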

  • Weighted fuzzy reasoning using weighted fuzzy Petri nets

    Publication Year: 2002, Page(s): 386 - 397
    Cited by:  Papers (33)

    This paper presents a Weighted Fuzzy Petri Net model (WFPN) and proposes a weighted fuzzy reasoning algorithm for rule-based systems based on Weighted Fuzzy Petri Nets. The fuzzy production rules in the knowledge base of a rule-based system are modeled by Weighted Fuzzy Petri Nets, where the truth values of the propositions appearing in the fuzzy production rules and the certainty factors of the rules are represented by fuzzy numbers. Furthermore, the weights of the propositions appearing in the rules are also represented by fuzzy numbers. The proposed weighted fuzzy reasoning algorithm can allow the rule-based systems to perform fuzzy reasoning in a more flexible and more intelligent manner.
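
    A single rule firing can be sketched as follows (illustrative only: the paper represents truth values, weights, and certainty factors as fuzzy numbers, whereas plain floats are used here for brevity):

```python
def fire_weighted_rule(antecedent_truths, weights, certainty_factor):
    """Toy weighted fuzzy rule  IF p1 (w1) AND p2 (w2) ... THEN q  (CF):
    the consequent's truth value is the weight-averaged antecedent truth,
    scaled by the rule's certainty factor."""
    weighted = sum(t * w for t, w in zip(antecedent_truths, weights)) / sum(weights)
    return weighted * certainty_factor


# Two antecedents with truth values 0.9 and 0.6, weights 2 and 1, and rule CF 0.8.
print(fire_weighted_rule([0.9, 0.6], [2.0, 1.0], 0.8))   # 0.64
```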

  • Production systems with negation as failure

    Publication Year: 2002, Page(s): 336 - 352
    Cited by:  Papers (2)

    We study action rule-based systems with two forms of negation, namely classical negation and "negation as failure to find a course of action". We show, by means of several examples, that adding negation as failure to such systems increases their expressiveness in the sense that real-life problems can be represented in a natural and simple way. Then we address the problem of providing a formal declarative semantics to these extended systems by adopting an argumentation-based approach which has been shown to be a simple unifying framework for understanding the declarative semantics of various nonmonotonic formalisms. In this way, we naturally define the grounded (well-founded), stable and preferred semantics for production systems with negation as failure. Next, we characterize the class of stratified production systems, which enjoy the properties that the above-mentioned semantics coincide and that negation as failure to find a course of action can be computed by a simple bottom-up operator. Stratified production systems can be implemented on top of conventional production systems in two ways. The first way corresponds to the understanding of stratification as a form of priority assignment between rules. We show that this implementation, though sound, is not complete in the general case. Hence, we propose a second implementation by means of an algorithm which transforms a finite stratified production system into a classical one. This is a sound and complete implementation, though it is computationally hard.
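
    For the stratified case, the bottom-up computation is analogous to that of stratified logic programs and can be sketched as follows (illustrative only: rules are reduced to propositional triples rather than action rules, and all atom names are hypothetical). Within each stratum, rules are applied to a fixpoint, and a negated condition only refers to atoms whose strata are already fully computed.

```python
def evaluate_stratified(strata, facts):
    """strata: list (lowest first) of rules (positive_atoms, negated_atoms, head).
    'Negation as failure' holds for an atom exactly when it was not derived
    in the already-completed lower strata."""
    derived = set(facts)
    for stratum in strata:
        changed = True
        while changed:                         # apply this stratum's rules to a fixpoint
            changed = False
            for pos, neg, head in stratum:
                if pos <= derived and not (neg & derived) and head not in derived:
                    derived.add(head)
                    changed = True
    return derived


strata = [
    [({"bird"}, set(), "candidate_flyer")],
    [({"candidate_flyer"}, {"penguin"}, "can_fly")],   # fires only if 'penguin' was not derived
]
print(evaluate_stratified(strata, {"bird"}))           # {'bird', 'candidate_flyer', 'can_fly'}
```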

  • Trigger condition testing and view maintenance using optimized discrimination networks

    Publication Year: 2002, Page(s): 261 - 280
    Cited by:  Papers (1)

    This paper presents a structure that can be used both for trigger condition testing and view materialization in active databases, along with a study of techniques for optimizing the structure. The structure presented is known as a discrimination network. The type of discrimination network introduced and studied in this paper is a highly general one, which we call the Gator network. The structure of several alternative Gator network optimizers is described, along with a discussion of optimizer performance, output quality and accuracy. The optimizers can choose an efficient Gator network for testing the conditions of a set of triggers or optimizing maintenance of a set of views, given information about the structure of the triggers or views, database size, predicate selectivity and update frequency distribution. The efficiency of optimized Gator networks relative to alternatives is analyzed. The results indicate that, overall, Gator networks can be optimized effectively and can give excellent performance for trigger condition testing and materialization of views.

  • A database perspective on geospatial data modeling

    Publication Year: 2002, Page(s): 226 - 243
    Cited by:  Papers (4)  |  Patents (4)

    We study the representation and manipulation of geospatial information in a database management system (DBMS). The geospatial data model that we use as a basis hinges on a complex object model, whose set and tuple constructors make it efficient for defining not only collections of geographic objects but also relationships among these objects. In addition, it allows easy manipulation of nonbasic types, such as spatial data types. We investigate the mapping of our reference model onto major commercial DBMS models, namely a relational model extended to abstract data types (ADT) and an object-oriented model. Our analysis shows the strengths and limits of the two model types for handling highly structured data with spatial components.
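
    The role of the set and tuple constructors can be illustrated with a small, hypothetical geographic object (plain Python literals stand in for the model's constructors and for a spatial abstract data type):

```python
# Tuple constructor -> a record of named fields; set constructor -> a collection of values;
# the boundary field stands in for a spatial ADT (here simply a polygon as a vertex list).
land_parcel = {
    "id": 42,
    "owner": {"name": "A. Smith", "address": {"city": "Grenoble"}},    # nested tuples
    "buildings": {                                                     # a set of tuples
        ("B1", "residential"),
        ("B2", "garage"),
    },
    "boundary": [(0.0, 0.0), (0.0, 50.0), (30.0, 50.0), (30.0, 0.0)],  # spatial ADT stand-in
}

# Relationships among objects are expressed by nesting or by sets of references:
district = {"name": "Centre", "parcels": {42}}    # refers to parcels by identifier
```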

  • A high-level Petri nets-based approach to verifying task structures

    Publication Year: 2002, Page(s): 316 - 335
    Cited by:  Papers (7)  |  Patents (1)

    As knowledge-based system technology gains wider acceptance, there is an increasing need to verify knowledge-based systems to improve their reliability and quality. Traditionally, attention has been given to verifying knowledge-based systems at the knowledge level, which only addresses structural errors such as redundancy, conflict and circularity in rule bases. No semantic errors (such as inconsistency at the requirements specification level) have been checked. In this paper, we propose the use of task structures for modeling user requirements and domain knowledge at the requirements specification level, and the use of high-level Petri nets for expressing and verifying the task structure-based specifications. Issues in mapping task structures onto high-level Petri nets are identified, e.g. the representation of task decomposition, constraints and the state model; the distinction between the "follow" and "immediately follow" operators; and the "composition" operator in task structures. The verification of task structures using high-level Petri nets is performed on model specifications of a task through constraint satisfaction and relaxation techniques, and on process specifications of the task based on the reachability property and the notion of specificity.
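
    As a flavor of the reachability-based part of such verification (a toy exploration for an ordinary place/transition net, not the paper's high-level nets or its constraint-satisfaction techniques):

```python
from collections import deque


def reachable_markings(initial, transitions, limit=10_000):
    """Breadth-first enumeration of markings (tuples of token counts per place).
    Each transition is a pair (consume, produce) of per-place token counts."""
    seen, queue = {initial}, deque([initial])
    while queue and len(seen) < limit:
        marking = queue.popleft()
        for consume, produce in transitions:
            if all(marking[p] >= consume[p] for p in range(len(marking))):   # enabled?
                new = tuple(marking[p] - consume[p] + produce[p] for p in range(len(marking)))
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
    return seen


# Two places and one transition moving a token from place 0 to place 1.
print(reachable_markings((1, 0), [((1, 0), (0, 1))]))   # {(1, 0), (0, 1)}
```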


Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

