IEEE Transactions on Knowledge and Data Engineering

Issue 8 • Aug. 2004

  • [Front cover]

    Publication Year: 2004 , Page(s): c1
    PDF (147 KB)
    Freely Available from IEEE
  • [Inside front cover]

    Publication Year: 2004 , Page(s): c2
    PDF (77 KB)
    Freely Available from IEEE
  • Influential rule search scheme (IRSS) - a new fuzzy pattern classifier

    Publication Year: 2004 , Page(s): 881 - 893
    Cited by:  Papers (10)
    PDF (1248 KB) | HTML

    Automatic generation of a fuzzy rule base and membership functions from an input-output data set, for reliable construction of an adaptive fuzzy inference system, has become an important area of research interest. We propose a new robust, fast-acting adaptive fuzzy pattern classification scheme, named the influential rule search scheme (IRSS). In IRSS, the rules that are most influential in contributing to the error produced by the adaptive fuzzy system are identified at the end of each epoch and subsequently modified for satisfactory performance. This fuzzy rule base adjustment scheme is accompanied by an output membership function adaptation scheme for fine-tuning the fuzzy system architecture. This iterative method has shown a relatively high speed of convergence. Performance of the proposed IRSS is compared with other existing pattern classification schemes by applying it to Fisher's iris data and the Wisconsin breast cancer data set.

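    As a concrete illustration of the influential-rule idea, the sketch below trains a toy zero-order fuzzy classifier and, each epoch, modifies only the rule that contributes most to the error. This is a minimal sketch under assumptions of ours (Gaussian memberships, a simple influence measure, a gradient-style update), not the authors' IRSS.

    import numpy as np

    def gauss(x, c, s):
        # Gaussian membership value of input x for a rule centered at c
        return np.exp(-((x - c) ** 2) / (2 * s ** 2))

    def fire(X, centers, sigmas):
        # firing strength of each rule: product t-norm over input dimensions
        return gauss(X[:, None, :], centers[None], sigmas[None]).prod(axis=2)

    def train_irss_like(X, y, centers, sigmas, consequents, epochs=50, lr=0.5):
        for _ in range(epochs):
            w = fire(X, centers, sigmas)                      # (n, rules)
            yhat = (w * consequents).sum(1) / (w.sum(1) + 1e-9)
            err = yhat - y
            influence = np.abs(w.T @ err)                     # per-rule error share
            r = influence.argmax()                            # most influential rule
            grad = ((w[:, r] / (w.sum(1) + 1e-9)) * err).mean()
            consequents[r] -= lr * grad                       # modify only that rule
        return consequents

    # toy demo: two inputs, three rules (all values invented)
    rng = np.random.default_rng(0)
    X = rng.random((20, 2)); y = (X.sum(1) > 1).astype(float)
    c = rng.random((3, 2)); s = np.full((3, 2), 0.5); q = np.zeros(3)
    print(train_irss_like(X, y, c, s, q))
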
  • Learning functions using randomized genetic code-like transformations: probabilistic properties and experimentations

    Publication Year: 2004 , Page(s): 894 - 908
    PDF (880 KB) | HTML

    Inductive learning of nonlinear functions plays an important role in constructing predictive models and classifiers from data. We explore a novel randomized approach, proposed elsewhere [H. Kargupta (2001)], [H. Kargupta et al., (2002)], to constructing linear representations of nonlinear functions. This approach makes use of randomized codebooks, called genetic code-like transformations (GCTs), to construct an approximately linear representation of a nonlinear target function. We first derive some of the results presented elsewhere [H. Kargupta et al., (2002)] in a more general context. Next, we investigate different probabilistic and limit properties of GCTs. We also present several experimental results to demonstrate the potential of this approach.

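    The sketch below shows only the mechanical shape of such a transformation under assumed details (a random many-to-one binary codebook per input bit, with a codeword drawn at random per sample); the paper's actual construction and its probabilistic properties are what the article develops.

    import numpy as np

    rng = np.random.default_rng(1)
    CODE_LEN = 4

    def make_decoder():
        # random many-to-one map: half of all 2**CODE_LEN codewords decode
        # to 0, the other half to 1 (genetic-code-like redundancy)
        return rng.permutation(np.repeat([0, 1], 2 ** (CODE_LEN - 1)))

    def encode(X, decoders):
        # replace each input bit by a random codeword that decodes to it
        out = []
        for row in X:
            cws = [rng.choice(np.flatnonzero(d == v)) for v, d in zip(row, decoders)]
            bits = [(c >> np.arange(CODE_LEN)) & 1 for c in cws]
            out.append(np.concatenate(bits))
        return np.array(out)

    X = rng.integers(0, 2, size=(5, 3))          # 5 samples, 3 input bits
    decoders = [make_decoder() for _ in range(X.shape[1])]
    Z = encode(X, decoders)
    print(X.shape, "->", Z.shape)                # (5, 3) -> (5, 12)
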
  • Efficient disk-based K-means clustering for relational databases

    Publication Year: 2004 , Page(s): 909 - 921
    Cited by:  Papers (14)  |  Patents (1)
    PDF (872 KB) | HTML

    K-means is one of the most popular clustering algorithms. We introduce an efficient disk-based implementation of K-means. The proposed algorithm is designed to work inside a relational database management system and can cluster large data sets of very high dimensionality. In general, it requires only three scans over the data set. It is optimized for heavy disk I/O, and its memory requirements are low. Its parameters are easy to set. An extensive experimental section evaluates quality of results and performance. The proposed algorithm is compared against the Standard K-means algorithm as well as the Scalable K-means algorithm.

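    A minimal sketch of the disk-based idea, assuming a CSV file of numeric rows: the data is streamed in fixed-size blocks, and only per-cluster sufficient statistics (count, sum) are kept in memory, one scan per iteration. The chunking scheme and seeding are illustrative, not the paper's relational implementation.

    import numpy as np

    def kmeans_disk(path, k, dim, iters=3, block=10_000):
        centers = None
        for _ in range(iters):                        # one scan per iteration
            counts, sums = np.zeros(k), np.zeros((k, dim))
            with open(path) as f:
                while True:
                    rows = [r for r in (f.readline() for _ in range(block)) if r.strip()]
                    if not rows:
                        break
                    X = np.loadtxt(rows, delimiter=",", ndmin=2)
                    if centers is None:               # seed from the first block
                        centers = X[:k].copy()
                    lab = ((X[:, None] - centers[None]) ** 2).sum(2).argmin(1)
                    for j in range(k):                # per-cluster count and sum
                        counts[j] += (lab == j).sum()
                        sums[j] += X[lab == j].sum(0)
            nz = counts > 0
            centers[nz] = sums[nz] / counts[nz, None]
        return centers

    # tiny demo: write a toy CSV, then cluster it with scans over disk
    with open("pts.csv", "w") as f:
        f.writelines(f"{a},{b}\n" for a, b in [(0, 0), (0, 1), (10, 10), (10, 11)])
    print(kmeans_disk("pts.csv", k=2, dim=2))
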
  • Mining constrained gradients in large databases

    Publication Year: 2004 , Page(s): 922 - 938
    Cited by:  Papers (17)
    PDF (832 KB) | HTML

    Many data analysis tasks can be viewed as search or mining in a multidimensional space (MDS). In such MDSs, dimensions capture potentially important factors for given applications, and cells represent combinations of values for those factors. To systematically analyze data in an MDS, an interesting notion called "cubegrade" was recently introduced by Imielinski et al. [2002], which focuses on notable changes in measures in an MDS by comparing a cell (which we refer to as the probe cell) with its gradient cells, namely, its ancestors, descendants, and siblings. We call such queries gradient analysis queries (GQs). Since an MDS can contain billions of cells, it is important to answer GQs efficiently. We focus on developing efficient methods for mining GQs constrained by certain (weakly) antimonotone constraints. Instead of conducting an independent gradient-cell search once per probe cell, which is inefficient due to repeated work, we propose an efficient algorithm, LiveSet-Driven, that finds all good gradient-probe cell pairs in one search pass. It utilizes measure-value analysis and dimension-match analysis in a set-oriented manner to achieve bidirectional pruning between the sets of hopeful probe cells and hopeful gradient cells. Moreover, it adopts a hypertree structure and an H-cubing method to compress data and to maximize sharing of computation. Our performance study shows that this algorithm is efficient and scalable. In addition to data cubes, we extend our study to another important scenario: mining constrained gradients in transactional databases where each item is associated with measures such as price. Such transactional databases can be viewed as sparse MDSs where items represent dimensions, although they have significantly different characteristics from data cubes. We outline efficient mining methods for this problem.

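    The toy sketch below illustrates a gradient query, not the LiveSet-Driven algorithm itself: it materializes a small cube and reports cells whose average measure differs from an ancestor cell's by more than an (invented) ratio threshold; descendants, siblings, and all pruning are omitted.

    from collections import defaultdict
    from itertools import combinations

    rows = [({"city": "NY", "item": "TV"}, 500),
            ({"city": "NY", "item": "PC"}, 900),
            ({"city": "SF", "item": "TV"}, 400)]

    cube = defaultdict(list)                 # cell -> measures of its rows
    for dims, m in rows:
        keys = sorted(dims.items())
        for r in range(len(keys) + 1):       # the cell and all its ancestors
            for sub in combinations(keys, r):
                cube[frozenset(sub)].append(m)

    avg = {cell: sum(v) / len(v) for cell, v in cube.items()}
    for cell, a in avg.items():
        for anc in avg:                      # ancestor = proper subset of cell
            if anc < cell and a / avg[anc] >= 1.2:
                print(dict(cell), "vs", dict(anc), "ratio", round(a / avg[anc], 2))
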
  • Privacy: a machine learning view

    Publication Year: 2004 , Page(s): 939 - 948
    Cited by:  Papers (3)
    PDF (288 KB) | HTML

    The problem of disseminating a data set for machine learning while controlling the disclosure of data source identity is described using a commuting diagram of functions. This formalization is used to present and analyze an optimization problem balancing privacy and data utility requirements. The analysis points to the application of a generalization mechanism for maintaining privacy in view of machine learning needs. We present new proofs of NP-hardness of the problem of minimizing information loss while satisfying a set of privacy requirements, both with and without the addition of a particular uniform coding requirement. As an initial analysis of the approximation properties of the problem, we show that the cell suppression problem with a constant number of attributes can be approximated within a constant factor. As a side effect, we present proofs of NP-hardness of the minimum k-union and maximum k-intersection problems and their parallel versions. Bounded versions of these problems are also shown to be approximable within a constant factor.

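    As a rough illustration of generalization for privacy (the paper's formal model is more general), the sketch below coarsens a quasi-identifier until every value occurs at least k times, choosing the least coarsening that works. The hierarchy (dropping trailing ZIP digits) and the data are invented.

    from collections import Counter

    def generalize(z, level):
        # drop `level` trailing ZIP digits, e.g. level 2: 02139 -> 021**
        return z[: len(z) - level] + "*" * level

    def anonymize(zips, k):
        for level in range(len(zips[0]) + 1):    # least information loss first
            coarse = [generalize(z, level) for z in zips]
            if min(Counter(coarse).values()) >= k:
                return coarse, level
        return None

    print(anonymize(["02139", "02142", "02139", "02141"], k=2))
    # (['0213*', '0214*', '0213*', '0214*'], 1)
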
  • TopCat: data mining for topic identification in a text corpus

    Publication Year: 2004 , Page(s): 949 - 964
    Cited by:  Papers (22)
    PDF (1472 KB) | HTML

    TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques: frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground-truth news corpus, showing that this technique is effective in identifying topics in collections of news articles.

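    A minimal sketch of TopCat's first step under simple assumptions: each article is already reduced to a set of named entities, and frequent entity pairs are mined to seed topics. Entity extraction and the hypergraph partitioning step are omitted, and the data and support threshold are invented.

    from collections import Counter
    from itertools import combinations

    articles = [{"NATO", "Kosovo", "Milosevic"},
                {"NATO", "Kosovo", "airstrike"},
                {"Fed", "rates", "Greenspan"},
                {"Fed", "rates", "inflation"}]

    pairs = Counter(p for a in articles for p in combinations(sorted(a), 2))
    minsup = 2                                   # minimum support, invented
    print([p for p, c in pairs.items() if c >= minsup])
    # [('Kosovo', 'NATO'), ('Fed', 'rates')]
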
  • An efficient algorithm to compute differences between structured documents

    Publication Year: 2004 , Page(s): 965 - 979
    Cited by:  Papers (8)
    PDF (1304 KB) | HTML

    SGML and XML are having a profound impact on data modeling and processing. We present an efficient algorithm to compute differences between old and new versions of an SGML/XML document. The difference between the two versions can be represented as an edit script that transforms one document tree into the other. The proposed algorithm is a hybrid of bottom-up and top-down methods: matching relationships between nodes in the two versions are produced in a bottom-up manner, and a top-down breadth-first search then computes an edit script. Faster matching is achieved because the algorithm does not need to investigate the possible existence of matchings for all nodes. Furthermore, it can detect structurally meaningful changes, such as the movement and copying of a subtree, as well as simple changes to a node itself, such as insertion, deletion, and update.

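    The sketch below is a much-simplified version of the matching idea, not the paper's algorithm: each subtree gets a bottom-up signature so identical subtrees match in one pass, and unmatched top-level children become insert/delete operations; moves, copies, and updates are omitted.

    def sig(node):
        # node = (label, [children]); hash-like signature of the subtree
        label, kids = node
        return (label, tuple(sig(k) for k in kids))

    def diff(old, new):
        old_sigs = {sig(k) for k in old[1]}
        new_sigs = {sig(k) for k in new[1]}
        ops = [("insert", k[0]) for k in new[1] if sig(k) not in old_sigs]
        ops += [("delete", k[0]) for k in old[1] if sig(k) not in new_sigs]
        return ops

    old = ("doc", [("sec", [("p", [])]), ("sec", [])])
    new = ("doc", [("sec", [("p", [])]), ("fig", [])])
    print(diff(old, new))    # [('insert', 'fig'), ('delete', 'sec')]
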
  • Multistrategy ensemble learning: reducing error by combining ensemble learning techniques

    Publication Year: 2004 , Page(s): 980 - 991
    Cited by:  Papers (22)
    PDF (1656 KB) | HTML

    Ensemble learning strategies, especially boosting and bagging decision trees, have demonstrated an impressive capacity to improve the prediction accuracy of base learning algorithms. Further gains have been demonstrated by strategies that combine simple ensemble formation approaches. We investigate the hypothesis that the improvement in accuracy of multistrategy approaches to ensemble learning is due to an increase in the diversity of the ensemble members that are formed. Guided by this hypothesis, we develop three new multistrategy ensemble learning techniques. Experimental results in a wide variety of natural domains suggest that these multistrategy ensemble learning techniques are, on average, more accurate than their component ensemble learning techniques.

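    One multistrategy combination can be sketched in a few lines, assuming scikit-learn is available: bagging over boosted stumps, in the spirit of the paper's hypothesis that combining strategies raises ensemble diversity. The particular pairing and dataset are stand-ins, not the authors' three techniques.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    multi = BaggingClassifier(AdaBoostClassifier(n_estimators=10),
                              n_estimators=10)      # bagging of boosted stumps
    print(cross_val_score(multi, X, y, cv=5).mean())
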
  • Optimizing top-k selection queries over multimedia repositories

    Publication Year: 2004 , Page(s): 992 - 1009
    Cited by:  Papers (49)
    PDF (736 KB) | HTML

    Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A query on these attributes will typically request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, indicating how well the object matches the selection condition (ranking). Furthermore, unlike in the relational model, users may want just the k top-ranked objects for their selection queries, for a relatively small k. In addition to the differences in the query model, another peculiarity of multimedia repositories is that they may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of top-k selection queries over multimedia repositories. The access characteristics of the repositories and the above query model lead to novel issues in query optimization. In particular, the choice of the indexes used to search the repository strongly influences the cost of processing the filtering condition. We define an execution space that is search-minimal, i.e., in which the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we present an efficient algorithm that solves the problem optimally, with respect to our cost model and execution space, when the predicates in the query are independent. We also show that the problem of optimizing top-k selection queries can be viewed, in many cases, as that of evaluating more traditional selection conditions. Thus, both problems can be viewed together as an extended filtering problem to which techniques of query processing and optimization may be adapted.

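    The sketch below illustrates threshold-style top-k evaluation over per-attribute indexes, a standard technique in this area rather than the paper's optimizer: sorted grade lists are scanned in parallel, and the scan stops once the k-th best aggregate score beats the best possible score of any unseen object. Every object is assumed to appear in every index.

    import heapq

    def topk(lists, k):
        # lists: {attr: [(grade, obj), ...] sorted by grade descending}
        lookup = {a: {o: g for g, o in l} for a, l in lists.items()}
        seen = {}
        for depth in range(min(len(l) for l in lists.values())):
            for l in lists.values():                # sorted access
                g, o = l[depth]
                if o not in seen:                   # random-access other indexes
                    seen[o] = min(lookup[a][o] for a in lists)
            threshold = min(l[depth][0] for l in lists.values())
            best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
            if len(best) >= k and best[-1][1] >= threshold:
                return best                         # no unseen object can do better
        return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

    color = [(0.9, "img1"), (0.8, "img3"), (0.3, "img2")]
    shape = [(0.85, "img3"), (0.7, "img2"), (0.2, "img1")]
    print(topk({"color": color, "shape": shape}, k=1))   # img3 wins under min()
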
  • Automatic control of workflow processes using ECA rules

    Publication Year: 2004 , Page(s): 1010 - 1023
    Cited by:  Papers (35)
    PDF (1896 KB) | HTML

    Changes in recent business environments have created the need for more efficient and effective business process management. A workflow management system is software that assists in defining business processes and automatically controls the execution of those processes. We propose a new approach to the automatic execution of business processes using event-condition-action (ECA) rules that can be automatically triggered by an active database. First, we propose the concept of blocks that classify process flows into several patterns. A block is a minimal unit that can specify the behaviors represented in a process model. An algorithm is developed to detect blocks in a process definition network and transform it into a hierarchical tree model. The behaviors in each block type are modeled using the ACTA formalism, which provides a theoretical basis from which ECA rules are identified. The proposed ECA rule-based approach shows that it is possible to execute a workflow using the active capability of a database without user intervention. The operation of the proposed methods is illustrated through an example process.

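    A minimal ECA-rule sketch, with an invented process and rule encoding: when an activity-completion event arrives, every rule registered for that event whose condition holds fires its action, which here starts a successor activity.

    class ECAEngine:
        def __init__(self):
            self.rules = []                     # (event, condition, action)

        def on(self, event, condition, action):
            self.rules.append((event, condition, action))

        def raise_event(self, event, ctx):
            for ev, cond, act in self.rules:    # fire every matching rule
                if ev == event and cond(ctx):
                    act(ctx)

    engine = ECAEngine()
    engine.on("completed:approve",
              lambda ctx: ctx["amount"] < 10_000,           # condition
              lambda ctx: print("start activity: pay"))     # action
    engine.on("completed:approve",
              lambda ctx: ctx["amount"] >= 10_000,
              lambda ctx: print("start activity: escalate"))
    engine.raise_event("completed:approve", {"amount": 2_500})   # -> pay
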
  • [Advertisement]

    Publication Year: 2004 , Page(s): 1024
    PDF (397 KB)
    Freely Available from IEEE
  • TKDE Information for authors

    Publication Year: 2004 , Page(s): c3
    PDF (77 KB)
    Freely Available from IEEE
  • [Back cover]

    Publication Year: 2004 , Page(s): c4
    PDF (147 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University