
IEEE Transactions on Knowledge and Data Engineering

Issue 10 • October 2005

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • On combining classifier mass functions for text categorization

    Page(s): 1307 - 1319

    Experience shows that different text classification methods can give different results. We look here at a way of combining the results of two or more different classification methods using an evidential approach. The specific methods we have been experimenting with in our group include the support vector machine, k-nearest neighbors (kNN), the kNN model-based approach (kNNM), and the Rocchio method, but the analysis applies to any classification methods. We review these learning methods briefly, and then we describe our method for combining the classifiers. In a previous study, we suggested that the combination could be done using evidential operations and that using only two focal points in the mass functions gives good results. However, there are conditions under which we should choose to use more focal points. We assess some aspects of this choice from a reasoning perspective and suggest a refinement of the approach.

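    As a concrete illustration of the evidential operations involved, the sketch below combines two classifiers' mass functions with Dempster's rule over a small frame of text categories, with each mass function in the two-focal-point form (the predicted class plus the whole frame). The categories and mass values are invented; the paper's refined multi-focal-point construction is not reproduced here.

    ```python
    # Minimal sketch: combining two classifiers' mass functions with
    # Dempster's rule of combination. Focal elements are frozensets of
    # categories; this is a generic illustration, not the paper's method.
    from itertools import product

    def dempster_combine(m1, m2):
        """Combine two mass functions given as {frozenset: mass} dicts."""
        combined, conflict = {}, 0.0
        for (a, wa), (b, wb) in product(m1.items(), m2.items()):
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb        # mass falling on the empty set
        if conflict >= 1.0:
            raise ValueError("total conflict between the classifiers")
        return {s: w / (1.0 - conflict) for s, w in combined.items()}

    # Two focal points each: the predicted class and the whole frame.
    frame = frozenset({"sports", "politics", "tech"})
    m_svm = {frozenset({"sports"}): 0.8, frame: 0.2}
    m_knn = {frozenset({"sports"}): 0.6, frame: 0.4}
    print(dempster_combine(m_svm, m_knn))  # belief concentrates on "sports"
    ```
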
  • Continuous similarity-based queries on streaming time series

    Page(s): 1320 - 1332

    In many applications, local or remote sensors send in streams of data, and the system needs to monitor the streams to discover relevant events and patterns and react to them instantly. An important scenario is one in which the incoming stream is a continually appended time series and the patterns are time series in a database. Each time a new value arrives (called a time position), the system needs to find, from the database, the nearest or near neighbors of the incoming time series up to that time position. This paper attacks the problem by using the fast Fourier transform (FFT) to efficiently compute the cross correlations of time series, which yields, in batch mode, the nearest and near neighbors of the incoming time series at many time positions. To take advantage of this batch processing in achieving fast response times, this paper uses prediction methods to predict future values. When the prediction length is long, FFT is used to compute the cross correlations of the predicted series (together with the values that have already arrived) and the database patterns, yielding predicted distances between the incoming time series at many future time positions and the database patterns. When the prediction length is short, direct computation is used to obtain these predicted distances, avoiding the overhead of FFT. When the actual data value arrives, the prediction error together with the predicted distances is used to filter out patterns that cannot be the nearest or near neighbors, which provides fast responses. Experiments show that with reasonable prediction errors, the performance gain is significant; in particular, when long-term predictions are available, the proposed method can handle incoming data at a very fast streaming rate.

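    The batch step at the heart of this approach can be sketched briefly: FFT-based cross correlation yields the squared Euclidean distance between a database pattern and every alignment of the incoming series in one pass. The data below is synthetic, and the paper's prediction and filtering machinery is omitted.

    ```python
    # Minimal sketch: FFT-based cross correlation gives the distances
    # between a pattern and every alignment of a series in one batch.
    import numpy as np

    def sliding_sq_distances(stream, pattern):
        """Squared Euclidean distance of `pattern` at every window of `stream`."""
        n, m = len(stream), len(pattern)
        # Cross correlation via FFT: linear convolution of the stream with
        # the reversed pattern, kept only at the valid alignments.
        size = 1 << (n + m - 1).bit_length()
        corr = np.fft.irfft(np.fft.rfft(stream, size) *
                            np.fft.rfft(pattern[::-1], size), size)[m - 1:n]
        window_energy = np.convolve(stream ** 2, np.ones(m), mode="valid")
        return window_energy + np.sum(pattern ** 2) - 2.0 * corr

    rng = np.random.default_rng(0)
    stream, pattern = rng.standard_normal(1024), rng.standard_normal(64)
    d2 = sliding_sq_distances(stream, pattern)
    print("nearest alignment:", int(np.argmin(d2)))
    ```
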
  • Using one-class and two-class SVMs for multiclass image annotation

    Page(s): 1333 - 1346

    We propose using one-class, two-class, and multiclass SVMs to annotate images in support of keyword retrieval. Providing automatic annotation requires an accurate mapping of images' low-level perceptual features (e.g., color and texture) to high-level semantic labels (e.g., landscape, architecture, and animals). Much work has been done in this area; however, it lacks the ability to assess the quality of an annotation. In this paper, we propose a confidence-based dynamic ensemble (CDE), which employs a three-level classification scheme. At the base level, CDE uses one-class support vector machines (SVMs) to characterize a confidence factor for ascertaining the correctness of an annotation (or class prediction) made by a binary SVM classifier. The confidence factor is then propagated to the multiclass classifiers at the subsequent levels. CDE uses the confidence factor to make dynamic adjustments to its member classifiers so as to improve class-prediction accuracy, accommodate new semantics, and assist in the discovery of useful low-level features. Our empirical studies on a large real-world data set demonstrate CDE to be very effective.

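    The base-level idea, a one-class SVM acting as a confidence gate on a binary SVM's class prediction, can be sketched with scikit-learn as below. The synthetic features, class names, and the zero threshold on the decision function are illustrative assumptions, not the paper's calibrated confidence factor.

    ```python
    # Minimal sketch: a one-class SVM as a confidence gate for a binary
    # SVM's prediction, loosely following CDE's base level.
    import numpy as np
    from sklearn.svm import SVC, OneClassSVM

    rng = np.random.default_rng(1)
    X_land = rng.normal(0.0, 1.0, (200, 8))   # "landscape" features
    X_arch = rng.normal(2.0, 1.0, (200, 8))   # "architecture" features
    X = np.vstack([X_land, X_arch])
    y = np.array([0] * 200 + [1] * 200)

    binary = SVC().fit(X, y)                  # base binary classifier
    gate = OneClassSVM(nu=0.1).fit(X_land)    # models the "landscape" class

    x = rng.normal(0.0, 1.0, (1, 8))
    pred = binary.predict(x)[0]
    confidence = gate.decision_function(x)[0]  # > 0: inside the class region
    print(pred, "confident" if confidence > 0 else "low confidence")
    ```
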
  • Fast algorithms for frequent itemset mining using FP-trees

    Page(s): 1347 - 1362

    Efficient algorithms for mining frequent itemsets are crucial for mining association rules as well as for many other data mining tasks. Methods for mining frequent itemsets have been implemented using a prefix-tree structure, known as an FP-tree, for storing compressed information about frequent itemsets. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this paper, we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree-based algorithms. Our technique works especially well for sparse data sets. Furthermore, we present new algorithms for mining all, maximal, and closed frequent itemsets. Our algorithms use the FP-tree data structure in combination with the FP-array technique efficiently and incorporate various optimization techniques. We also present experimental results comparing our methods with existing algorithms. The results show that our methods are the fastest in many cases. Even though the algorithms consume more memory when the data sets are sparse, they are still the fastest when the minimum support is low. Moreover, they are always among the fastest algorithms and consume less memory than other methods when the data sets are dense.

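    For orientation, the sketch below builds the basic FP-tree such algorithms operate on: a prefix tree of transactions with infrequent items pruned and the remaining items ordered by descending global frequency. The FP-array technique and the mining of all, maximal, and closed itemsets are beyond this minimal illustration.

    ```python
    # Minimal sketch: constructing an FP-tree (prefix tree of
    # frequency-ordered transactions); the mining steps are omitted.
    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 1, {}

    def build_fp_tree(transactions, min_support):
        counts = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in counts.items() if c >= min_support}
        root = Node(None, None)
        for t in transactions:
            # Keep frequent items, ordered by descending global frequency.
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1
                else:
                    node.children[item] = Node(item, node)
                node = node.children[item]
        return root, counts

    tree, counts = build_fp_tree([["a", "b"], ["b", "c", "d"],
                                  ["a", "b", "d"]], min_support=2)
    print(sorted((i, c) for i, c in counts.items() if c >= 2))
    ```
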
  • An interactive approach to mining gene expression data

    Page(s): 1363 - 1378

    Effective identification of coexpressed genes and coherent patterns in gene expression data is an important task in bioinformatics research and biomedical applications. Several clustering methods have recently been proposed to identify coexpressed genes that share similar coherent patterns. However, there is no objective standard for groups of coexpressed genes, and the interpretation of coexpression heavily depends on domain knowledge. Furthermore, groups of coexpressed genes in gene expression data are often highly connected through a large number of "intermediate" genes, so there may be no clear boundaries separating clusters. Clustering gene expression data thus faces the challenges of satisfying biological domain requirements and addressing the high connectivity of the data sets. In this paper, we propose an interactive framework for exploring coherent patterns in gene expression data. A novel coherent pattern index is proposed to give users a highly confident indication of the existence of coherent patterns. To derive the coherent pattern index and facilitate clustering, we devise an attraction tree structure that summarizes the coherence information among genes in the data set. We present efficient and scalable algorithms for constructing attraction trees and coherent pattern indices from gene expression data sets. Our experimental results show that our approach is effective in mining gene expression data and is scalable for mining large data sets.

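    A toy stand-in for the coherence information an attraction tree summarizes: measure pairwise coherence between expression profiles with Pearson correlation and link each gene to its most coherent neighbor. The paper's coherent pattern index and attraction-tree construction are considerably more involved; the data here is random.

    ```python
    # Minimal sketch: pairwise coherence between gene profiles and a
    # most-coherent-neighbor link per gene (not the paper's algorithm).
    import numpy as np

    rng = np.random.default_rng(2)
    profiles = rng.standard_normal((6, 10))   # 6 genes x 10 conditions
    corr = np.corrcoef(profiles)              # pairwise coherence
    np.fill_diagonal(corr, -np.inf)           # ignore self-correlation
    attractor = corr.argmax(axis=1)           # most coherent neighbor
    for g, a in enumerate(attractor):
        print(f"gene {g} -> gene {a} (coherence {corr[g, a]:.2f})")
    ```
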
  • A data envelopment analysis-based approach for data preprocessing

    Page(s): 1379 - 1388

    In this paper, we show how the data envelopment analysis (DEA) model can be used to screen training data so that a subset of examples satisfying a monotonicity property can be identified. Using real-world health care and software engineering data, a managerial monotonicity assumption, and an artificial neural network (ANN) as the forecasting model, we illustrate that DEA-based screening of training data improves the forecasting accuracy of the ANN.

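    For the mechanics, the sketch below computes an input-oriented CCR DEA efficiency score by linear programming with SciPy; examples scoring near 1 lie on the efficient frontier and would be candidates for the monotonic training subset. This is the generic CCR formulation, not necessarily the exact DEA model or screening rule used in the paper.

    ```python
    # Minimal sketch: input-oriented CCR DEA efficiency via linear
    # programming. Rows of X are inputs, rows of Y outputs, one row per unit.
    import numpy as np
    from scipy.optimize import linprog

    def dea_efficiency(X, Y, o):
        """CCR efficiency of unit `o`: max u.Y[o] s.t. v.X[o]=1, u.Yj <= v.Xj."""
        n, m = X.shape
        s = Y.shape[1]
        # Variables: v (input weights, m entries) then u (output weights, s).
        c = np.concatenate([np.zeros(m), -Y[o]])          # maximize u.Y[o]
        A_ub = np.hstack([-X, Y])                          # u.Yj - v.Xj <= 0
        b_ub = np.zeros(n)
        A_eq = np.concatenate([X[o], np.zeros(s)])[None]   # v.X[o] == 1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (m + s))
        return -res.fun

    X = np.array([[2.0], [4.0], [3.0]])   # one input per unit
    Y = np.array([[4.0], [5.0], [6.0]])   # one output per unit
    print([round(dea_efficiency(X, Y, o), 3) for o in range(3)])
    # -> [1.0, 0.625, 1.0]: units 0 and 2 are on the efficient frontier.
    ```
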
  • A shrinking-based clustering approach for multidimensional data

    Page(s): 1389 - 1403

    Existing data analysis techniques have difficulty handling multidimensional data, which are challenging to analyze because of the inherent sparsity of the points. In this paper, we first present a novel data preprocessing technique called shrinking, which optimizes the inherent distribution characteristics of the data. This data reorganization concept can be applied in many fields, such as pattern recognition, data clustering, and signal processing. Then, as an important application of the data shrinking preprocessing, we propose a shrinking-based approach for multidimensional data analysis that consists of three steps: data shrinking, cluster detection, and cluster evaluation and selection. The process of data shrinking moves data points along the direction of the density gradient, thus generating condensed, widely separated clusters. Following data shrinking, clusters are detected by finding the connected components of dense cells and are evaluated by their compactness. The data-shrinking and cluster-detection steps are conducted on a sequence of grids with different cell sizes. The clusters detected at these scales are compared by a cluster-wise evaluation measure, and the best clusters are selected as the final result. The experimental results show that this approach can effectively and efficiently detect clusters in both low- and high-dimensional spaces.

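    A minimal sketch of the shrinking step, assuming a mean-shift-style approximation: each point repeatedly moves to the mean of its neighbors within a fixed radius, i.e., roughly along the density gradient. The paper's sequence of grids, cluster detection, and cluster evaluation are omitted.

    ```python
    # Minimal sketch: one "shrinking" pass that condenses points toward
    # local density peaks (a mean-shift-style step over nearby points).
    import numpy as np

    def shrink(points, radius, iterations=5):
        pts = points.copy()
        for _ in range(iterations):
            moved = np.empty_like(pts)
            for i, p in enumerate(pts):
                mask = np.linalg.norm(pts - p, axis=1) <= radius
                moved[i] = pts[mask].mean(axis=0)  # step along density gradient
            pts = moved
        return pts

    rng = np.random.default_rng(3)
    cloud = np.vstack([rng.normal(0, 0.3, (50, 2)),
                       rng.normal(3, 0.3, (50, 2))])
    condensed = shrink(cloud, radius=1.0)
    print(condensed[:2], condensed[-2:])  # two tight, well-separated clumps
    ```
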
  • State-space optimization of ETL workflows

    Page(s): 1404 - 1419

    Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and fabricate the state space through a set of correct state transitions. Moreover, we provide an exhaustive algorithm and two heuristic algorithms for minimizing the execution cost of an ETL workflow. The heuristic algorithm with greedy characteristics significantly outperforms the other two for a large set of experimental cases.

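    The greedy variant can be sketched generically: treat a workflow as a state, generate neighbor states through transitions, and move to the cheapest neighbor until no transition improves the cost. The toy workflow, the adjacent-swap transition generator, and the cost model below are illustrative stand-ins, not the paper's transition set or cost formulas.

    ```python
    # Minimal sketch: greedy state-space search over workflow states.
    def greedy_optimize(workflow, transitions, cost):
        state, best = workflow, cost(workflow)
        while True:
            candidates = [(cost(s), s) for s in transitions(state)]
            if not candidates or min(candidates)[0] >= best:
                return state, best             # local minimum reached
            best, state = min(candidates)

    # Toy model: a workflow is a tuple of (activity, selectivity) filters;
    # cheaper plans apply the most selective filters first.
    def swaps(wf):
        for i in range(len(wf) - 1):
            yield wf[:i] + (wf[i + 1], wf[i]) + wf[i + 2:]

    def cost(wf, rows=1000.0):
        total = 0.0
        for _, sel in wf:
            total += rows                      # pay for rows entering the step
            rows *= sel
        return total

    plan = (("dedupe", 0.9), ("filter_nulls", 0.5), ("check_keys", 0.2))
    print(greedy_optimize(plan, swaps, cost))  # most selective step moves first
    ```
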
  • Pattern discovery on Australian medical claim data - a systematic approach

    Page(s): 1420 - 1435

    The national health insurance system in Australia records details of the medical services and claims provided to its population. An effective method for discovering temporal behavioral patterns in this data set is proposed in this paper. The method consists of a two-step approach that is applied recursively to the data set. First, a clustering algorithm is used to segment the data into classes; then, hidden Markov models are employed to find the underlying temporal behavioral patterns. These steps are applied recursively to features extracted from the data set until convergence. The main objective is to minimize the misclassification of patient profiles into the various classes. The result is a hierarchical tree model consisting of a number of classes, each of which groups similar patient temporal behavioral patterns together. The capabilities of the proposed method are demonstrated through its application to a subset of the Australian national health insurance data set. It is shown that the proposed method not only clusters the data into various categories of interest but also automatically marks the periods in which similar temporal behavioral patterns occurred.

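    The second step of the loop, fitting hidden Markov models to uncover temporal regimes, can be sketched on a synthetic claim-count series using the third-party hmmlearn package; a Gaussian HMM is an assumption here, and the clustering step and recursive refinement are omitted.

    ```python
    # Minimal sketch: an HMM fit to a synthetic claim-count sequence; the
    # decoded hidden states mark the periods where behavior changed.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(4)
    # Weekly claim counts: a quiet regime followed by a burst of activity.
    series = np.concatenate([rng.poisson(2, 40), rng.poisson(9, 20)])
    X = series.reshape(-1, 1).astype(float)

    model = GaussianHMM(n_components=2, n_iter=100, random_state=0).fit(X)
    states = model.predict(X)
    print(states)       # the state label flips where the regime changes
    ```
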
  • A formal framework for prefetching based on the type-level access pattern in object-relational DBMSs

    Page(s): 1436 - 1448

    Prefetching is an effective method for minimizing the number of fetches between the client and the server in a database management system. In this paper, we formally define the notion of prefetching and propose the new notions of type-level access locality and the type-level access pattern. Type-level access locality is the phenomenon that repetitive patterns exist in the attributes referenced; a type-level access pattern is a pattern of attributes that are referenced in accessing objects. We then develop an efficient capturing and prefetching policy based on this formal framework. Existing prefetching methods are based on object-level or page-level access patterns, which consist of the object-ids or page-ids of the objects accessed; their drawback is that they work only when exactly the same objects or pages are accessed repeatedly. In contrast, even when the same objects are not accessed repeatedly, our technique effectively prefetches objects if the same attributes are referenced repeatedly, i.e., if there is type-level access locality. Many navigational applications in object-relational database management systems (ORDBMSs) have type-level access locality, so our technique can be employed in ORDBMSs to effectively reduce the number of fetches and thereby significantly enhance performance. We also address issues in implementing the proposed algorithm. We have conducted extensive experiments in a prototype ORDBMS to show the effectiveness of our algorithm. Experimental results using the OO7 benchmark, a real GIS application, and an XML application show that our technique reduces the number of fetches by orders of magnitude and improves the elapsed time by several factors over on-demand fetching and context-based prefetching, a state-of-the-art prefetching method. These results indicate that our approach provides a new paradigm in prefetching that significantly improves the performance of navigational applications and is a practical method that can be implemented in commercial ORDBMSs.

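    The capturing side can be sketched as a small tracker that records which attributes of a type are referenced and then fetches exactly that attribute set for the next object in one round trip. The fake_fetch function and the Employee schema below are hypothetical; the paper's formal capturing and prefetching policies are not reproduced.

    ```python
    # Minimal sketch: recording a type-level access pattern (which
    # attributes of a type get referenced) and prefetching that set.
    from collections import defaultdict

    class AccessTracker:
        def __init__(self):
            self.pattern = defaultdict(set)   # type name -> attributes seen

        def record(self, type_name, attribute):
            self.pattern[type_name].add(attribute)

        def prefetch(self, type_name, object_id, fetch):
            # One round trip for every attribute this type tends to use.
            attrs = sorted(self.pattern[type_name])
            return fetch(object_id, attrs) if attrs else {}

    def fake_fetch(object_id, attrs):         # stands in for a server call
        return {a: f"{a}@{object_id}" for a in attrs}

    tracker = AccessTracker()
    tracker.record("Employee", "name")
    tracker.record("Employee", "salary")
    print(tracker.prefetch("Employee", 42, fake_fetch))
    ```
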
  • Call for Papers

    Page(s): 1449
    Freely Available from IEEE
  • [Advertisement]

    Page(s): 1450
    Freely Available from IEEE
  • TKDE Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University