15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA 2005)

Date: 3–4 April 2005

  • 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications - Cover

    Page(s): c1
  • 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications

  • 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications - Copyright Page

    Page(s): iv
  • 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications - Table of contents

    Page(s): v - vi
  • Preface

    Page(s): vii
  • An efficient algorithm for incremental mining of association rules

    Page(s): 3 - 10

    Incremental algorithms reuse the results of earlier mining to derive the final mining output, which is valuable in many business settings. This study proposes a new algorithm, the New Fast UPdate algorithm (NFUP), for efficiently and incrementally mining association rules from a large transaction database. NFUP is a backward method that requires scanning only the incremental database. Rather than rescanning the original database for frequent itemsets newly generated in the incremental database, we accumulate the occurrence counts of newly generated frequent itemsets and delete obviously infrequent itemsets. Thus, NFUP need not rescan the original database to discover newly generated frequent itemsets. NFUP shows good scalability in our simulations.

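    A minimal sketch of the count-accumulation idea described above, not NFUP itself (the paper's backward, partition-wise scan is omitted): `counts` is assumed to hold itemset occurrence counts from the original database, and only the incremental transactions are scanned. All names are illustrative.

    ```python
    from collections import Counter
    from itertools import combinations

    def incremental_update(counts, old_size, increment, min_support, max_len=3):
        """Scan only the new transactions, accumulate itemset counts, then
        prune itemsets that are infrequent over the combined database."""
        for txn in increment:
            items = sorted(set(txn))
            for k in range(1, max_len + 1):
                for combo in combinations(items, k):
                    counts[frozenset(combo)] += 1
        threshold = min_support * (old_size + len(increment))
        # keep only itemsets still frequent; no rescan of the original data
        return {s: c for s, c in counts.items() if c >= threshold}
    ```
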
  • Online mining (recently) maximal frequent itemsets over data streams

    Page(s): 11 - 18

    A data stream is a massive, open-ended sequence of data elements continuously generated at a rapid rate. Mining data streams is more difficult than mining static databases because of the huge, high-speed, and continuous nature of streaming data. In this paper, we propose a new one-pass algorithm called DSM-MFI (Data Stream Mining for Maximal Frequent Itemsets), which mines the set of all maximal frequent itemsets in landmark windows over data streams. A new summary data structure called the summary frequent itemset forest (SFI-forest) is developed for incrementally maintaining the essential information about the maximal frequent itemsets embedded in the stream so far. Theoretical analysis and experimental studies show that the proposed algorithm is efficient and scalable for mining the set of all maximal frequent itemsets over the entire history of the data stream.

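    Not the DSM-MFI algorithm (the SFI-forest is not reproduced here); a toy one-pass illustration of landmark-window semantics, with counts accumulated since the start of the stream, and of the maximality filter. The exhaustive per-transaction enumeration below is exactly what the SFI-forest is designed to avoid, and all names are illustrative.

    ```python
    from collections import Counter
    from itertools import combinations

    def landmark_maximal(stream, min_support, max_len=3):
        """One pass over the stream so far (landmark window), then keep the
        maximal frequent itemsets."""
        counts, n = Counter(), 0
        for txn in stream:                      # single pass
            n += 1
            items = sorted(set(txn))
            for k in range(1, max_len + 1):
                for combo in combinations(items, k):
                    counts[frozenset(combo)] += 1
        freq = [s for s, c in counts.items() if c >= min_support * n]
        # maximal = frequent and not strictly contained in another frequent itemset
        return [s for s in freq if not any(s < t for t in freq)]
    ```
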
  • A clustering method using an irregular size cell graph

    Page(s): 19 - 26

    In this paper we propose a clustering method called "FlexDice" for large, high-dimensional datasets. FlexDice uses a graph structure that shares a few features with the Quadtree but differs in crucial ways, the most important being that the Quadtree is a tree while the FlexDice structure is a graph. The Quadtree's construction algorithm forms cells hierarchically by dividing the data object space top-down, so searching for or indexing data objects requires traversing from the root to each leaf. FlexDice requires no tree structure because such traversals are unnecessary; instead of choosing a hyper-dividing plane, the clustering method merges relevant cells that contain similar data objects. Hence, after dividing cells, FlexDice creates neighboring links among relevant cells in every layer and merges cells containing similar data objects. To reduce memory usage, FlexDice dynamically removes worthless cells, keeping only worthwhile cells that contain data objects plus the parent cells needed for creating the neighboring links of worthwhile cells; once those links are created, the parent cells are removed from memory. We contrast the data structure used in FlexDice with the Quadtree and show that the FlexDice structure is well suited to clustering.

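    A toy flat-grid version of the "merge neighboring cells instead of choosing dividing planes" idea, with hypothetical names; FlexDice itself divides cells hierarchically into irregular sizes and prunes worthless cells, which this sketch omits.

    ```python
    from collections import defaultdict, deque

    def grid_cluster(points, cell_size):
        """Hash points into fixed-size cells, link face-adjacent non-empty
        cells, and merge linked cells into clusters via BFS."""
        cells = defaultdict(list)
        for p in points:
            cells[tuple(int(x // cell_size) for x in p)].append(p)

        def neighbors(key):
            for dim in range(len(key)):
                for d in (-1, 1):
                    nb = list(key); nb[dim] += d
                    yield tuple(nb)

        seen, clusters = set(), []
        for start in cells:
            if start in seen:
                continue
            comp, queue = [], deque([start])
            seen.add(start)
            while queue:                      # flood-fill over linked cells
                k = queue.popleft()
                comp.extend(cells[k])
                for nb in neighbors(k):
                    if nb in cells and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
            clusters.append(comp)
        return clusters
    ```
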
  • Using probabilistic latent semantic analysis for Web page grouping

    Page(s): 29 - 36

    The locality of Web pages within a Web site is initially determined by the designer's expectations. Web usage mining can discover patterns in the navigational behaviour of Web visitors and, in turn, improve Web site functionality and service design by taking users' actual behaviour into account. Conventional Web page clustering techniques are often used to reveal the functional similarity of Web pages, but a high-dimensional computation problem arises because user transactions are taken as dimensions. In this paper, we propose a new Web page grouping approach based on a probabilistic latent semantic analysis (PLSA) model. An iterative algorithm based on the maximum likelihood principle is employed to overcome this computational shortcoming. Web pages are classified into groups according to user access patterns, while the latent semantic factors, or tasks, are characterized by extracting the content of the "dominant" pages related to each factor. We demonstrate the effectiveness of our approach with experiments on real-world data sets.

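    In the standard PLSA formulation, the iterative maximum-likelihood algorithm the abstract refers to is an EM procedure; below is a generic EM sketch over a page-by-session count matrix. Names and shapes are assumptions, not the paper's code, and the dense posterior tensor is only practical for small matrices.

    ```python
    import numpy as np

    def plsa(N, k, iters=50, seed=0):
        """EM for PLSA on a page-by-session count matrix N (pages x sessions).
        Returns P(z), P(page|z), P(session|z)."""
        rng = np.random.default_rng(seed)
        n_pages, n_sessions = N.shape
        Pz = np.full(k, 1.0 / k)
        Pp_z = rng.random((n_pages, k));    Pp_z /= Pp_z.sum(axis=0)
        Ps_z = rng.random((n_sessions, k)); Ps_z /= Ps_z.sum(axis=0)
        for _ in range(iters):
            # E-step: posterior P(z | page, session), shape (pages, sessions, k)
            joint = Pz[None, None, :] * Pp_z[:, None, :] * Ps_z[None, :, :]
            post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
            # M-step: re-estimate parameters from expected counts
            w = N[:, :, None] * post
            Pz = w.sum(axis=(0, 1)); Pz /= Pz.sum()
            Pp_z = w.sum(axis=1); Pp_z /= Pp_z.sum(axis=0, keepdims=True)
            Ps_z = w.sum(axis=0); Ps_z /= Ps_z.sum(axis=0, keepdims=True)
        return Pz, Pp_z, Ps_z
    ```

    Pages can then be grouped by their dominant factor, i.e. the z maximizing Pz[z] * Pp_z[p, z] for each page p.
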
  • Maintaining knowledge-bases of navigational patterns from streams of navigational sequences

    Page(s): 37 - 44

    In this paper we explore an alternative design goal for navigational pattern discovery in stream environments. Instead of mining based on thresholds and returning the patterns that satisfy the specified threshold(s), we propose to mine without thresholds and return all identified patterns along with their support counts in a single pass. We use a sliding window to capture recent navigational sequences and propose a batch-update strategy for maintaining the patterns within the window. The batch-update strategy depends on the ability to efficiently mine navigational patterns without support thresholds; to achieve this, we have designed an efficient algorithm for mining contiguous navigational patterns without support thresholds. Our experiments show that this algorithm outperforms existing techniques for mining contiguous navigational patterns, and that the batch-update strategy achieves considerable speed-ups over the existing window-update strategy, which recomputes all patterns for each new window.

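    A small sketch of threshold-free contiguous-pattern counting and the batch window update (subtract the expired sequences' patterns, add the arrived ones, instead of recomputing the whole window); a plausible reading of the abstract, not the paper's algorithm, with illustrative names.

    ```python
    from collections import Counter

    def contiguous_patterns(seq, max_len=None):
        """All contiguous subsequences of a navigational sequence, with counts."""
        n = len(seq)
        max_len = max_len or n
        c = Counter()
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                c[tuple(seq[i:j])] += 1
        return c

    def slide_window(counts, expired, arrived, max_len=None):
        """Batch-update the window's pattern counts: no support threshold,
        every pattern keeps its exact count."""
        for seq in expired:
            counts.subtract(contiguous_patterns(seq, max_len))
        for seq in arrived:
            counts.update(contiguous_patterns(seq, max_len))
        counts += Counter()   # drop entries whose count fell to zero
        return counts
    ```
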
  • Data mining approaches to software fault diagnosis

    Page(s): 45 - 52

    Automatic identification of software faults has enormous practical significance. It requires characterizing program execution behavior and applying appropriate data mining techniques to the chosen representation. In this paper we use the sequence of system calls to characterize program execution. The data mining tasks addressed are learning to map system call streams to fault labels and automatically identifying fault causes. Spectrum kernels and SVMs are used for the former, and latent semantic analysis for the latter. The techniques are demonstrated on an intrusion dataset containing system call traces. The results show that the kernel techniques are as accurate as the best available results but faster by orders of magnitude. We also show that latent semantic indexing can reveal fault-specific features.

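    The spectrum kernel on system-call traces has a standard definition, the dot product of k-gram count vectors; a minimal version is sketched below (the SVM training and LSA steps are omitted).

    ```python
    from collections import Counter

    def spectrum_kernel(trace_a, trace_b, k=3):
        """k-spectrum kernel: dot product of the k-gram count vectors of two
        system-call traces (sequences of call names or numbers)."""
        def grams(trace):
            return Counter(tuple(trace[i:i + k]) for i in range(len(trace) - k + 1))
        ca, cb = grams(trace_a), grams(trace_b)
        return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    ```

    A Gram matrix of such values can be fed to any SVM implementation that accepts precomputed kernels, e.g. sklearn.svm.SVC(kernel='precomputed').
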
  • Handling nominal features in anomaly intrusion detection problems

    Page(s): 55 - 62

    Computer network data streams used in intrusion detection usually involve many data types. A common type is symbolic or nominal features. Whether or not they are coded into numerical values, nominal features need to be treated differently from numeric features. This paper studies the effectiveness of two approaches to handling nominal features: a simple coding scheme using indicator variables and a scaling method based on multiple correspondence analysis (MCA). In particular, we apply the techniques with two anomaly detection methods: the principal component classifier (PCC) and the Canberra metric. Experiments with the KDD 1999 data demonstrate that MCA works better than the indicator-variable approach for both detection methods, with the PCC well ahead of the Canberra metric.

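    A sketch of the simpler of the two schemes compared, indicator-variable (one-hot) coding of nominal columns; the MCA scaling step (a weighted SVD of this indicator matrix) is not reproduced, and the names are illustrative.

    ```python
    def indicator_encode(rows, nominal_cols):
        """One-hot coding: each nominal column expands into one 0/1 column
        per observed category; numeric columns pass through unchanged."""
        cats = {c: sorted({row[c] for row in rows}) for c in nominal_cols}
        encoded = []
        for row in rows:
            vec = []
            for i, v in enumerate(row):
                if i in cats:
                    vec.extend(1.0 if v == cat else 0.0 for cat in cats[i])
                else:
                    vec.append(float(v))
            encoded.append(vec)
        return encoded, cats

    # e.g. indicator_encode([(0.1, "tcp"), (0.4, "udp")], nominal_cols={1})
    ```
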
  • Time-decaying Bloom Filters for data streams with skewed distributions

    Page(s): 63 - 69

    Bloom Filters are space-efficient data structures for membership queries over sets. To enable queries for the multiplicities of multi-sets, the bitmap in a Bloom Filter is replaced by an array of counters whose values are incremented on each occurrence. In a data stream model, however, data items arrive at varying rates, and recent occurrences are often regarded as more significant than past ones; handling this time-sensitivity is critical in most data stream applications. Furthermore, data streams with skewed distributions are common in many emerging applications, e.g., traffic engineering and billing, intrusion detection, trading surveillance, and outlier detection, and for such applications it is inefficient to allocate counters of uniform size to all buckets. In this paper, we present the Time-decaying Bloom Filter (TBF), a Bloom Filter that maintains the frequency count of each item in a data stream, with the value of each counter decaying over time. For data streams with highly skewed distributions, we propose a further optimization that dynamically allocates free counters to the "large" items. We performed preliminary experiments to verify the optimization.

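    A minimal sketch of a counting Bloom filter with lazily applied exponential decay, one plausible reading of "counters that decay with time"; the paper's TBF, including its dynamic reallocation of counters to large items, is not reproduced.

    ```python
    import hashlib
    import time

    class TimeDecayingBF:
        """Counting Bloom filter whose counters decay exponentially: each
        bucket stores (count, last_update) and is discounted by
        decay**elapsed_seconds before use."""
        def __init__(self, m=1024, k=4, decay=0.99):
            self.m, self.k, self.decay = m, k, decay
            self.buckets = [(0.0, time.time())] * m

        def _hashes(self, item):
            for i in range(self.k):
                h = hashlib.sha1(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def _decayed(self, idx, now):
            count, t = self.buckets[idx]
            return count * (self.decay ** (now - t))

        def add(self, item):
            now = time.time()
            for idx in self._hashes(item):
                self.buckets[idx] = (self._decayed(idx, now) + 1.0, now)

        def estimate(self, item):
            now = time.time()
            return min(self._decayed(idx, now) for idx in self._hashes(item))
    ```
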
  • New estimation methods of Count-Min sketch

    Page(s): 73 - 80

    The Count-Min sketch is an efficient approximate query tool for data streams. In this paper we address how to further improve its point query performance. First, we modify the estimation method under the cash register model; our method relieves error propagation. Second, we find a better method under the turnstile model and prove that it is more efficient than that of the original Count-Min sketch. These conclusions are well supported by experimental results.

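    For reference, a basic Count-Min sketch under the cash register model (increments only) with the classic min-estimator that the paper sets out to improve; the paper's modified estimators are not reproduced here.

    ```python
    import hashlib

    class CountMin:
        """Width x depth counter table; each row hashes an item to one counter."""
        def __init__(self, width=2048, depth=5):
            self.w, self.d = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _idx(self, item, row):
            h = hashlib.md5(f"{row}:{item}".encode()).digest()
            return int.from_bytes(h[:8], "big") % self.w

        def update(self, item, count=1):
            for r in range(self.d):
                self.table[r][self._idx(item, r)] += count

        def query(self, item):
            # classic point estimate: never underestimates non-negative counts
            return min(self.table[r][self._idx(item, r)] for r in range(self.d))
    ```
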
  • CDS-Tree: an effective index for clustering arbitrary shapes in data streams

    Page(s): 81 - 88

    Finding clusters of arbitrary shapes in data streams is challenging for advanced applications. An effective approach to clustering arbitrary shapes is space-partition-based clustering. However, it cannot be applied directly to data stream clustering: it requires a large amount of memory while data stream processing imposes strict memory limits, it is inefficient for high-dimensional data at fine granularity, and its fixed partition granularity cannot adapt to changes in the data distribution of a stream. We therefore propose a novel index structure, the CDS-Tree, and design an improved space-partition-based clustering algorithm that aims to cluster arbitrary shapes in high-dimensional stream data with high accuracy. The CDS-Tree stores only non-empty cells and keeps the positional relationships among cells, so its compact structure uses little memory and achieves high efficiency. Moreover, we propose a novel measure of data skew, the Data Skew Factor (DSF), used to automatically adjust the partition granularity as the data stream changes, so the algorithm attains high analysis accuracy within limited memory. Experimental results on real and synthetic datasets show that the algorithm achieves higher clustering accuracy, and better scalability with window size and data dimensionality, than other typical algorithms.

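    A toy stand-in for the idea of indexing only non-empty cells while preserving their positional relationships, here a sorted list of cell-coordinate keys with logarithmic membership tests for neighbor lookups; the actual CDS-Tree layout and the DSF-driven granularity adjustment are not reproduced.

    ```python
    import bisect

    class SparseCellIndex:
        """Sorted index over only the non-empty grid cells, keyed by the
        cell's coordinate tuple (a balanced tree would also make inserts
        logarithmic; the sorted list keeps the sketch short)."""
        def __init__(self):
            self.keys = []          # sorted list of cell-coordinate tuples

        def insert(self, cell):
            i = bisect.bisect_left(self.keys, cell)
            if i == len(self.keys) or self.keys[i] != cell:
                self.keys.insert(i, cell)

        def contains(self, cell):
            i = bisect.bisect_left(self.keys, cell)
            return i < len(self.keys) and self.keys[i] == cell

        def neighbors(self, cell):
            """Non-empty cells face-adjacent to `cell`, for cluster merging."""
            for dim in range(len(cell)):
                for d in (-1, 1):
                    nb = list(cell); nb[dim] += d
                    nb = tuple(nb)
                    if self.contains(nb):
                        yield nb
    ```
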
  • Author index

    Page(s): 89