Proceedings 2001 IEEE International Conference on Data Mining

29 Nov.-2 Dec. 2001

  • Proceedings 2001 IEEE International Conference on Data Mining

    Publication Year: 2001
    Freely Available from IEEE
  • On effective conceptual indexing and similarity search in text data

    Publication Year: 2001, Page(s):3 - 10
    Cited by:  Papers (12)

    Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods used in search engines in order to retrieve documents most similar to user-defined sets of keywords are not applicable to targets which are medium to large size documents, because of even greater noise effects, stem...

  • Comparisons of classification methods for screening potential compounds

    Publication Year: 2001, Page(s):11 - 18

    We compare a number of data mining and statistical methods on the drug design problem of modeling molecular structure-activity relationships. The relationships can be used to identify active compounds based on their chemical structures from a large inventory of chemical compounds. The data set of this application has a highly skewed class distribution, in which only 2% of the compounds are conside...

  • Knowledge discovery from diagrammatically represented data

    Publication Year: 2001, Page(s):19 - 26

    Knowledge discovery from diagrammatic data can be facilitated by a language that permits queries on such data. Such a language (Diagrammatic SQL) is being developed to expedite the development of an autonomous artificial intelligent agent with a capacity to deal with diagrammatic information. This language is described and examples of how it can be used to facilitate diagrammatic data mining are d...

  • Integrating e-commerce and data mining: architecture and challenges

    Publication Year: 2001, Page(s):27 - 34
    Cited by:  Papers (20)  |  Patents (10)

    We show that the e-commerce domain can provide all the right ingredients for successful data mining. We describe an integrated architecture for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the app...

  • Classification with degree of membership: a fuzzy approach

    Publication Year: 2001, Page(s):35 - 42
    Cited by:  Papers (5)

    Classification is an important topic in data mining research. It is concerned with the prediction of the values of some attribute in a database based on other attributes. To tackle this problem, most of the existing data mining algorithms adopt either a decision tree based approach or an approach that requires users to provide some user-specified thresholds to guide the search for interesting rule...

  • Provably fast training algorithms for support vector machines

    Publication Year: 2001, Page(s):43 - 50
    Cited by:  Papers (5)

    Support vector machines are a family of data analysis algorithms based on convex quadratic programming. We focus on their use for classification: in that case, the SVM algorithms work by maximizing the margin of a classifying hyperplane in a feature space. The feature space is handled by means of kernels if the problems are formulated in dual form. Random sampling techniques successfully used for ...

  • Who links to whom: mining linkage between Web sites

    Publication Year: 2001, Page(s):51 - 58
    Cited by:  Papers (40)  |  Patents (11)

    Previous studies of the Web graph structure have focused on the graph structure at the level of individual pages. In actuality the Web is a hierarchically nested graph, with domains, hosts and Web sites introducing intermediate levels of affiliation and administrative control. To better understand the growth of the Web we need to understand its macro-structure, in terms of the linkage between Web ...

  • Better rules, fewer features: a semantic approach to selecting features from text

    Publication Year: 2001, Page(s):59 - 66
    Cited by:  Papers (20)

    The choice of features used to represent a domain has a profound effect on the quality of the model produced; yet, few researchers have investigated the relationship between the features used to represent text and the quality of the final model. We explored this relationship for medical texts by comparing association rules based on features with three different semantic levels: (1) words (2) manua...

  • Significance tests for patterns in continuous data

    Publication Year: 2001, Page(s):67 - 74
    Cited by:  Papers (1)

    The authors consider the question of uncertainty of detected patterns in data mining. In particular, we develop statistical tests for patterns found in continuous data, indicating the significance of these patterns in terms of the probability that they have occurred by chance. We examine the performance of these tests on patterns detected in several large data sets, including a data set describing...

  • Distributed Web mining using Bayesian networks from multiple data streams

    Publication Year: 2001, Page(s):75 - 82
    Cited by:  Papers (7)  |  Patents (2)

    We present a collective approach to mining Bayesian networks from distributed heterogeneous Web-log data streams. In this approach we first learn a local Bayesian network at each site using the local data. Then each site identifies the observations that are most likely to be evidence of coupling between local and non-local variables and transmits a subset of these observations to a central site. An...

  • A hypergraph based clustering algorithm for spatial data sets

    Publication Year: 2001, Page(s):83 - 90
    Cited by:  Papers (4)

    Clustering is a discovery process in data mining and can be used to group together the objects of a database into meaningful subclasses which serve as the foundation for other data analysis techniques. The authors focus on dealing with a set of spatial data. For the spatial data, the clustering problem becomes that of finding the densely populated regions of the space and thus grouping these regio...

  • Efficient determination of dynamic split points in a decision tree

    Publication Year: 2001, Page(s):91 - 98
    Cited by:  Papers (5)  |  Patents (6)

    We consider the problem of choosing split points for continuous predictor variables in a decision tree. Previous approaches to this problem typically either: (1) discretize the continuous predictor values prior to learning, or (2) apply a dynamic method that considers all possible split points for each potential split. We describe a number of alternative approaches that generate a small number of ...

  • Efficient yet accurate clustering

    Publication Year: 2001, Page(s):99 - 106

    The authors show that most hierarchical agglomerative clustering (HAC) algorithms follow a 90-10 rule where roughly 90% of iterations from the beginning merge cluster pairs with dissimilarity less than 10% of the maximum dissimilarity. We propose two algorithms: 2-phase and nested, based on partially overlapping partitioning (POP). To handle high-dimensional data efficiently, we propose a tree struct...

  • A min-max cut algorithm for graph partitioning and data clustering

    Publication Year: 2001, Page(s):107 - 114
    Cited by:  Papers (240)  |  Patents (14)

    An important application of graph partitioning is data clustering using a graph model - the pairwise similarities between all data objects form a weighted graph adjacency matrix that contains all necessary information for clustering. In this paper, we propose a new algorithm for graph partitioning with an objective function that follows the min-max clustering principle. The relaxed version of the ...

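The min-max clustering principle mentioned in the abstract above (minimize similarity between clusters, maximize similarity within them) can be illustrated with a small sketch. The truncated abstract does not give the objective itself, so the form used here, cut(A,B)/W(A) + cut(A,B)/W(B), is an assumption based on how the min-max cut objective is commonly stated. The function name `mcut_objective` and the example matrix are hypothetical, and the sketch only evaluates a given two-way partition; it does not implement the paper's partitioning algorithm.

```python
def mcut_objective(weights, side):
    """Evaluate an assumed min-max cut objective for a 2-way partition.

    weights: symmetric similarity matrix as a list of lists.
    side:    side[i] is True if node i belongs to A, False if it belongs to B.
    Raises ZeroDivisionError if either side has no internal similarity.
    """
    n = len(weights)
    cut = within_a = within_b = 0.0
    for i in range(n):
        for j in range(n):
            if side[i] and side[j]:
                within_a += weights[i][j]   # similarity inside A
            elif not side[i] and not side[j]:
                within_b += weights[i][j]   # similarity inside B
            elif side[i] and not side[j]:
                cut += weights[i][j]        # similarity from A to B
    return cut / within_a + cut / within_b


# Hypothetical 4-node similarity graph with two obvious groups: {0, 1} and {2, 3}.
weights = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]
good = mcut_objective(weights, [True, True, False, False])
bad = mcut_objective(weights, [True, False, True, False])
```

Under this assumed objective, the partition that keeps the strongly similar pairs together scores lower than the one that separates them.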
  • Preprocessing opportunities in optimal numerical range partitioning

    Publication Year: 2001, Page(s):115 - 122

    We show that only segment borders have to be taken into account as cut point candidates when searching for the optimal multisplit of a numerical value range with respect to convex attribute evaluation functions. Segment borders can be found efficiently in a linear-time preprocessing step. With training set error, which is not strictly convex, the data can be preprocessed into an even smaller numbe...

  • Using artificial anomalies to detect unknown and known network intrusions

    Publication Year: 2001, Page(s):123 - 130
    Cited by:  Papers (16)  |  Patents (2)

    Intrusion detection systems (IDSs) must be capable of detecting new and unknown attacks, or anomalies. We study the problem of building detection models for both pure anomaly detection and combined misuse and anomaly detection (i.e., detection of both known and unknown intrusions). We propose an algorithm to generate artificial anomalies to coerce the inductive learner into discovering an accurate...

  • Using rule sets to maximize ROC performance

    Publication Year: 2001, Page(s):131 - 138
    Cited by:  Papers (35)

    Rules are commonly used for classification because they are modular, intelligible, and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy: when class distributions are skewed or error costs are unequal, an...

  • A synchronization based algorithm for discovering ellipsoidal clusters in large datasets

    Publication Year: 2001, Page(s):139 - 146
    Cited by:  Papers (10)

    This paper introduces a new scalable approach to clustering based on the synchronization of pulse-coupled oscillators. Each data point is represented by an integrate-and-fire oscillator and the interaction between oscillators is defined according to the relative similarity between the points. The set of oscillators self-organizes into stable phase-locked subgroups. Our approach proceeds by loading...

  • Functional trees for classification

    Publication Year: 2001, Page(s):147 - 154
    Cited by:  Papers (4)

    The design of algorithms that explore multiple representation languages and explore different search spaces has an intuitive appeal. In the context of classification problems, algorithms that generate multivariate trees are able to explore multiple representation languages by using decision tests based on a combination of attributes. The same applies to model-tree algorithms in regression domains,...

  • A tight upper bound on the number of candidate patterns

    Publication Year: 2001, Page(s):155 - 162
    Cited by:  Papers (12)

    In the context of mining for frequent patterns using the standard level-wise algorithm, the following question arises: given the current level and the current set of frequent patterns, what is the maximal number of candidate patterns that can be generated on the next level? We answer this question by providing a tight upper bound, derived from a combinatorial result by J. Kruskal (1963) and G. Kat...

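To make the quantity being bounded in the abstract above concrete, the following is a minimal sketch of next-level candidate generation in a level-wise (Apriori-style) miner over itemsets: a (k+1)-itemset is a candidate only if every one of its k-subsets is frequent. It does not compute the paper's upper bound, which the truncated abstract does not state. The function name `next_level_candidates` and the example data are hypothetical.

```python
from itertools import combinations


def next_level_candidates(frequent):
    """Return the (k+1)-item candidates implied by a set of frequent k-itemsets."""
    frequent = {frozenset(s) for s in frequent}
    if not frequent:
        return set()
    k = len(next(iter(frequent)))  # all itemsets are assumed to have size k
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        # Join step: the two itemsets must differ in exactly one item.
        if len(union) != k + 1:
            continue
        # Prune step: every k-subset of the candidate must itself be frequent.
        if all(frozenset(sub) in frequent for sub in combinations(union, k)):
            candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets at the current level.
level2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
candidates = next_level_candidates(level2)
```

Here only {a, b, c} survives: {a, b, d} is pruned because {a, d} is not frequent, and {b, c, d} is pruned because {c, d} is not frequent.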
  • Efficiently mining maximal frequent itemsets

    Publication Year: 2001, Page(s):163 - 170
    Cited by:  Papers (125)  |  Patents (2)

    We present GenMax, a backtracking search based algorithm for mining maximal frequent itemsets. GenMax uses a number of optimizations to prune the search space. It uses a novel technique called progressive focusing to perform maximality checking, and diffset propagation to perform fast frequency computation. Systematic experimental comparison with previous work indicates that different methods have...

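As a point of reference for the abstract above, a maximal frequent itemset is a frequent itemset with no frequent proper superset. The sketch below computes that definition by brute-force enumeration; it is not GenMax and uses none of its techniques (backtracking search, progressive focusing, diffset propagation), and it is only practical for tiny inputs. The function name `maximal_frequent_itemsets`, the absolute-count support threshold, and the example transactions are assumptions made for illustration.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force maximal frequent itemsets; min_support is an absolute count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            # Support = number of transactions containing the whole itemset.
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                frequent.append(itemset)
    # Keep only itemsets that have no frequent proper superset.
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical transaction database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
maximal = maximal_frequent_itemsets(transactions, min_support=2)
```

With a support threshold of 2, the frequent itemsets are {a}, {b}, {c}, {a, b} and {a, c}, of which only {a, b} and {a, c} are maximal.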
  • The DIAsDEM framework for converting domain-specific texts into XML documents with data mining techniques

    Publication Year: 2001, Page(s):171 - 178
    Cited by:  Papers (2)  |  Patents (1)

    Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and able to be queried. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to th...

  • A scalable algorithm for clustering sequential data

    Publication Year: 2001, Page(s):179 - 186
    Cited by:  Papers (31)

    In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and Web-logs have an inherent sequential nature. Clustering of such data sets is useful for various purposes. For example, clustering of sequences from commercial data sets may help marketers identify diffe...

  • Clustering validity assessment: finding the optimal partitioning of a data set

    Publication Year: 2001, Page(s):187 - 194
    Cited by:  Papers (80)

    Clustering is a mostly unsupervised procedure and the majority of clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation regarding its validity. In this paper we present a clustering validity procedure, which evaluates the results of clusterin...
