Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium on

Date: 11-15 April 2011

Displaying Results 1 - 25 of 53
  • [Front cover]

    Page(s): c1
  • [Copyright notice]

    Page(s): ii
  • Table of contents

    Page(s): iii - x
  • Symposium on Computational Intelligence and Data Mining (IEEE CIDM 2011)

    Page(s): xi - xii
  • Multiple query-dependent RankSVM aggregation for document retrieval

    Page(s): 1 - 6

This paper is concerned with supervised rank aggregation, which aims to improve ranking performance by combining the outputs of multiple rankers. However, previous rank aggregation approaches have two main shortcomings. First, the learned weights for base rankers do not distinguish among queries. This is suboptimal, since queries vary significantly in terms of ranking. Second, most current aggregation functions are unsupervised; a supervised aggregation function could further improve ranking performance. In this paper, the significant differences among queries are taken into consideration and a supervised rank aggregation approach is proposed. As a case study, we employ the RankSVM model to aggregate the base rankers, referred to as Q.D.RSVM, and prove that Q.D.RSVM can set up query-dependent weights for different base rankers. Experimental results on benchmark datasets show that our approach outperforms conventional ranking approaches.

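The core idea of query-dependent aggregation, combining base rankers' scores with weights that vary per query, can be sketched as follows. This is an illustrative sketch only; the function name and the linear combination form are assumptions, not the paper's Q.D.RSVM formulation.

```python
def aggregate_scores(base_scores, query_weights):
    """Combine per-ranker document scores with query-dependent weights.

    base_scores: list of score lists, one per base ranker, for one query's documents
    query_weights: one weight per base ranker, chosen for this query
    """
    n_docs = len(base_scores[0])
    return [sum(w * scores[i] for w, scores in zip(query_weights, base_scores))
            for i in range(n_docs)]

# Two base rankers score three documents; this query trusts ranker 1 more.
combined = aggregate_scores([[0.9, 0.2, 0.5], [0.1, 0.8, 0.4]], [0.8, 0.2])
```

With a second query, a different weight vector would be passed in, which is exactly the per-query flexibility that a single global weight set cannot express.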
  • Active learning using the data distribution for interactive image classification and retrieval

    Page(s): 7 - 14

In the context of image search and classification, we describe an active learning strategy that relies on the intrinsic data distribution, modeled as a mixture of Gaussians, to speed up the learning of the target class through an interactive relevance feedback process. The contributions of our work are twofold. First, we introduce a new form of semi-supervised C-SVM algorithm that exploits the intrinsic data distribution by working directly on equiprobable envelopes of Gaussian mixture components. Second, we introduce an active learning strategy that allows the equiprobable envelopes to be adjusted interactively in a small number of feedback steps. The proposed method exploits the information contained in the unlabeled data and does not suffer from the drawbacks inherent to semi-supervised methods, e.g. computation time and memory requirements. Tests performed on a database of high-resolution satellite images and on a database of color images show that our system compares favorably, in terms of learning speed and ability to manage large volumes of data, with the classic approach using SVM active learning.

  • Geodesic distances for web document clustering

    Page(s): 15 - 21

While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.

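As a rough illustration of the kind of measure being combined here, a geodesic (great-circle) distance on a sphere of constant curvature K can be derived from the cosine similarity of two term vectors. How the paper folds its per-document curvature estimates into the formula is not reproduced here, so treat this as a hedged sketch under the constant-curvature assumption:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def geodesic_distance(u, v, curvature=1.0):
    # On a sphere of constant curvature K > 0, the geodesic distance between
    # two directions is the arc length: arccos(cosine similarity) / sqrt(K).
    sim = np.clip(cosine_similarity(u, v), -1.0, 1.0)
    return float(np.arccos(sim) / np.sqrt(curvature))
```

Identical directions give distance 0 and orthogonal term vectors give pi/2 when K = 1; larger curvature values shrink all distances uniformly.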
  • Distinguishing defined concepts from prerequisite concepts in learning resources

    Page(s): 22 - 29

The objective of any tutoring system is to provide meaningful learning to the learner; hence it is important to know whether a concept mentioned in a document is a prerequisite for studying that document or whether it can be learned from it. In this paper, we study the problem of identifying defined concepts and prerequisite concepts in learning resources available on the web. Statistics and machine learning tools are exploited in order to predict the class of each concept. Two groups of features are constructed to categorize the concepts: contextual features, which capture linguistic information, and local features, which capture concept properties such as font size and font weight. An aggregation method is proposed as a solution to the problem of multiple occurrences of a defined concept in a document. This paper shows that better results are obtained with the SVM classifier than with other classifiers.

  • Feature extraction for multi-label learning in the domain of email classification

    Page(s): 30 - 36

Multi-label learning is a very interesting field in Machine Learning. It generalises standard methods and evaluation procedures and tackles challenging real problems where one example can be tagged with more than one label. In this paper we study the performance of different multi-label methods in combination with standard single-label algorithms, using several specific multi-label metrics. We want to show how a good preprocessing phase can improve the performance of such methods and algorithms. As we will explain, its main advantage is a shorter time to induce the models while keeping (or even improving) other classification quality measures. We use the GNUsmail framework to preprocess an existing and extensively used dataset, obtaining a reduced feature space that preserves the relevant information and allows performance improvements. Thanks to the capabilities of GNUsmail, the preprocessing step can easily be applied to different email datasets.

  • On the use of decision trees for ICU outcome prediction in sepsis patients treated with statins

    Page(s): 37 - 43

Sepsis is one of the main causes of death for noncoronary ICU (Intensive Care Unit) patients and has become the tenth most common cause of death in western societies. It is a transversal condition affecting immunocompromised patients, critically ill patients, post-surgery patients, patients with AIDS, and the elderly. In western countries, septic patients account for as much as 25% of ICU bed utilization, and the pathology affects 1% - 2% of all hospitalizations. Its mortality rates range from 12.8% for sepsis to 45.7% for septic shock. Early administration of antibiotics is known to be crucial for ICU outcomes. In this regard, statins, a class of drugs, have been shown to present good anti-inflammatory properties beyond their regulation of the biosynthesis of cholesterol. In this brief paper, we hypothesize that preadmission use of statins improves ICU outcomes. We test this hypothesis in a prospective study of patients admitted with severe sepsis and multiorgan failure to the ICU of Vall d'Hebron University Hospital (Barcelona, Spain), using statistical algebraic models and regression trees.

  • Adaptive treatment of anemia on hemodialysis patients: A reinforcement learning approach

    Page(s): 44 - 49

The aim of this work is to study the applicability of reinforcement learning methods to the design of adaptive treatment strategies that optimize, in the long term, the dosage of erythropoiesis-stimulating agents (ESAs) in the management of anemia in patients undergoing hemodialysis. Adaptive treatment strategies are emerging as a new paradigm for the treatment and long-term management of chronic disease. Reinforcement Learning (RL) can be useful for extracting such strategies from clinical data, taking delayed effects into account and without requiring any mathematical model. In this work, we focus on the so-called Fitted Q Iteration algorithm, an RL approach that uses the data very efficiently. The results achieved show the suitability of the proposed RL policies, which can improve on the treatment currently followed in the clinics. The methodology can easily be extended to other drug dosage optimization problems.

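For readers unfamiliar with Fitted Q Iteration, the essential loop repeatedly builds regression targets from one-step rewards plus a discounted maximum over next-state values, then fits a supervised model to them. In this sketch a lookup table stands in for the regressor (real FQI fits a model such as a tree ensemble at each iteration), and the transition tuple format is our own assumption:

```python
from collections import defaultdict

def fitted_q_iteration(transitions, actions, n_iters=50, gamma=0.9):
    # transitions: iterable of (state, action, reward, next_state, done)
    Q = defaultdict(float)
    for _ in range(n_iters):
        # Build regression targets from the current Q estimate...
        targets = {}
        for s, a, r, s2, done in transitions:
            best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
            targets[(s, a)] = r + gamma * best_next
        # ...then "fit": here a table simply absorbs the targets.
        Q.update(targets)
    return Q

# A single self-looping state with reward 1 converges toward 1 / (1 - gamma) = 10.
Q = fitted_q_iteration([("s", "up", 1.0, "s", False)], actions=["up"])
```

The batch nature of the loop, refitting on the whole transition set each iteration rather than updating online, is what makes the algorithm data-efficient on fixed clinical records.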
  • Clustering categorical data: A stability analysis framework

    Page(s): 58 - 65

Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but k-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with `noisy' data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. This is often the case with real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and that therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initialization using a generic landscape mapping of k-modes solutions. The second utilizes the landscape map to stabilize the partition clusters for discrete data by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset, and a case study involving Public Health data.

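The instability described above stems from how k-modes summarizes a cluster: the per-attribute mode, unlike the mean, can flip outright when a few records change. A minimal sketch of the two ingredients, the simple-matching distance and the cluster mode, with function names of our own choosing:

```python
from collections import Counter

def matching_dissimilarity(a, b):
    # Simple-matching distance: number of attributes on which two records differ.
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    # Per-attribute mode of a cluster. A single changed record can flip an
    # attribute's mode, which is why k-modes lacks the smoothing of the mean.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))
```

For example, `cluster_mode([("a", "x"), ("a", "y"), ("b", "x")])` yields the prototype `("a", "x")`; changing just one record's first attribute from "a" to "b" would flip that prototype entirely.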
  • A multi-Biclustering Combinatorial Based algorithm

    Page(s): 66 - 71

In recent years a large amount of information about genomes has been discovered, increasing the complexity of analysis and calling for more advanced techniques and algorithms. In many cases researchers use unsupervised clustering, but clustering's inability to solve a number of tasks requires new algorithms, so scientists have recently turned their attention to biclustering techniques. In this paper we propose a novel biclustering technique that we call the Combinatorial Biclustering Algorithm (BCA). This technique solves the following problems: 1) classification of data with respect to rows and columns together; 2) discovery of overlapping biclusters; 3) definition of the minimal number of rows and columns in biclusters; 4) finding all biclusters together. We apply our model to two synthetic data sets and one real biological data set and show the results.

  • A recommendation algorithm using positive and negative latent models

    Page(s): 72 - 79

This paper proposes an algorithm for recommender systems that uses both positive and negative latent user models. In recommending items to a user, recommender systems usually exploit item content information as well as the preferences of similar users. Various types of content information can be attached to items and these are useful for judging user preferences. For example, in movie recommendations, a movie record may include the director, the actors, and reviews. These types of information help systems calculate sophisticated user preferences. We first propose a probabilistic model that maps multi-attributed records into a low-dimensional feature space. The proposed model extends latent Dirichlet allocation to the handling of multi-attributed data. We derive an algorithm for estimating the model's parameters using the Gibbs sampling technique. Next, we propose a probabilistic model to calculate user preferences for items in the feature space. Finally, we develop a recommendation algorithm based on the probabilistic model that works efficiently for large quantities of items and user ratings. We use a publicly available movie corpus to evaluate the proposed algorithm empirically, in terms of both its recommendation accuracy and its processing efficiency.

  • Using gaming strategies for attacker and defender in recommender systems

    Page(s): 80 - 87

Ratings are a prominent factor in deciding the fate of any product in today's Internet market, and many people follow ratings in a genuine sense. Unfortunately, Sybil attacks can undermine the credibility of a genuine product. Influence-limiter algorithms in recommender systems have been used extensively to counter Sybil attacks, but these efforts have fallen short. This paper presents an approach to generating gaming strategies for the attacker and defender in a recommender system. In a given recommender system environment, attackers and defenders play the most crucial part in a gaming strategy. These game-theoretic strategies represent the sequence of decision rules that an attacker or defender may use to achieve a desired goal, and the defenders efficiently counter the Sybil attacks mounted by the attackers. In our approach, we define attack graphs, use cases, and misuse cases in our gaming framework to analyze the vulnerabilities and security measures incorporated in a recommender system.

  • Active classifier training with the 3DS strategy

    Page(s): 88 - 95

In this article, we introduce and investigate 3DS, a novel selection strategy for pool-based active training of a generative classifier, namely CMM (classifier based on a probabilistic mixture model). Such a generative classifier aims at modeling the processes underlying the “generation” of the data. The strategy 3DS considers the distance of samples to the decision boundary, the density in regions where samples are selected, and the diversity of samples in the query set that are chosen for labeling, e.g., by a human domain expert. The combination of the three measures in 3DS is adaptive in the sense that the weights of the distance and the density measure depend on the uniqueness of the classification. With nine benchmark data sets it is shown that 3DS outperforms a random selection strategy (baseline method), a pure closest sampling approach, ITDS (information theoretic diversity sampling), DWUS (density-weighted uncertainty sampling), DUAL (dual strategy for active learning), and PBAC (prototype based active learning) regarding evaluation criteria such as ranked performance based on classification accuracy, number of labeled samples (data utilization), and learning speed assessed by the area under the learning curve.

  • Partially supervised k-harmonic means clustering

    Page(s): 96 - 103

A popular algorithm for finding clusters in unlabeled data optimizes the k-means clustering model. This algorithm converges quickly but is sensitive to initialization. Two ways to overcome this drawback are fuzzification and harmonic means. We show that k-harmonic means is a special case of reformulated fuzzy k-means. The main focus of this paper is on partially supervised clustering. Partially supervised clustering finds clusters in data sets that contain both unlabeled and labeled data. We review partially supervised k-means, partially supervised fuzzy k-means, and introduce a partially supervised extension of k-harmonic means. Experiments with four benchmark data sets indicate that partially supervised k-harmonic means inherits the advantages of its completely unsupervised variant: It is significantly less sensitive to initialization than partially supervised k-means.

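The key difference from k-means is the objective: instead of each point's distance to its closest centre, k-harmonic means sums a harmonic average over all centres, which softens the influence of any single initial prototype. A one-dimensional sketch of the objective (illustrative only; the symbols, the p = 2 default, and the eps guard are our assumptions):

```python
def khm_objective(points, centers, p=2, eps=1e-12):
    # Sum over points of k / sum_j 1/d(x, c_j)^p : k times the harmonic mean
    # of the distances to all k centres, rather than the minimum distance
    # used by k-means. eps guards against division by zero.
    k = len(centers)
    total = 0.0
    for x in points:
        denom = sum(1.0 / (abs(x - c) ** p + eps) for c in centers)
        total += k / denom
    return total
```

When every point sits on some centre the objective is (near) zero; pulling centres away from the data raises it smoothly, with no hard switch at cluster boundaries, which is the source of the reduced sensitivity to initialization.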
  • Local neighbourhood extension of SMOTE for mining imbalanced data

    Page(s): 104 - 111

In this paper we discuss the problems of inducing classifiers from imbalanced data and of improving recognition of the minority class using focused resampling techniques. We are particularly interested in the SMOTE over-sampling method, which generates new synthetic examples of the minority class between the closest neighbours from this class. However, SMOTE can also overgeneralize the minority class region, as it does not consider the distribution of other neighbours from the majority classes. Therefore, we introduce a new generalization of SMOTE, called LN-SMOTE, which exploits more precisely the information about the local neighbourhood of the considered examples. In the experiments we compare this method with the original SMOTE and its two most closely related generalizations, Borderline-SMOTE and Safe-Level-SMOTE. All these pre-processing methods are applied together with either decision tree or Naive Bayes classifiers. The results show that the new LN-SMOTE method improves evaluation measures for the minority class.

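For context, plain SMOTE creates a synthetic minority example by interpolating between a minority sample and one of its minority-class nearest neighbours; LN-SMOTE additionally inspects the local neighbourhood for majority-class examples before doing so. A sketch of the basic interpolation step (the nearest-neighbour search is assumed to have been done already, and the function name is ours):

```python
import numpy as np

def smote_sample(x, minority_neighbors, rng=None):
    # Pick one of x's minority-class nearest neighbours at random and
    # return a point a random fraction of the way along the segment to it.
    rng = rng or np.random.default_rng(0)
    nn = minority_neighbors[rng.integers(len(minority_neighbors))]
    gap = rng.random()  # in [0, 1): how far toward the neighbour to move
    return x + gap * (nn - x)

synthetic = smote_sample(np.array([0.0, 0.0]), np.array([[1.0, 1.0]]))
```

Because every synthetic point lies on a segment between two minority examples, segments that cross into majority-class territory overgeneralize the minority region, which is exactly the case LN-SMOTE's neighbourhood check is meant to catch.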
  • FGMAC: Frequent subgraph mining with Arc Consistency

    Page(s): 112 - 119

With the growing need to analyze large amounts of structured data such as chemical compounds, protein structures, and XML documents, to cite but a few, graph mining has become an attractive track and a real challenge in the data mining field. Among the various kinds of graph patterns, frequent subgraphs seem to be relevant for characterizing graph sets, discriminating between different groups of sets, and classifying and clustering graphs. Because of the NP-completeness of the subgraph isomorphism test as well as the huge search space, fragment miners are exponential in runtime and/or memory consumption. In this paper we study a new polynomial projection operator named AC-Projection, based on a key technique of constraint programming, namely Arc Consistency (AC), intended to replace the exponential subgraph isomorphism test. We study the relevance of frequent AC-reduced graph patterns for classification and show that we can achieve an important performance gain with little or no loss in the quality of the discovered patterns.

  • Mining toxicity structural alerts from SMILES: A new way to derive Structure Activity Relationships

    Page(s): 120 - 127

Encouraged by recent legislation around the world aimed at protecting human health and the environment, in silico techniques have proved their ability to assess the toxicity of chemicals. However, they often act like a black box, contributing little to scientific insight; such over-optimized methods may be beyond understanding, behaving more like competitors of human experts' knowledge than like assistants. In this work, a new Structure-Activity Relationship (SAR) approach is proposed to mine molecular fragments that act as structural alerts for biological activity. The entire process is designed to fit with human reasoning, not only to make its predictions more reliable but also to give the user clear control in order to match customized requirements. The approach has been implemented and tested on the mutagenicity endpoint, showing marked prediction skill and, more interestingly, recovering much of the knowledge already collected in the literature as well as discovering new evidence. The resulting tool is a powerful instrument both for SAR knowledge discovery and for activity prediction on untested compounds.

  • Logistic sub-models for small size populations in credit scoring

    Page(s): 128 - 134

Credit scoring risk management is a fast-growing field due to consumers' credit requests. Credit requests of new and existing customers are often evaluated by classical discrimination rules based on customer information. However, these strategies have serious limits and do not take into account the differences in characteristics between current customers and future ones. The aim of this paper is to measure creditworthiness for non-customer borrowers and to model potential risk given a heterogeneous population formed by borrowers who are customers of the bank and others who are not. We build on previous work in generalized discrimination and transpose it into the logistic model to derive efficient discrimination rules for the non-customer subpopulation. We thereby obtain seven simple models connecting the parameters of the two logistic models associated with the two subpopulations. The German credit data set is selected as the experimental data for comparing the seven models. Experimental results show that using the links between the two subpopulations improves classification accuracy for new loan applicants.

  • Periodic quick test for classifying long-term activities

    Page(s): 135 - 140

A novel method to classify long-term human activities is presented in this study. The method consists of two parts: a quick test and periodic classification. The quick test uses temporal information to improve recognition accuracy, while the periodic classification is based on the assumption that the recognized activities are long-term. Periodic quick test (PQT) classification was tested using a data set consisting of six long-term sports exercises. The data were collected from six persons wearing a two-dimensional accelerometer on the wrist. The results show that the presented method is not only faster than a normal method that does not use temporal information and does not assume that activities are long-term, but also more accurate. The results were compared with a normal sliding window technique, which divides the signal into smaller sequences and classifies each sequence into one of the six classes. The classification accuracy using the normal method was around 84%, while using PQT the recognition rate was over 90%. In addition, the number of classified sequences using the normal method was over six times higher than using PQT.

  • Data mining driven agents for predicting online auction's end price

    Page(s): 141 - 147

Auctions can be characterized by the distinct nature of their feature space. This feature space may include opening price, closing price, average bid rate, bid history, seller and buyer reputation, number of bids, and many more. In this paper, a clustering-based method is used to forecast the end price of an online auction for an autonomous agent-based system. In the proposed model, the input auction space is partitioned into groups of similar auctions by the k-means clustering algorithm. The recurrent problem of choosing the value of k in the k-means algorithm is solved by employing the elbow method with one-way analysis of variance (ANOVA). Then k regression models are employed to estimate the forecast price of an online auction. Based on the transformed data after clustering and the characteristics of the current auction, a bid selector nominates the regression model for the current auction whose price is to be forecast. Our results show improvements in end-price prediction for each cluster, which supports the proposed clustering-based model for bid prediction in the online auction environment.

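The elbow heuristic picks k where the within-cluster sum of squares (WCSS) stops dropping sharply as k grows. One common formalization, chosen here purely for illustration (the paper couples the choice with one-way ANOVA, which this sketch omits), takes the k with the maximum second difference of the WCSS curve:

```python
import numpy as np

def elbow_k(wcss_values):
    """Pick k at the sharpest bend of the WCSS curve.

    wcss_values[i] is the within-cluster sum of squares for k = i + 1.
    The elbow is the k with the largest (discrete) second difference.
    """
    w = np.asarray(wcss_values, dtype=float)
    second_diff = w[:-2] - 2.0 * w[1:-1] + w[2:]  # curvature at k = 2 .. K-1
    return int(np.argmax(second_diff)) + 2

# WCSS falls steeply from k=1 to k=2, then flattens out: elbow at k=2.
k = elbow_k([100.0, 40.0, 35.0, 33.0, 32.0])
```

The chosen k then fixes how many cluster-specific regression models are trained downstream.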
  • A robust F-measure for evaluating discovered process models

    Page(s): 148 - 155

Within process mining research, one of the most important fields of study is process discovery, which can be defined as the extraction of control-flow models from audit trails or information system event logs. The evaluation of discovered process models is an essential but difficult task for any process discovery analysis. With this paper, we propose a novel approach for evaluating discovered process models based on artificially generated negative events. This approach allows for the definition of a behavioral F-measure for discovered process models, which is the main contribution of this paper.

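The F-measure combination itself has the familiar form: artificially generated negative events make it possible to estimate a behavioral precision (how much disallowed behavior the model nevertheless permits) alongside behavioral recall (how much logged behavior it replays), and the two are folded into one score. A generic sketch, with the beta-weighted form as an assumption about the paper's exact definition:

```python
def f_measure(recall, precision, beta=1.0):
    # Weighted harmonic mean of behavioural recall and precision;
    # beta = 1 gives the standard F1 score.
    if recall == 0.0 and precision == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

An overly general "flower" model that replays every trace (recall 1.0) but also allows most negative events (say precision 0.5) scores only about 0.667, which is the penalty a recall-only fitness metric would miss.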
  • Efficient accelerometer-based swimming exercise tracking

    Page(s): 156 - 161

This study concentrates on tracking swimming exercises based on data from a 3D accelerometer and shows that human activities can be tracked accurately using low sampling rates. The tracking of a swimming exercise is done in three phases: first the swimming style and turns are recognized, second the number of strokes is counted, and third the intensity of swimming is estimated. Tracking uses efficient methods, because the methods presented in the study are designed for light applications that do not allow heavy computing. To keep tracking as light as possible, we study the lowest sampling frequency that can be used while still obtaining accurate results. Moreover, two different sensor placements (wrist and upper back) are compared. The results show that tracking can be done with high accuracy using simple methods that are fast to compute and a very low sampling frequency. An upper-back-worn sensor is more accurate than a wrist-worn one for recognizing the swimming style, but for counting strokes and estimating intensity the two placements give approximately equally accurate results.
