IEEE Transactions on Knowledge and Data Engineering

Issue 11 • November 2013

  • A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams

    Page(s): 2421 - 2433

    Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal component analysis (PCA) has been extensively applied to detecting anomalies in network data streams. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, or anomaly localization. In this paper, we propose novel sparse PCA methods to perform anomaly detection and localization for network data streams. Our key observation is that we can localize anomalies by identifying a sparse low-dimensional space that captures the abnormal events in data streams. To better capture the sources of anomalies, we incorporate the structure information of the network stream data in our anomaly localization framework. Furthermore, we extend our joint sparse PCA framework with a multidimensional Karhunen-Loève expansion that considers both the spatial and temporal domains of data streams to stabilize localization performance. We have performed comprehensive experimental studies of the proposed methods and have compared them with the state of the art using three real-world data sets from different application domains. Our experimental studies demonstrate the utility of the proposed methods.
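    As a rough illustration of the sparse-PCA idea above (not the paper's joint sparse PCA algorithms), the sketch below uses scikit-learn's generic SparsePCA on synthetic stream data: the nonzero loadings of the sparse components point to the dimensions (sources) that drive the captured variation, which is the intuition behind localization. The data, parameters, and threshold are illustrative assumptions.

```python
# Illustrative only: generic SparsePCA, not the paper's joint sparse PCA.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
# Simulated network data stream: 500 time steps x 20 sources (links/nodes).
X = rng.normal(size=(500, 20))
X[450:, 3] += 5.0      # inject an anomaly on source 3 near the end
X[450:, 17] += 4.0     # and on source 17

# Sparse PCA yields components with many zero loadings; the nonzero entries
# point to the sources that dominate the captured (abnormal) variation.
spca = SparsePCA(n_components=2, alpha=2.0, random_state=0)
spca.fit(X - X.mean(axis=0))

loadings = np.abs(spca.components_).max(axis=0)
suspects = np.nonzero(loadings > 1e-6)[0]
print("candidate anomalous sources:", suspects)
```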

  • An Evaluation of Model-Based Approaches to Sensor Data Compression

    Page(s): 2434 - 2447

    As the volumes of sensor data being accumulated are likely to soar, data compression has become essential in a wide range of sensor-data applications. This has led to a plethora of data compression techniques for sensor data; in particular, model-based approaches have been spotlighted due to their significant compression performance. These methods, however, have never been compared and analyzed under the same setting, which makes the "right" choice of compression technique for a particular application very difficult. Addressing this problem, this paper presents a benchmark that offers a comprehensive empirical study of the performance of model-based compression techniques. Specifically, we reimplemented several state-of-the-art methods in a comparable manner and measured various performance factors with our benchmark, including compression ratio, computation time, model maintenance cost, approximation quality, and robustness to noisy data. We then provide an in-depth analysis of the benchmark results, obtained by using 11 different real data sets consisting of 346 heterogeneous sensor data signals. We believe that the findings from the benchmark can serve as a practical guideline for applications that need to compress sensor data.
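    For readers unfamiliar with model-based sensor compression, the sketch below shows one classic representative, a piecewise-constant approximation with an L-infinity error bound. It only illustrates the model-based idea, not any of the benchmarked methods; the data and error bound are made up.

```python
# Minimal sketch of a piecewise-constant, error-bounded compressor.
def piecewise_constant_compress(signal, epsilon):
    """Greedily emit (start_index, value) segments so that every sample in a
    segment differs from the stored value by at most epsilon."""
    segments = []
    seg_start, seg_min, seg_max = 0, signal[0], signal[0]
    for i, x in enumerate(signal[1:], start=1):
        lo, hi = min(seg_min, x), max(seg_max, x)
        if hi - lo <= 2 * epsilon:
            seg_min, seg_max = lo, hi          # x still fits in the segment
        else:
            segments.append((seg_start, (seg_min + seg_max) / 2))
            seg_start, seg_min, seg_max = i, x, x
    segments.append((seg_start, (seg_min + seg_max) / 2))
    return segments

data = [20.1, 20.2, 20.0, 20.3, 25.0, 25.1, 24.9]
print(piecewise_constant_compress(data, epsilon=0.25))
# -> two segments instead of seven samples; the compression ratio grows with
#    smoother data and looser error bounds.
```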

  • Category-Based Infidelity Bounded Queries over Unstructured Data Streams

    Page(s): 2448 - 2462

    We present the Caicos system, which supports continuous infidelity-bounded queries over a data stream where each data item (of the stream) belongs to multiple categories. Caicos is built from four primitives: Keywords, Queries, Data items, and Categories. A category is a virtual entity consisting of all data items that belong to it. The membership of a data item in a category is decided by evaluating a Boolean predicate (associated with each category) over the data item. Each data item and query, in turn, is associated with multiple keywords. Given a keyword query, unlike conventional unstructured data querying techniques that return the top-K documents, Caicos returns the top-K categories with infidelity less than the user-specified infidelity bound. Caicos is designed to continuously track the evolving information present in a highly dynamic data stream. It therefore computes the relevance of a category to the continuous keyword query using the data items occurring in the stream in the recent past (i.e., within the current "window"). To efficiently provide up-to-date answers to the continuous queries, Caicos needs to maintain the required metadata accurately. This requires addressing two subproblems: 1) identifying the "right" metadata that needs to be updated for providing accurate results and 2) updating the metadata in an efficient manner. We show that the problem of identifying the right metadata can be further broken down into two subparts. We model the first subpart as an inequality-constrained minimization problem and propose an innovative iterative algorithm for it. The second subpart requires us to build an efficient dynamic programming-based algorithm, which helps us find the right metadata that needs to be updated. Updating the metadata on multiple processors is a scheduling problem whose complexity is exponential in the length of the input; an approximate multiprocessor scheduling algorithm is therefore proposed. Experimental evaluation of Caicos using real-world dynamic data shows that Caicos is able to provide fidelity close to 100 percent using 45 percent fewer resources than the techniques proposed in the literature. This ability of Caicos to work accurately and efficiently even in scenarios with high data arrival rates makes it suitable for data-intensive application domains.

  • Dealing with Uncertainty: A Survey of Theories and Practices

    Page(s): 2463 - 2482

    Uncertainty accompanies our life processes and covers almost all fields of scientific studies. Two general categories of uncertainty, namely, aleatory uncertainty and epistemic uncertainty, exist in the world. While aleatory uncertainty refers to the inherent randomness in nature, derived from the natural variability of the physical world (e.g., the random outcome of a flipped coin), epistemic uncertainty originates from humans' lack of knowledge of the physical world, as well as our limited ability to measure and model it (e.g., the computation of the distance between two cities). Different kinds of uncertainty call for different handling methods. Aggarwal, Yu, Sarma, and Zhang et al. have provided good surveys of uncertain database management based on probability theory. This paper reviews multidisciplinary uncertainty-processing activities in diverse fields. Beyond the dominant probability theory and fuzzy theory, we also review information-gap theory and the recently derived uncertainty theory. Practices of these uncertainty-handling theories in the domains of economics, engineering, ecology, and information sciences are also described. It is our hope that this study can provide insights to the database community on how uncertainty is managed in other disciplines, and further challenge and inspire database researchers to develop more advanced data management techniques and tools to cope with the variety of uncertainty issues in the real world.

  • Distributed Autonomous Online Learning: Regrets and Intrinsic Privacy-Preserving Properties

    Page(s): 2483 - 2493

    Online learning has become increasingly popular for handling massive data. The sequential nature of online learning, however, requires a centralized learner to store data and update parameters. In this paper, we consider online learning with distributed data sources. Autonomous learners update local parameters based on local data sources and periodically exchange information with a small subset of neighbors in a communication network. We derive a regret bound for strongly convex functions that generalizes the work by Ram et al. for convex functions. More importantly, we show that our algorithm has intrinsic privacy-preserving properties, and we prove the sufficient and necessary conditions for privacy preservation in the network. These conditions imply that for networks with greater-than-one connectivity, a malicious learner cannot reconstruct the subgradients (and sensitive raw data) of other learners, which makes our algorithm appealing in privacy-sensitive applications.
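    The sketch below gives a generic flavor of the distributed setting described above: each learner averages parameters with its neighbors and then takes a local (sub)gradient step. It follows the general decentralized subgradient template rather than the paper's exact algorithm; the mixing rule, gradients, and step size are assumptions.

```python
# Generic decentralized online learning round (illustrative, not the paper's method).
import numpy as np

def distributed_online_step(params, neighbors, local_grads, step):
    """One round: each learner averages with its neighbors (consensus) and then
    applies a subgradient step computed from its own local data."""
    new_params = {}
    for i, nbrs in neighbors.items():
        mixed = np.mean([params[j] for j in nbrs], axis=0)    # information exchange
        new_params[i] = mixed - step * local_grads[i]          # local descent
    return new_params

# Three learners on a line graph; gradients would come from each learner's own data.
params = {i: np.zeros(2) for i in range(3)}
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
grads = {i: np.array([i + 1.0, -1.0]) for i in range(3)}
print(distributed_online_step(params, neighbors, grads, step=0.1))
```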

  • Efficient Cluster Labeling for Support Vector Clustering

    Page(s): 2494 - 2506

    We propose a new efficient algorithm for solving the cluster labeling problem in support vector clustering (SVC). The proposed algorithm analyzes the topology of the function describing the SVC cluster contours and explores interconnection paths between critical points separating distinct cluster contours. This process makes it possible to distinguish disjoint clusters and to associate each point with its respective cluster. The proposed algorithm implements a new fast method for detecting and classifying critical points while analyzing the interconnection patterns between them. Experiments indicate that the proposed algorithm significantly improves the accuracy of the SVC labeling process in the presence of clusters of complex shape, while reducing the processing time required by existing SVC labeling algorithms by orders of magnitude.

  • Efficient Index-Based Approaches for Skyline Queries in Location-Based Applications

    Page(s): 2507 - 2520

    Various new skyline queries that enrich location-based applications have been proposed and formulated based on the notion of locational dominance, which extends conventional dominance by taking objects' nearness to query positions into account in addition to objects' nonspatial attributes. To answer a representative class of skyline queries for location-based applications efficiently, this paper presents two index-based approaches, namely, the augmented R-tree and the dominance diagram. The augmented R-tree extends the R-tree by including aggregated nonspatial attributes in index nodes to enable dominance checks during index traversal. The dominance diagram is a solution-based approach in which each object is associated with a precomputed nondominance scope within which a query point leaves the corresponding object not locationally dominated by any other object. The dominance diagram enables skyline queries to be evaluated via parallel and independent comparisons between nondominance scopes and query points, providing very high search efficiency. The performance of these two approaches is evaluated via empirical studies, in comparison with other possible approaches.
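    The following sketch is one plausible formalization of locational dominance as described above (the paper's exact definition may differ): with respect to a query point, an object dominates another if it is at least as close and at least as good on every nonspatial attribute, and strictly better in at least one respect. All data values are made up.

```python
# Assumed formalization of locational dominance; smaller attribute values preferred.
import math

def locationally_dominates(a, b, q):
    """a, b: (location, attributes); q: query point."""
    da = math.dist(a[0], q)
    db = math.dist(b[0], q)
    no_worse = da <= db and all(x <= y for x, y in zip(a[1], b[1]))
    better = da < db or any(x < y for x, y in zip(a[1], b[1]))
    return no_worse and better

hotel_a = ((0.0, 0.0), (120,))   # location, (price,)
hotel_b = ((3.0, 4.0), (150,))
print(locationally_dominates(hotel_a, hotel_b, q=(0.0, 0.0)))  # True
```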

  • Efficient Skyline Computation on Big Data

    Page(s): 2521 - 2535

    Skyline is an important operation in many applications that returns a set of interesting points from a potentially huge data space. Given a table, the operation finds all tuples that are not dominated by any other tuples. The existing algorithms cannot process skylines on big data efficiently. This paper presents a novel skyline algorithm, SSPL, for big data. SSPL utilizes sorted positional index lists, which require low space overhead, to reduce I/O cost significantly. A sorted positional index list Lj is constructed for each attribute Aj and is arranged in ascending order of Aj. SSPL consists of two phases. In phase 1, SSPL computes the scan depth of the involved sorted positional index lists. While retrieving the lists in a round-robin fashion, SSPL prunes any candidate positional index whose corresponding tuple is not a skyline result. Phase 1 ends when there is a candidate positional index seen in all of the involved lists. In phase 2, SSPL exploits the obtained candidate positional indexes to compute the skyline results by a selective and sequential scan of the table. Experimental results on synthetic and real data sets show that SSPL has a significant advantage over existing skyline algorithms.
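    To make the dominance notion above concrete, the sketch below shows a naive skyline computation (assuming smaller values are preferred on every attribute); it clarifies the definition only and is not the SSPL algorithm.

```python
# Dominance and a naive skyline, for illustration of the definition only.
def dominates(a, b):
    """a dominates b if a is no worse on every attribute and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def naive_skyline(tuples):
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]

hotels = [(3, 120), (1, 200), (2, 150), (4, 100), (2, 180)]  # (distance, price)
print(naive_skyline(hotels))   # (2, 180) is dominated by (2, 150) and dropped
```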

  • Incremental Maintenance of the Minimum Bisimulation of Cyclic Graphs

    Page(s): 2536 - 2550

    There have been numerous recent applications of graph databases (e.g., the Semantic Web, ontology representation, social networks, XML, chemical databases, and biological databases). A fundamental structural index for data graphs, namely the minimum bisimulation, has been reported useful for efficient path query processing and optimization, including selectivity estimation, among many others. Data graphs are subject to change, and their indexes must be updated accordingly. This paper studies the incremental maintenance problem of the minimum bisimulation of a possibly cyclic data graph. While cyclic graphs are ubiquitous among the data on the web, previous work on the maintenance problem has mostly focused on acyclic graphs. To study the problem with cyclic graphs, we first show that the two existing classes of minimization algorithms, merging algorithms and partition refinement, have their strengths and weaknesses. Second, we propose a novel hybrid algorithm and its analytical model. This algorithm supports an edge insertion or deletion and two forms of batch insertions or deletions. To the best of our knowledge, this is the first maintenance algorithm that guarantees minimum bisimulation of cyclic graphs. Third, we propose to partially reuse the minimum bisimulation before an update in order to optimize maintenance performance. We present an experimental study on both synthetic and real data graphs that verifies the efficiency and effectiveness of our algorithms.

  • Large-Scale Personalized Human Activity Recognition Using Online Multitask Learning

    Page(s): 2551 - 2563

    Personalized activity recognition usually suffers from highly biased activity patterns across different tasks/persons. Traditional methods have difficulty dealing with these conflicting activity patterns. We model the activity patterns among different persons effectively by casting personalized activity recognition as a multitask learning problem. We propose a novel online multitask learning method for large-scale personalized activity recognition. In contrast with existing work on multitask learning that assumes fixed task relationships, our method can automatically discover task relationships from real-world data. Convergence analysis shows reasonable convergence properties of the proposed method. Experiments on two different activity data sets demonstrate that the proposed method significantly outperforms existing methods in activity recognition.

  • Multidimensional Sequence Classification Based on Fuzzy Distances and Discriminant Analysis

    Page(s): 2564 - 2575

    In this paper, we present a novel method for multidimensional sequence classification. We propose a novel sequence representation based on a sequence's fuzzy distances from optimal representative signal instances, called statemes. We also propose a novel modified clustering discriminant analysis algorithm that minimizes the adopted criterion with respect to both the data projection matrix and the class representation, leading to the optimal discriminant sequence class representation in a low-dimensional space. Based on this representation, simple classification algorithms, such as the nearest subclass centroid, provide high classification accuracy. A three-step iterative optimization procedure for choosing the statemes, the optimal discriminant subspace, and the optimal sequence class representation in the final decision space is proposed. The classification procedure is fast and accurate. The proposed method has been tested on a wide variety of multidimensional sequence classification problems, including handwritten character recognition, time series classification, and human activity recognition, providing very satisfactory classification results.

  • On Nonparametric Ordinal Classification with Monotonicity Constraints

    Page(s): 2576 - 2589

    We consider the problem of ordinal classification with monotonicity constraints. It differs from usual classification by handling background knowledge about ordered classes, ordered domains of attributes, and a monotonic relationship between an object's evaluation on the attributes and its class assignment. In other words, the class label (output variable) should not decrease when attribute values (input variables) increase. Although this problem is of great practical importance, it has received relatively little attention in machine learning. Among existing approaches to learning with monotonicity constraints, the most general is the nonparametric approach, where no assumption is made apart from the monotonicity constraints. The main contribution of this paper is an analysis of the nonparametric approach from a statistical point of view. To this end, we first provide a statistical framework for classification with monotonicity constraints. Then, we focus on learning in the nonparametric setting and consider two approaches: the "plug-in" method (classification by first estimating the class conditional distribution) and the direct method (classification by minimization of the empirical risk). We show that these two methods are very closely related. We also perform a thorough theoretical analysis of their statistical and computational properties, confirmed in a computational experiment.
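    A tiny example of a nonparametric rule that respects the monotonicity constraint described above (not the exact plug-in or empirical-risk estimators analyzed in the paper): predict for an object the highest class label observed on any training object it dominates attribute-wise. The training data below is made up.

```python
# Illustrative monotone prediction rule; labels are assumed ordered integers.
def monotone_predict(x, train):
    """train: list of (attribute_vector, class_label).
    Returns the highest label among training objects dominated by x, which
    guarantees the prediction never decreases as attribute values increase."""
    labels = [y for (v, y) in train if all(a >= b for a, b in zip(x, v))]
    return max(labels) if labels else min(y for _, y in train)

train = [((1, 1), 0), ((2, 3), 1), ((4, 4), 2)]
print(monotone_predict((3, 3), train))   # 1: dominates (1, 1) and (2, 3)
print(monotone_predict((5, 5), train))   # 2: dominates all training objects
```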

  • Practical Ensemble Classification Error Bounds for Different Operating Points

    Page(s): 2590 - 2601

    Classification algorithms used to support the decisions of human analysts are often applied in settings in which zero-one loss is not the appropriate indication of performance. The zero-one loss corresponds to the operating point with equal costs for false alarms and missed detections, and no option for the classifier to leave uncertain test samples unlabeled. A generalization bound for ensemble classification at this standard operating point has previously been developed, using the Chebyshev inequality, in terms of two interpretable properties of the ensemble: strength and correlation. Generalization bounds for other operating points have not been developed previously and are derived in this paper. Significantly, the bounds are empirically shown to have much practical utility in determining optimal parameters for classification with a reject option, classification for ultralow probability of false alarm, and classification for ultralow probability of missed detection. Counter to the usual guideline of large strength and small correlation in the ensemble, different guidelines are recommended by the derived bounds in the ultralow false alarm and missed detection probability regimes.

  • Privacy Preserving Policy-Based Content Sharing in Public Clouds

    Page(s): 2602 - 2614

    An important problem in public clouds is how to selectively share documents based on fine-grained attribute-based access control policies (acps). One approach is to encrypt documents satisfying different policies with different keys using a public key cryptosystem such as attribute-based encryption and/or proxy re-encryption. However, such an approach has some weaknesses: it cannot efficiently handle adding/revoking users or identity attributes and policy changes; it requires keeping multiple encrypted copies of the same documents; and it incurs high computational costs. A direct application of a symmetric key cryptosystem, where users are grouped based on the policies they satisfy and unique keys are assigned to each group, has similar weaknesses. We observe that, without using public key cryptography and by allowing users to dynamically derive the symmetric keys at the time of decryption, one can address these weaknesses. Based on this idea, we formalize a new key management scheme, called broadcast group key management (BGKM), and then give a secure construction of a BGKM scheme called ACV-BGKM. The idea is to give users some secrets based on the identity attributes they have and later allow them to derive the actual symmetric keys based on their secrets and some public information. A key advantage of the BGKM scheme is that adding/revoking users or updating acps can be performed efficiently by updating only some public information. Using our BGKM construction, we propose an efficient approach for fine-grained encryption-based access control for documents stored in untrusted cloud file storage.
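    The sketch below conveys only the general flavor of broadcast group key management, not the ACV-BGKM construction: each user holds a long-lived secret, the server publishes per-group public information, and the group key is derived from both, so rekeying changes only the public part. In the actual construction the public information is built so that only currently authorized users' secrets derive the correct key; this toy version omits that machinery, and all names are hypothetical.

```python
# Toy key-derivation flavor only; NOT a secure BGKM scheme.
import hashlib
import os

def derive_group_key(user_secret: bytes, public_info: bytes) -> bytes:
    # Group key depends on a per-user secret plus server-published public info.
    return hashlib.sha256(user_secret + public_info).digest()

user_secret = os.urandom(32)     # handed to the user once, at registration
public_info = os.urandom(16)     # posted by the server for this policy group
key_v1 = derive_group_key(user_secret, public_info)

# Rekeying after a policy change touches only the public part.
key_v2 = derive_group_key(user_secret, os.urandom(16))
print(key_v1 != key_v2)          # True
```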

  • Profiling Moving Objects by Dividing and Clustering Trajectories Spatiotemporally

    Page(s): 2615 - 2628

    An object can move with various speeds and in arbitrarily changing directions. Given a bounded area in which a set of objects moves around, there are typical moving styles of the objects in different local regions due to the nature of the geography or other spatiotemporal conditions. We want to know not only the paths along which the objects move, but also how different groups of objects move at various speeds. Therefore, given a set of collected trajectories spreading over a bounded area, we are interested in discovering the typical moving styles in different regions of all the monitored moving objects. These regional typical moving styles are regarded as the profile of the monitored moving objects, which may help reflect the geoinformation of the observed area and the moving behaviors of the observed moving objects. In this paper, we present DivCluST, an approach to finding regional typical moving styles by dividing and clustering the trajectories in consideration of both spatial and temporal constraints. Different from existing works that consider only the spatial properties or just the interesting regions of trajectories, DivCluST focuses more on typical movements in local regions of a bounded area and takes the temporal information into account when designing the criteria for trajectory dividing and the distance measure for adaptive k-means clustering. Extensive experiments on three types of real data sets with specially designed visualization are presented to show the effectiveness of DivCluST.
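    As a toy illustration of spatiotemporal trajectory dividing in the spirit of DivCluST (the actual dividing criteria and the adapted k-means distance are defined in the paper), the sketch below cuts a trajectory whenever speed or heading changes beyond assumed thresholds; the resulting segments would then be fed to clustering.

```python
# Toy trajectory dividing by speed/heading change; thresholds are assumptions.
import math

def divide_trajectory(points, speed_jump=2.0, angle_jump=math.radians(45)):
    """points: list of (x, y, t) samples; returns a list of sub-trajectories,
    cut whenever speed or heading changes sharply between consecutive steps.
    (Angle wraparound is ignored for brevity.)"""
    segments, current = [], [points[0]]
    prev_speed = prev_heading = None
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = max(t1 - t0, 1e-9)
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        heading = math.atan2(y1 - y0, x1 - x0)
        if prev_speed is not None and (abs(speed - prev_speed) > speed_jump or
                                       abs(heading - prev_heading) > angle_jump):
            segments.append(current)
            current = [(x0, y0, t0)]
        current.append((x1, y1, t1))
        prev_speed, prev_heading = speed, heading
    segments.append(current)
    return segments

walk_then_drive = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (12, 0, 3), (22, 0, 4)]
print(len(divide_trajectory(walk_then_drive)))   # 2 segments: slow part, fast part
```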

  • The BoND-Tree: An Efficient Indexing Method for Box Queries in Nonordered Discrete Data Spaces

    Page(s): 2629 - 2643

    Box queries (or window queries) are a type of query that specifies a set of allowed values in each dimension. Indexing feature vectors in multidimensional Nonordered Discrete Data Spaces (NDDSs) for efficient box queries is becoming increasingly important in many application domains, such as genome sequence databases. Most of the existing work in this field targets similarity queries (range queries and k-NN queries). Box queries, however, are fundamentally different from similarity queries. Hence, indexing schemes designed for similarity queries may not be efficient for box queries. In this paper, we present a new indexing structure, the BoND-tree, specifically designed for box queries in NDDSs. Unique characteristics of NDDSs are exploited to develop new node splitting heuristics, and we provide a theoretical analysis to show the optimality of the proposed heuristics. Extensive experiments with synthetic data demonstrate that the proposed scheme is significantly more efficient than existing ones when applied to support box queries in NDDSs. We also show the effectiveness of the proposed scheme in a real-world application of primer design for genome sequence databases.
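    The snippet below only illustrates the box-query semantics over a nonordered discrete space (here, the genome alphabet): the query fixes a set of allowed values per dimension, and a vector matches if every dimension's value is allowed. It says nothing about the BoND-tree index itself.

```python
# Box-query semantics in a nonordered discrete data space (illustration only).
def matches_box(vector, box):
    return all(v in allowed for v, allowed in zip(vector, box))

box = [{"A", "G"}, {"C"}, {"A", "C", "T"}]        # allowed values per dimension
print(matches_box(("A", "C", "T"), box))           # True
print(matches_box(("G", "T", "A"), box))           # False: "T" not allowed in dim 2
```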

  • Toward Efficient Filter Privacy-Aware Content-Based Pub/Sub Systems

    Page(s): 2644 - 2657

    In recent years, content-based publish/subscribe [12], [22] has become a popular paradigm to decouple information producers and consumers with the help of brokers. Unfortunately, when users register their personal interests with the brokers, the privacy pertaining to filters defined by honest subscribers can easily be exposed by untrusted brokers, and this situation is further aggravated by collusion attacks between untrusted brokers and compromised subscribers. To protect the filter privacy, we introduce an anonymizer engine to separate the role of the broker into two parts, and adapt the k-anonymity and ℓ-diversity models to content-based pub/sub. When the anonymization model is applied to protect the filter privacy, there is an inherent tradeoff between the anonymization level and the publication redundancy. By leveraging partial-order-based generalization of filters to track filters satisfying k-anonymity and ℓ-diversity, we design algorithms to minimize the publication redundancy. Our experiments show that the proposed scheme, when compared with studied counterparts, has a smaller forwarding cost while achieving comparable attack resilience.

  • Bias Correction in a Small Sample from Big Data

    Page(s): 2658 - 2663

    This paper discusses the bias problem that arises when estimating the population size of big data, such as online social networks (OSNs), using uniform random sampling and simple random walks. Unlike the traditional estimation problem, where the sample size is not very small relative to the data size, in big data a sample that is small relative to the data size is already very large and costly to obtain. We point out that when such small samples are used, there is a bias that is no longer negligible. This paper shows analytically that the relative bias can be approximated by the reciprocal of the number of collisions; thereby, a bias-correction estimator is introduced. The result is further supported by both simulation studies and the real Twitter network, which contains 41.7 million nodes.
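    The sketch below shows the basic collision-based ("birthday paradox") population-size estimator that the result above builds on, together with the stated 1/collisions relative-bias rule of thumb; the paper's exact bias-corrected estimator is not reproduced here, and the simulated population is made up.

```python
# Collision-based population-size estimate from a uniform random sample.
import random
from collections import Counter

def collision_estimate(sample):
    k = len(sample)
    counts = Counter(sample)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())  # colliding pairs
    if collisions == 0:
        return None, 0
    n_hat = k * (k - 1) / (2 * collisions)   # estimated population size
    return n_hat, collisions

random.seed(1)
population = range(1_000_000)
sample = [random.choice(population) for _ in range(5_000)]
n_hat, c = collision_estimate(sample)
if n_hat is not None:
    print(f"estimate: {n_hat:,.0f} from {c} collisions "
          f"(relative bias on the order of 1/{c})")
```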

  • IEEE Open Access Publishing

    Page(s): 2664

Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.


Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University