
2013 IEEE International Congress on Big Data (BigData Congress)

Date: June 27 - July 2, 2013


Displaying Results 1 - 25 of 80
  • [Front cover]

    Publication Year: 2013, Page(s): C4
  • [Title page i]

    Publication Year: 2013, Page(s): i
  • [Title page iii]

    Publication Year: 2013, Page(s): iii
  • [Copyright notice]

    Publication Year: 2013, Page(s): iv
  • Table of contents

    Publication Year: 2013, Page(s): v - x
  • Message from the General and Program Chairs

    Publication Year: 2013, Page(s): xi
  • Organizing Committee

    Publication Year: 2013, Page(s): xii - xiv
  • Program Committee

    Publication Year: 2013, Page(s): xv - xvi
  • Support team

    Publication Year: 2013, Page(s): xvii
  • IEEE Computer Society Technical Committee on Services Computing

    Publication Year: 2013, Page(s): xviii
  • A Database-Hadoop Hybrid Approach to Scalable Machine Learning

    Publication Year: 2013, Page(s): 1 - 8

    There are two popular schools of thought for performing large-scale machine learning on data that does not fit into memory. One is to run machine learning within a relational database management system; the other is to push analytical functions into MapReduce. As each approach has its own pros and cons, we propose a database-Hadoop hybrid approach to scalable machine learning in which batch learning is performed on the Hadoop platform while incremental learning is performed on PostgreSQL. We propose a purely relational approach that removes the scalability limitation of previous approaches based on user-defined aggregates, and we also discuss issues and resolutions in applying the proposed approach to Hadoop/Hive. Experimental evaluations of classification performance and training speed were conducted using a commercial advertisement dataset provided in KDD Cup 2012, Track 2. The results show that our scheme has competitive classification performance and superior training speed compared with state-of-the-art scalable machine learning frameworks: 5 and 7.65 times faster than Vowpal Wabbit and Bismarck, respectively, for a regression task.

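The batch side of this hybrid pattern — pushing gradient computation into MapReduce-style map and reduce phases — can be sketched compactly. The toy Python sketch below simulates the two phases in memory on invented data; the feature layout, learning rate, and update loop are illustrative assumptions, not the paper's actual design:

```python
# Batch gradient descent for logistic regression, phrased as map/reduce.
# Each "mapper" emits a partial gradient for its data shard; the "reducer"
# sums the partials and the driver applies the update. Toy data, one machine.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def map_partial_gradient(shard, w):
    """Mapper: partial gradient of the logistic loss over one shard."""
    g = [0.0] * len(w)
    for x, y in shard:                       # x: feature vector, y: 0/1 label
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            g[j] += (p - y) * xj
    return g

def reduce_sum(partials):
    """Reducer: element-wise sum of the partial gradients."""
    return [sum(col) for col in zip(*partials)]

shards = [
    [([1.0, 0.2], 0), ([1.0, 0.9], 1)],
    [([1.0, 0.4], 0), ([1.0, 0.8], 1)],
]
w, lr = [0.0, 0.0], 0.5
for _ in range(100):                         # one "MapReduce job" per pass
    grad = reduce_sum([map_partial_gradient(s, w) for s in shards])
    w = [wi - lr * gi for wi, gi in zip(w, grad)]
print("learned weights:", w)
```
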
  • Performance Overhead among Three Hypervisors: An Experimental Study Using Hadoop Benchmarks

    Publication Year: 2013, Page(s): 9 - 16
    Cited by: Papers (2)

    Hypervisors are widely used in cloud environments, and their impact on application performance has been a topic of significant research and practical interest. We conducted experimental measurements of several benchmarks using Hadoop MapReduce to evaluate and compare the performance impact of three popular hypervisors: a commercial hypervisor (CVM), Xen, and KVM. We found that differences in workload type (CPU- or I/O-intensive), workload size, and VM placement yielded significant performance differences among the hypervisors. In our study, we used the three hypervisors to run several MapReduce benchmarks, such as WordCount, TestDFSIO, and TeraSort, and further validated our observations using microbenchmarks. For the CPU-bound benchmark, the performance difference between the three hypervisors was negligible; however, significant performance variations were seen for I/O-bound benchmarks. Moreover, adding more virtual machines on the same physical host degraded performance on all three hypervisors, yet we observed different degradation trends amongst them. Concretely, the commercial hypervisor is 46% faster at TestDFSIO Write than KVM but 49% slower in the TeraSort benchmark. In addition, increasing the workload size for TeraSort yielded completion times for CVM that were two times those of Xen and KVM. The performance differences between the hypervisors suggest that further analysis and consideration of hypervisors is needed when deploying applications to cloud environments.

  • Data Allocation in Scalable Distributed Database Systems Based on Time Series Forecasting

    Publication Year: 2013, Page(s): 17 - 24

    In cloud computing environments, database systems have to serve a large number of tenants instantaneously and handle applications with different load characteristics. To provide a high quality of service, scalable distributed database systems with self-provisioning are required: the number of working nodes is adjusted dynamically based on user demand, and data fragments are reallocated frequently for node-number adjustment and load balancing. The problem of data allocation is different from that in traditional distributed database systems, and therefore existing algorithms may not be applicable. In this paper, we first formally define the problem of data allocation in scalable distributed database systems. Then, we propose an algorithm for the problem. The algorithm makes use of time series models to perform short-term load forecasting so that node-number adjustment and fragment reallocation can be performed in advance, avoiding node overloading and the performance degradation caused by fragment migrations. In addition, the number of excess working nodes can be minimized to save resources.

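The abstract does not name the time series models used, but the core loop — forecast the next interval's load, then provision nodes ahead of demand — can be illustrated with simple exponential smoothing. In the sketch below, the smoothing factor, per-node capacity, and headroom are invented for the example:

```python
# Short-term load forecasting driving node-count adjustment: smooth the
# observed load, forecast the next interval, and provision nodes ahead of
# time. Exponential smoothing stands in for the paper's time series models.
import math

ALPHA = 0.6            # smoothing factor (assumed)
NODE_CAPACITY = 100.0  # load units one node can serve (assumed)

def forecast_next(history, alpha=ALPHA):
    """One-step-ahead forecast via simple exponential smoothing."""
    level = history[0]
    for obs in history[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

def nodes_needed(predicted_load, headroom=1.2):
    """Provision for the forecast plus 20% headroom, at least one node."""
    return max(1, math.ceil(headroom * predicted_load / NODE_CAPACITY))

load_history = [310, 320, 365, 400, 480, 520]   # observed load per interval
pred = forecast_next(load_history)
print(f"forecast load: {pred:.1f} -> provision {nodes_needed(pred)} nodes")
```
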
  • A Discussion of Privacy Challenges in User Profiling with Big Data Techniques: The EEXCESS Use Case

    Publication Year: 2013, Page(s): 25 - 30

    User profiling is the process of collecting information about a user in order to construct their profile. The information in a user profile may include various attributes such as geographical location, academic and professional background, membership in groups, interests, preferences, and opinions. Big data techniques enable collecting accurate and rich information for user profiles, in particular due to their ability to process unstructured as well as structured information in high volumes from multiple sources. Accurate and rich user profiles are important for applications such as recommender systems, which try to predict elements that a user has not yet considered but may find useful. The information contained in user profiles is personal, and thus there are privacy issues related to user profiling. In this position paper, we discuss user profiling with big data techniques and the associated privacy challenges. We also discuss the ongoing EU-funded EEXCESS project as a concrete example of constructing user profiles with big data techniques, along with the approaches being considered for preserving user privacy.

  • Approximate Two-Party Privacy-Preserving String Matching with Linear Complexity

    Publication Year: 2013, Page(s): 31 - 37

    Consider two parties who want to compare their strings, e.g., genomes, but do not want to reveal them to each other. We present a system for privacy-preserving matching of strings that differs from existing systems by providing a deterministic approximation instead of an exact distance. It is efficient (linear complexity), non-interactive, and does not involve a third party, which makes it particularly suitable for cloud computing. We extend our protocol such that it reveals only whether there is a match, not the exact distance. Further, an implementation of the system is evaluated and compared against current privacy-preserving string matching algorithms.

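The cryptographic protocol itself is the paper's contribution and is not reproduced here. As a plain, non-private illustration of a deterministic, linear-time approximation of string similarity, the sketch below compares character n-gram profiles instead of computing an exact edit distance; the n-gram size and Jaccard-style score are assumptions for the example:

```python
# Non-private baseline: approximate string similarity in linear time by
# comparing n-gram profiles rather than computing exact edit distance.
# A privacy-preserving protocol would exchange encodings of such profiles
# instead of the raw strings; that machinery is deliberately omitted here.
from collections import Counter

def ngram_profile(s, n=3):
    """Multiset of character n-grams, built in one linear pass."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def approx_similarity(a, b, n=3):
    """Jaccard-style overlap of n-gram profiles, in [0, 1]."""
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    inter = sum((pa & pb).values())
    union = sum((pa | pb).values())
    return inter / union if union else 1.0

print(approx_similarity("GATTACAGATTACA", "GATTACAGATTACA"))  # 1.0
print(approx_similarity("GATTACAGATTACA", "GATTTCAGATTACA"))  # high, < 1.0
```
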
  • Engineering Privacy for Big Data Apps with the Unified Modeling Language

    Publication Year: 2013, Page(s): 38 - 45

    This paper describes proposed privacy extensions to UML that help software engineers quickly visualize privacy requirements and design privacy into big data applications. To adhere to legal requirements and/or best practices, big data applications will need to apply Privacy by Design principles and use privacy services such as (but not limited to) anonymization, pseudonymization, security, notice on usage, and consent for usage. We extend UML with ribbon icons representing needed big data privacy services, and we further illustrate how privacy services can be usefully embedded in use case diagrams using containers. These extensions help software engineers visually and quickly model privacy requirements in the analysis phase, which is the longest phase in any software development effort. As proof of concept, a prototype based on our privacy extensions to Microsoft Visio's UML is created, and the utility of our UML privacy extensions to the Use Case Diagram artifact is illustrated using an IBM Watson-like commercial use case on big data in a health sector application.

  • Milieu: Lightweight and Configurable Big Data Provenance for Science

    Publication Year: 2013, Page(s): 46 - 53
    Cited by: Papers (4)

    The volume and complexity of data produced and analyzed in scientific collaborations is growing exponentially. It is important to track scientific data-intensive analysis workflows to provide context and reproducibility as data is transformed in these collaborations. Provenance addresses this need and aids scientists by providing the lineage, or history, of how data is generated, used, and modified. Provenance has traditionally been collected at the workflow level, which often makes it hard to capture relevant information about resource characteristics and difficult for users to incorporate into existing workflows. In this paper, we describe Milieu, a framework focused on the collection of provenance for scientific experiments on High Performance Computing systems. Our approach collects provenance in a minimally intrusive way, without significantly impacting the performance of the execution of scientific workflows. We also provide configurable fidelity by allowing users to specify three levels of provenance collection. We evaluate our framework on systems at the National Energy Research Scientific Computing Center (NERSC) and show that the overhead is less than the variation already experienced by these applications in these shared environments.

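A sketch of the configurable-fidelity idea: wrap a workflow step and record lineage at a user-chosen level of detail. The level names, captured fields, and JSON output below are invented for illustration and are not Milieu's actual interface:

```python
# Minimally intrusive provenance capture: wrap a workflow step and record a
# lineage entry at a chosen fidelity level. Levels and fields are illustrative.
import json, os, socket, time

LEVELS = {"basic": 1, "resource": 2, "full": 3}

def run_with_provenance(step_name, func, inputs, level="basic"):
    record = {"step": step_name, "inputs": list(inputs),
              "start": time.time()}
    if LEVELS[level] >= 2:                 # add resource characteristics
        record["host"] = socket.gethostname()
        record["pid"] = os.getpid()
    result = func(*inputs)
    record["end"] = time.time()
    if LEVELS[level] >= 3:                 # full fidelity: capture output too
        record["output"] = result
    print(json.dumps(record))              # stand-in for a provenance store
    return result

run_with_provenance("normalize", lambda xs: [x / 10 for x in xs],
                    ([5, 7, 9],), level="full")
```
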
  • Consistent Process Mining over Big Data Triple Stores

    Publication Year: 2013, Page(s): 54 - 61
    Cited by: Papers (1)

    'Big data' techniques are often adopted in cross-organization scenarios for integrating multiple data sources to extract statistics or other latent information. Even if these techniques do not require the support of a schema for processing data, a common conceptual model is typically defined to address name resolution. This implies that each local source is tasked with applying a semantic lifting procedure to express its local data in terms of the common model; semantic heterogeneity is then potentially introduced into the data. In this paper we illustrate a methodology designed for implementing consistent process mining algorithms in a 'big data' context. In particular, we exploit two procedures: the first computes the mismatch among the data sources to be integrated, and the second uses the mismatch values to extend the data to be processed with a traditional MapReduce algorithm.

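The mismatch computation is specific to the paper, but the map/reduce side of process mining can be grounded with a standard primitive: counting directly-follows activity pairs from event traces. The trace format below is an assumed simplification:

```python
# Process-mining primitive: count "directly-follows" activity pairs from
# event traces, written as a map phase (emit consecutive pairs per trace)
# and a reduce phase (sum the counts across traces).
from collections import Counter
from itertools import chain

def map_trace(trace):
    """Emit (a, b) for every consecutive activity pair in one trace."""
    return list(zip(trace, trace[1:]))

def reduce_counts(pair_streams):
    """Sum pair counts across all mapper outputs."""
    return Counter(chain.from_iterable(pair_streams))

traces = [
    ["register", "check", "approve", "archive"],
    ["register", "check", "reject", "archive"],
    ["register", "check", "approve", "archive"],
]
dfg = reduce_counts(map_trace(t) for t in traces)
for (a, b), n in dfg.most_common():
    print(f"{a} -> {b}: {n}")
```
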
  • Towards Cloud-Based Analytics-as-a-Service (CLAaaS) for Big Data Analytics in the Cloud

    Publication Year: 2013, Page(s): 62 - 69
    Cited by: Papers (1)

    Data analytics has proven its importance in knowledge discovery and decision support in different data and application domains. Big data analytics poses a serious challenge in terms of the necessary hardware and software resources. Cloud technology today offers a promising solution to this challenge by enabling ubiquitous and scalable provisioning of computing resources. However, further challenges remain to be addressed, such as the availability of the required analytic software for various application domains, estimation and subscription of the resources necessary for an analytic job or workflow, management of data in the cloud, and the design, verification, and execution of analytic workflows. We present a taxonomy for analytic workflow systems to highlight the important features in existing systems. Based on the taxonomy and a study of existing analytic software and systems, we propose the conceptual architecture of CLoud-based Analytics-as-a-Service (CLAaaS), a big data analytics service provisioning platform in the cloud. We outline the features that are important for CLAaaS as a service provisioning system, such as user- and domain-specific customization and assistance, collaboration, a modular architecture for scalable deployment, and Service Level Agreements.

  • Scalable and Trustworthy Cross-Enterprise WfMSs by Cloud Collaboration

    Publication Year: 2013, Page(s): 70 - 77

    Establishing scalable and cross-enterprise workflow management systems (WfMSs) in the cloud requires the adaptation and extension of existing concepts for process management. This paper proposes a scalable and cross-enterprise WfMS with a multitenancy architecture; in particular, it enables the enactment of workflow processes through cloud collaboration. We do not employ traditional engine-based WfMSs. The key idea is to make each workflow process instance self-protecting, so that no workflow engine is needed to secure the data therein. Thus, process instance discovery and activity execution can be fully independent and distributed. As a result, we can employ the data storage system Bigtable to store the process instances, which may grow to big-data scale. Applying element-wise encryption and chained digital signatures satisfies the major security requirements of authentication, confidentiality, data integrity, and nonrepudiation.

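The chained-signature idea can be sketched with an HMAC chain: each activity record is bound to everything before it, so a tampered step breaks verification of all later steps. HMAC here is a simplified stand-in for the paper's digital signatures, and the hard-coded key is for illustration only:

```python
# Self-protecting process instance: each activity record is chained to the
# previous one via an HMAC over (previous tag + record), so tampering with
# any step invalidates the verification of every later step. HMAC stands in
# for asymmetric signatures; key management is out of scope for this sketch.
import hmac, hashlib

KEY = b"demo-key-not-for-production"

def chain_tag(prev_tag, record):
    return hmac.new(KEY, prev_tag + record.encode(), hashlib.sha256).digest()

def sign_instance(records):
    tags, tag = [], b"\x00" * 32              # genesis tag
    for rec in records:
        tag = chain_tag(tag, rec)
        tags.append(tag)
    return tags

def verify_instance(records, tags):
    tag = b"\x00" * 32
    for rec, expected in zip(records, tags):
        tag = chain_tag(tag, rec)
        if not hmac.compare_digest(tag, expected):
            return False
    return True

steps = ["order:received", "credit:checked", "order:approved"]
tags = sign_instance(steps)
print(verify_instance(steps, tags))           # True
steps[1] = "credit:skipped"                   # tamper with one activity
print(verify_instance(steps, tags))           # False
```
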
  • A Bandwidth-Conscious Caching Scheme for Mobile Devices

    Publication Year: 2013, Page(s): 78 - 85
    Cited by: Papers (1)

    While a substantial amount of big data is consumed via mobile devices, accessing content over wireless data connections on mobile devices has its own set of challenges. Among these challenges, speed of data transfer is usually the first priority. Although there are many fast data connections available for Web surfing (3G, LTE, etc.), actual connection speed can vary significantly among regions, and fast connections may not always be available. As a result, the user experience of viewing information varies with the type of data connection in different locations. This paper proposes utilising the type of data connection (bandwidth) to determine whether a dataset needs to be cached or pre-fetched to reduce response time, thereby providing a better user experience. The role of the mobile device owner forms the basis of the dataset construction criteria, using the technique of role mining. As mobile devices are not confined to a particular space, an effort to trace the owner's movement determines where the owner and device are heading. This helps to identify the connection speed patterns along the owner's path, so that different caching or pre-fetching strategies can be deployed beforehand, aiming for a consistent quality of service.

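The core decision — cache or pre-fetch according to the current connection and the connectivity predicted along the owner's path — can be sketched as a small policy. The connection speeds, thresholds, and strategy names below are invented for the example:

```python
# Bandwidth-conscious policy: pick a caching/pre-fetching strategy from the
# current connection and the predicted connectivity along the user's route.
# Connection speeds and thresholds are illustrative assumptions.
SPEED_MBPS = {"2G": 0.1, "3G": 2.0, "LTE": 20.0, "WiFi": 50.0}

def choose_strategy(current_conn, upcoming_conns, dataset_mb):
    """Prefetch aggressively before entering a predicted slow zone."""
    now = SPEED_MBPS[current_conn]
    worst_ahead = min(SPEED_MBPS[c] for c in upcoming_conns)
    if worst_ahead < 1.0 and now >= 10.0:
        return "prefetch-all"        # fast now, slow ahead: fetch everything
    if dataset_mb / now > 5.0:       # rough seconds-to-load estimate
        return "cache-on-first-use"
    return "fetch-on-demand"

# A commuter about to leave WiFi coverage for a 2G stretch:
print(choose_strategy("WiFi", ["LTE", "2G", "2G"], dataset_mb=40))
```
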
  • Towards a Quality-centric Big Data Architecture for Federated Sensor Services

    Publication Year: 2013, Page(s): 86 - 93

    As the Internet of Things (IoT) paradigm gains popularity, the next few years will likely witness the 'servitization' of domain sensing functionalities. We envision a cloud-based ecosystem in which high-quality data from large numbers of independently managed sensors is shared or even traded in real time. Such an ecosystem will necessarily have multiple stakeholders, such as sensor data providers, domain applications that utilize sensor data (data consumers), and cloud infrastructure providers, who may collaborate as well as compete. While there has been considerable research on wireless sensor networks, the challenges involved in building cloud-based platforms for hosting sensor services are largely unexplored. In this paper, we present our vision for a data quality (DQ)-centric big data infrastructure for federated sensor service clouds. We first motivate our work with real-world examples and outline the key features that federated sensor service clouds need to possess. The paper proposes a big data architecture in which DQ is pervasive throughout the platform; the architecture includes a markup language, SDQ-ML, for describing sensor services and for domain applications to express their sensor feed requirements. The paper explores the advantages and limitations of current big data technologies in building the various components of the platform, and we outline our initial ideas towards addressing the limitations.

  • Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores

    Publication Year: 2013, Page(s): 94 - 101
    Cited by: Papers (1)

    The emergence of many interlinked, physically distributed, and autonomously maintained RDF stores offers unprecedented opportunities for predictive modeling and knowledge discovery from such data. However, existing machine learning approaches are limited in their applicability because it is neither desirable nor feasible to gather all of the data in a centralized location for analysis, due to access, memory, bandwidth, and computational restrictions, and sometimes privacy and confidentiality constraints. Against this background, we consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, we: (i) introduce statistical-query-based formulations of several representative algorithms for learning classifiers from RDF data, (ii) introduce a distributed learning framework to learn classifiers from multiple interlinked RDF stores that form a chain, (iii) identify three special cases of RDF data fragmentation and describe effective strategies for learning predictive models in each case, (iv) consider a novel application of a matrix reconstruction technique from the field of computerized tomography [1] to approximate the statistics needed by the learning algorithm from projections using count queries, thus dramatically reducing the amount of information transmitted from the remote data sources to the learner, and (v) report results of experiments with a real-world social network data set (Last.fm), which demonstrate the feasibility of the proposed approach.

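The statistical-query formulation reduces learning to answering count queries. The sketch below derives class-conditional probabilities (the building block of a naive Bayes classifier) purely through a count() interface; the in-memory list stands in for a remote RDF store, and the query signature is an assumed simplification:

```python
# Statistical-query-style learning: the learner never touches raw records,
# only answers to count queries, from which model parameters are derived.
# `count()` stands in for a COUNT query against a remote (RDF) endpoint.
data = [  # (genre, likes) pairs standing in for triples in a remote store
    ("rock", "yes"), ("rock", "yes"), ("jazz", "no"),
    ("jazz", "yes"), ("rock", "no"), ("jazz", "no"),
]

def count(genre=None, likes=None):
    """The only access the learner has to the data (a COUNT query)."""
    return sum((genre is None or g == genre) and
               (likes is None or lk == likes)
               for g, lk in data)

def p_likes_given_genre(genre, smoothing=1.0):
    """P(likes=yes | genre) from two count queries, Laplace-smoothed."""
    yes = count(genre=genre, likes="yes") + smoothing
    total = count(genre=genre) + 2 * smoothing
    return yes / total

for g in ("rock", "jazz"):
    print(g, round(p_likes_given_genre(g), 3))
```
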
  • Multi-resolution Social Network Community Identification and Maintenance on Big Data Platform

    Publication Year: 2013, Page(s): 102 - 109

    Community identification in social networks is of great interest, and with dynamic changes to a network's graph representation and content, the incremental maintenance of communities poses significant computational challenges. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k values, so that multiple community resolutions are represented by multiple k-core graphs. We then present distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable big data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedup from maintaining the multi-k-core incrementally rather than reconstructing it from scratch. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on different topics in rich social network content simultaneously.

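The k-core metric at the heart of the multi-resolution representation can be computed with the classic peeling algorithm. The single-machine sketch below evaluates it at several k values; the paper's distributed, incremental maintenance on HBase is not attempted here:

```python
# k-core via peeling: repeatedly remove vertices of degree < k; whatever
# remains (possibly nothing) is the k-core. Projecting the same graph at
# multiple k values yields the multi-resolution view the abstract describes.
def k_core(adj, k):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for v in [v for v, ns in adj.items() if len(ns) < k]:
            for u in adj.pop(v):                  # detach v from neighbors
                adj[u].discard(v)
            changed = True
    return adj

graph = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
for k in (1, 2, 3):   # multiple resolutions, as in the multi-k-core idea
    print(k, sorted(k_core(graph, k)))
```
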
  • Online Association Rule Mining over Fast Data

    Publication Year: 2013, Page(s): 110 - 117
    Cited by: Papers (2)

    To extract useful and actionable information in real time, the information technology (IT) world is coping with big data problems today. In this paper, we present implementation details and performance results for ReCEPtor, our system for "online" Association Rule Mining (ARM) over big and fast data streams. Specifically, we added Apriori and two different FP-Growth algorithms inside the Esper Complex Event Processing (CEP) engine and compared their performance using data from the LastFM social music site. Our most important findings show that online ARM can generate (1) more unique rules, (2) with higher throughput, and (3) much sooner (with lower latency) than offline rule mining. In addition, we have found many interesting and realistic musical preference rules, such as "George Harrison → Beatles". We demonstrate a sustained rate of ~15K rows/sec per core. We hope that our findings can shed light on the design and implementation of other fast data analytics systems in the future.

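A rule such as "George Harrison → Beatles" can be derived online from a stream of per-user item baskets by maintaining item and pair counts and checking support and confidence thresholds. The sketch below is a drastic simplification of CEP-based ARM; the thresholds and the absence of windowing are assumptions for the example:

```python
# Online one-to-one association rules: maintain item and pair counts as
# baskets stream in, then emit "A -> B" whenever support and confidence
# thresholds are met. No windowing or CEP engine, unlike the real system.
from collections import Counter
from itertools import combinations

item_counts, pair_counts, n_baskets = Counter(), Counter(), 0
MIN_SUPPORT, MIN_CONF = 2, 0.8

def observe(basket):
    """Update item and pair counts for one incoming basket."""
    global n_baskets
    n_baskets += 1
    items = sorted(set(basket))
    item_counts.update(items)
    for a, b in combinations(items, 2):
        pair_counts[(a, b)] += 1

def rules():
    """Yield all one-to-one rules currently above the thresholds."""
    for (a, b), n in pair_counts.items():
        if n < MIN_SUPPORT:
            continue
        if n / item_counts[a] >= MIN_CONF:
            yield f"{a} -> {b} (conf {n / item_counts[a]:.2f})"
        if n / item_counts[b] >= MIN_CONF:
            yield f"{b} -> {a} (conf {n / item_counts[b]:.2f})"

for basket in (["George Harrison", "Beatles"],
               ["George Harrison", "Beatles", "Wings"],
               ["Beatles", "Wings"]):
    observe(basket)
print(list(rules()))   # includes "George Harrison -> Beatles (conf 1.00)"
```
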