UnifyDR: A Generic Framework for Unifying Data and Replica Placement

The advent of (big) data management applications operating at Cloud scale has led to extensive research on the data placement problem. The key objective of data placement is to obtain a partitioning (possibly allowing for replicas) of a set of data-items into distributed nodes that minimizes the overall network communication cost. Although replication is intrinsic to data placement, the two have seldom been studied together. On the contrary, most existing solutions treat them as two independent problems and employ a two-phase approach: (1) data placement, followed by (2) replica placement. We address this by proposing a new paradigm, CDR, with the objective of combining data and replica placement as a single joint optimization problem. Specifically, we study two variants of the CDR problem: (1) CDR-Single, where the objective is to minimize the communication cost alone, and (2) CDR-Multi, which performs a multi-objective optimization to also minimize traffic and storage costs. To unify data and replica placement, we propose a generic framework called UnifyDR, which leverages overlapping correlation clustering to assign a data-item to multiple nodes, thereby facilitating joint data and replica placement. We establish the generic nature of UnifyDR by portraying its ability to address the CDR problem in two real-world use-cases: join-intensive online analytical processing (OLAP) queries and a location-based online social network (OSN) service. The effectiveness and scalability of UnifyDR are showcased by experiments performed on data generated using the TPC-DS benchmark and a trace of the Gowalla OSN for the OLAP-query and OSN-service use-cases, respectively. Empirically, the presented approach obtains an improvement of approximately 35% in terms of the evaluated metrics and a speed-up of 8 times in comparison to state-of-the-art techniques.

We live in an information age, where almost every day-to-day need of an individual is fulfilled by digitally enabled services. This digital revolution has led to an exponential increase in the scale of data, and today, many Internet-based services (e.g., Facebook, Netflix, etc.) offer data at a never-before-seen scale [10], [23]. Although advancements in enabling technologies such as big data and cloud computing have provided us with the necessary machinery and systems (e.g., Apache Hadoop [39] and Spark [46]) to perform data management at scale, effective strategies for data placement and partitioning remain crucial for ensuring the performance of such systems [15]. Consequently, the field of data placement has witnessed a vast amount of research over the past two decades [3], [4], [17], [30], [36], [41], [43], [45], [48], [49].
To better motivate the need for scalable solutions to the data placement problem, we consider two popular application domains: (1) online analytical processing (OLAP), and (2) online social networks (OSNs). OLAP is a computing paradigm for exploration and knowledge discovery from large data warehouses, thereby being a cornerstone of business intelligence and analytics [8]. Since data warehouses are usually stored in a distributed manner across multiple nodes, successful execution of OLAP queries requires inter-node transfer of database tables. It is intuitive that identifying a placement of database tables in distributed nodes that reduces inter-node table transfer during query execution reduces to an instance of the data placement problem. Moving on to the second application domain, OSN services are the most popular Internet-based services in today's world [43]. While Facebook and WhatsApp are used by individuals to communicate with their friends across the globe, Twitter has become the most preferred channel for disseminating information such as traffic news, emergency services, etc., to a large audience. Owing to the usage of OSN services at a global scale, the data of OSN users is usually stored in geographically distributed nodes. Thus, if a user wants to mention or access the profile of one of her friends, the profile-specific data of the latter has to be transferred to a node closest (in geographical distance) to the former. Hence, even for the OSN use-case, identifying a placement of user data that minimizes the inter-node migrations triggered by profile visits or user mentions reduces to an instance of the data placement problem. Note that in both aforementioned applications, replication is required to ensure fault tolerance, while also facilitating a reduction in communication cost.
It is evident from the above discussion that both data and replica placement are important for scalable data management. Moreover, replication, or replica placement, is intrinsic to data placement, which is substantiated by a careful examination of the generalized data placement problem [17] and its objectives. What is more, replication-oblivious data placement is a specific instance of the generic data placement problem. Therefore, both data and replica placement should be treated as objectives of a single joint optimization problem. However, although the field of data placement has witnessed significant advancements [3], [17], [43], to the best of our knowledge, none of the existing techniques is capable of performing data and replica placement jointly. On the contrary, most of the existing techniques treat the two placement steps as independent problems and perform data placement followed by replica placement (Fig. 1). A key limitation of this ad hoc two-phase approach is that it results in solutions of inferior quality.
To address the aforementioned limitation, in this article, we propose a unified paradigm that combines data and replica placement, called CDR, as a joint optimization problem. Motivated by the aforementioned applications, we study two variants of the CDR problem. Specifically, as discussed previously, the goal of data placement for the OLAP use-case is to minimize inter-node database table migration during the execution of OLAP queries. Since the optimization is concerned with minimizing a single objective, i.e., the communication cost, we formally denote the problem as CDR-Single. Recall that since OSN services usually operate at a global scale, a placement that minimizes the communication cost alone by minimizing the inter-node data migrations cannot be deemed optimal. More specifically, since both user data and nodes are geographically distributed, factors such as inter-node latency, outgoing traffic, and storage costs, all of which differ significantly across geographically distributed nodes, need to be included in the optimization objective. Thus, we study a multi-objective optimization problem in the context of combined data and replica placement for OSN services, which is formally referred to as CDR-Multi. To solve both variants of the CDR problem, we propose a generic and unified framework, called UnifyDR, which leverages overlapping correlation clustering to address data and replica placement as a joint optimization problem. More specifically, overlapping clustering facilitates joint optimization of data and replica placement in a single step by allowing each data-item to be assigned to multiple nodes.
To summarize, we have comprehensively extended our previous work on combined data and replica placement [4] into a unified framework called UnifyDR. In addition to addressing the CDR problem for data-intensive OSN services in geographically distributed clouds (CDR-Multi) using overlapping clustering on hypergraphs, we solve the CDR problem for workflows originating in business analytics and intelligence, that of OLAP queries (CDR-Single), using graph-based overlapping clustering. This portrays the ability of UnifyDR to generalize across a wide variety of workflows originating from different real-world use-cases. Specifically, in contrast to [4], this work adds the following novel contributions and extensions: (1) a new variant of the generic CDR problem, i.e., CDR-Single (Sec. III-B); (2) a generic framework, UnifyDR, to address the CDR problem under diverse settings (Sec. IV); (3) a new algorithm for solving CDR-Single using overlapping clustering on graphs (Algorithm 1 and Sec. V); and (4) new empirical evaluations on data generated using the TPC-DS benchmark (Sec. VI-A).

Contributions.
• The study of a novel paradigm of combined data and replica placement, CDR, with the objective of unifying the aforementioned placement tasks into a single joint optimization problem (Sec. III). Motivated by two different real-world use-cases, OLAP and OSN services respectively, we study two variants of the CDR problem, called CDR-Single and CDR-Multi.
• A generic and unified framework, UnifyDR (Sec. IV), capable of solving the CDR problem in a single unified step as opposed to the traditional two-step process practiced by existing state-of-the-art methods. UnifyDR can also solve the CDR problem for workflows generated from different real-world applications, which is portrayed in this article using the OLAP (CDR-Single) and OSN (CDR-Multi) use-cases.
• A novel algorithm based on overlapping correlation clustering, which allows each data-item to be assigned to multiple nodes (Sec. V). The proposed algorithm performs a single-objective optimization for the OLAP use-case using overlapping clustering on graphs, while overlapping clustering on hypergraphs is employed for the OSN use-case to perform a multi-objective optimization minimizing latency, node span, inter-node traffic, and storage cost.
• An extensive experimental study on data simulated based on the TPC-DS decision-support benchmark and a trace of the Gowalla social network (Sec. VI), showcasing the effectiveness and scalability of the proposed UnifyDR framework and its associated CDR placement algorithm in solving the CDR-Single and CDR-Multi problems.

II. RELATED WORK
In the past decade, the data placement problem has witnessed extensive research, with a wide variety of techniques developed for different execution environments, namely distributed computing [9], [17], grid computing [13], [26], [27], and cloud computing [16], [19], [30], [44]. Initially, the focus of these works was on relational workloads such as database joins [17] and scientific workloads [14], [31], [45]; recently, however, the focus has shifted towards workloads emanating from specialized applications such as OSN services [21], [24] and data-intensive services in geo-distributed clouds [1], [41]–[43], [47]. Given that our focus in this work is to combine data and replica placement into a single joint optimization problem, we only review the existing literature on data placement that is directly related to our work.
The two main capabilities required to address the geo-distributed data placement problem are to capture and exploit (1) data-item–data-item associations and (2) data-item–node associations. The former is measured as the frequency of co-occurrence of two or more data-items, whereas the latter is calculated using the frequency of occurrence of a data-item at a given node. While data-item–data-item associations were captured by methods relying on hierarchical clustering of data-item correlations [48], [49] and frequent pattern mining [33], the literature has also witnessed techniques [1], [22], [35], [47] capable of capturing data-item–node associations. Volley [1], a system proposed by Agarwal et al., performs automatic data placement in geo-distributed nodes based on co-occurrence information mined from the server logs of node requests. Rochman et al. [35] design algorithms that are not only cost efficient but also capable of serving a significant portion of user requests within the region where they are raised, along with the ability to manage the dynamic behavior of user requests. It is important to simultaneously honor the node storage capacities and minimize the data communication costs; to this end, Zhang et al. [47] propose an algorithm based on integer programming.
The literature also includes works focusing on other aspects of the geographically distributed data placement problem, such as the development of specialized strategies for replication and data placement in multi-clouds. Shankarnarayanan et al. [37] proposed strategies for location-aware replica placement in order to minimize inter-node communication costs and other node-location-specific metrics. Shifting the focus of our discussion to multi-cloud environments, Jiao et al. [24] present a technique that takes multiple optimization objectives, such as inter-cloud traffic and carbon footprint, into consideration to perform data placement in multi-clouds. Later, Han et al. [21] proposed an algorithm for OSN service data migration, which can adapt to variation in data traffic in multi-clouds.
However, none of the aforementioned techniques is capable of modeling both data-item–node and data-item–data-item associations.
Recently, methods based on hypergraph representations of data-items and nodes have been used extensively in the literature for data placement in geographically distributed clouds. Yu and Pan [41]–[43] introduce the use of hypergraph modeling and leverage a partitioning tool called PaToH [7] to design data placement algorithms for data-intensive services. On the one hand, hypergraph modeling helps to simultaneously capture both data-item–node and data-item–data-item associations. On the other hand, publicly available specialized heuristics for hypergraph partitioning [7] enable graceful scaling of the aforementioned methods to large datasets. Moving further, Atrey et al. [3], [5] proposed an algorithm based on spectral clustering of hypergraphs, which achieved quality similar to the algorithms proposed in [43] but superior efficiency and scalability, owing to the use of randomized eigendecomposition techniques for factorizing the hypergraph Laplacian.
Owing to their ability to capture both data-item–node and data-item–data-item associations, the methods proposed by Atrey et al. [3] and Yu and Pan [43] constitute the representative state-of-the-art for geo-distributed data placement of data-intensive services, and are therefore considered for empirical comparisons with UnifyDR in the OSN use-case.
(Hyper)graph-based solutions have also been popular for data placement of more traditional workloads such as scientific and relational workflows. The existence of a polynomial-time reduction of the data placement problem to an instance of the graph partitioning problem was proved in [17]. Furthermore, Golab et al. [17] proposed three algorithms, an optimal algorithm based on integer linear programming (ILP) and two heuristics for practically solving the problem on large-scale workloads, to perform data placement for join-intensive database queries and data-intensive scientific workflows by minimizing the overall network communication cost. With the objective of reducing the partitioning overhead, Quamar et al. [34] present SWORD, a lightweight and scalable approach for data placement of online transaction processing (OLTP) workloads. Specifically, SWORD performs data placement in two phases: hypergraph modeling and hash-partitioning-based compression of the constructed hypergraph are carried out in the first phase, followed by partitioning of the compressed hypergraph in the second phase.
Owing to their scalability and effectiveness, the techniques Metis and Metis+H2 proposed by Golab et al. [17] constitute the representative state-of-the-art for data placement of traditional data-intensive workloads, and are therefore considered for empirical comparisons with UnifyDR in the OLAP use-case.
As the above discussion shows, all the existing state-of-the-art techniques lack the capability of jointly solving data placement and replication. On the contrary, these techniques treat the two placement steps as independent problems and perform data placement followed by replica placement, thereby producing solutions of inferior quality. To the best of the authors' knowledge, the research presented in this article is the first effort towards combining data and replica placement (CDR) into a single joint optimization problem, which is solved using overlapping correlation clustering on graphs (for the OLAP use-case) and hypergraphs (for the OSN use-case). More specifically, overlapping clustering facilitates joint optimization of data and replica placement in a single step by allowing each data-item to be assigned to multiple nodes. To summarize, the UnifyDR framework provides an elegant solution to both variants of the CDR problem, i.e., CDR-Single and CDR-Multi.

III. COMBINED DATA AND REPLICA PLACEMENT
Although the CDR paradigm is generically applicable to myriad settings, we motivate CDR-Single specifically through OLAP join queries and CDR-Multi through location-based OSN services. We first introduce the basic terminology related to data and replica placement. In large-scale systems, it is common for data-items to be placed across geographically distributed nodes. Naturally, migration of some data-items may be required for proper execution of various tasks. A data-request pattern comprises the data-items that require such migrations. Formally:

Definition 3 (Data-Request Patterns (R)): A data-request pattern R ∈ R comprises a set of data-items D ⊆ D that are required to be present together in a single node N_j for a given task to be executed. The data-items d_i ∈ D that are not stored in N_j are transferred from the nodes in which they are stored to N_j. The set of data-request patterns, denoted as R, represents the system workload.
In addition to distributing data across nodes, real-world systems usually store multiple replicas of each data-item as well. This is because replication helps ensure fault tolerance, while also facilitating a reduction in communication cost and retrieval latency by potentially allowing a data-item to be retrieved from a geographically closer node.
Definition 4 (Replication Factor (r)): The replication factor r is defined as the number of replicas stored for each data-item.
Given the replication factor, a set of nodes, data-items, and data-request patterns as input, the objective of CDR is to partition the set of data-items, allowing for replication wherever applicable, across distributed nodes in order to minimize the overall communication cost emanating from the migration/replication of data-items corresponding to different data-requests. At this juncture, we would like to clarify that the CDR placement algorithm presented in this work considers the system workload to be static. More specifically, any change (small or large) in the system workload would require re-execution of the full pipeline to obtain the placement output. This design decision is in line with almost every existing technique [3]–[5], [16], [17], [43], [49] in the extensive literature on data placement. Thus, making the CDR placement algorithm dynamically adapt to changes in the system workload is beyond the scope of the current work.
Having defined the basics, we next discuss concepts specific to OLAP join queries and location-based OSN services, portray their relationship to the CDR problem, and formally define the CDR-Single and CDR-Multi problems.

B. CDR-SINGLE
As discussed in Section I, many OLAP queries involve database joins. A sample join query on a database comprised of four tables partitioned across four servers is portrayed in Fig. 2. There are two central aspects pertaining to OLAP queries: (1) a database of tables, where each table contains information specific to a real-world entity, and (2) an execution engine that allows users to submit and execute analytical queries.

Definition 5 (Database (D(T))): A database D(T) is a collection of information tables T with |T| = n, where each table consists of one or more attributes.
In the context of the OLAP use-case, each database table corresponds to a data-item (Def. 1). Thus, the set D contains n data-items, one for each table t ∈ T of the database, where the data-item corresponding to a table t is denoted as d(t).
Moving ahead, queries allow end-users to perform a variety of analytical tasks, such as computing the average quarterly sales and profit per product category. Such queries might require information from multiple database tables (often partitioned across servers), thereby triggering a data-request pattern (Def. 3) comprised of the tables that need to be migrated for proper query execution. To enable efficient query execution, naturally, each query is executed at the node N_j ∈ N (Def. 2) that requires the least number of tables to be migrated.
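This node-selection rule can be sketched as follows; this is our own illustration (the table layout, node names, and helper function are made up, not the paper's implementation):

```python
# Hypothetical sketch: choose the execution node for a query as the node
# that already stores the most of the required tables, i.e., the node
# needing the fewest table migrations. Names and data are illustrative.

def execution_node(query_tables, node_contents):
    """Return the node requiring the least number of table migrations."""
    return min(node_contents,
               key=lambda n: len(set(query_tables) - node_contents[n]))

# node -> set of tables currently stored there (assumed layout)
nodes = {"A": {"t1"}, "B": {"t2", "t3"}, "C": {"t2", "t3", "t4"}}
print(execution_node(["t2", "t3", "t4"], nodes))  # "C": zero migrations
```

Ties between equally good nodes are broken arbitrarily here; a real system would likely also account for node load and locality.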
Thus, for a query Q_k, the triggered data-request pattern is R(Q_k) = {d(t) | t ∈ QS(Q_k)}, where QS(Q_k) denotes the set of tables required for executing the query Q_k. Given this information, we formally define a query as follows.

Definition 6 (Query (Q)): A query Q_k ∈ Q comprises a data-request pattern R(Q_k) triggered from a node N_j capable of serving user requests. The set Q contains η queries and represents the system workload. For example, the sample query Q_1 portrayed in Fig. 2 would be executed at the node N_3 (i.e., Server C), and would trigger a data-request pattern R(Q_1) comprised of the data-items of the tables referenced by Q_1.

Next, we provide a formal description of the CDR-Single problem, which is stated as follows.
Problem (CDR-Single): Given a set of n data-items D corresponding to the set of database tables T; η user queries Q_k ∈ Q representing the system workload, where each query comprises a data-request pattern R(Q_k) originating from a node N_j ∈ N; and the replication factor r; perform combined data and replica placement to minimize the average number of nodes spanned, S(R(Q_k)), by the data-items corresponding to the request pattern R(Q_k) of each query.
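To make the objective concrete, the following sketch (our own illustration, not the paper's algorithm) evaluates the average node span of a workload under a given replicated placement; since computing the exact minimum span is a set-cover problem, a greedy approximation is used:

```python
# Hypothetical sketch: evaluating the CDR-Single objective (average node
# span) for a candidate replicated placement. Names are illustrative.

def node_span(request, placement):
    """Greedy upper bound on the number of nodes needed to cover a request.

    `placement` maps each data-item to the set of nodes holding a copy
    (primary or replica).
    """
    uncovered = set(request)
    span = 0
    while uncovered:
        # pick the node covering the most still-uncovered data-items
        best = max({n for d in uncovered for n in placement[d]},
                   key=lambda n: sum(1 for d in uncovered if n in placement[d]))
        uncovered -= {d for d in uncovered if best in placement[d]}
        span += 1
    return span

def avg_span(requests, placement):
    return sum(node_span(r, placement) for r in requests) / len(requests)

placement = {"t1": {0}, "t2": {0, 1}, "t3": {1}}   # table -> replica nodes
requests = [["t1", "t2"], ["t2", "t3"], ["t1", "t3"]]
print(avg_span(requests, placement))  # spans 1, 1, 2 -> average 4/3
```

Note how replicating t2 on both nodes keeps the span of the first two requests at 1; this is exactly the benefit a joint data-and-replica placement can exploit.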

C. CDR-MULTI
As discussed in Section I, we study the CDR-Multi problem in the context of location-based OSNs. There are two central aspects pertaining to location-based OSN services: (1) a social network of users, with network connections indicating friend relationships, and (2) a list of check-ins triggered by the users of the OSN service visiting diverse locations across the globe. A sample social network of seven users with six check-ins registered at four different node locations is presented in Fig. 3.

Definition 7 (Social Network (G(V, E))): A social network with n individuals and m social ties can be denoted as a graph G(V, E), where V is the set of vertices representing the users of the social network, |V| = n, and E ⊆ V × V is the set of edges (representing friend relationships) between pairs of vertices, |E| = m.
For the OSN use-case, a data-item (Def. 1) corresponds to the most recent snapshot of a user's profile (e.g., comments, posts, profile picture, etc.).The data-item corresponding to a social network user v ∈ V is denoted as d(v), and there exists a total of n data-items (one for each social network user) in the set D.
Moving further, check-ins characterize the behavior of OSN users visiting different places in the world. A user check-in usually consists of two parts: (1) a geographic location in the world where the user registered the check-in, and (2) a data-request pattern triggered in response to the user check-in. Note that the location recorded for a user check-in may be different from the location where the check-in was registered. This is because, for each check-in, the recorded location is that of a node (Def. 2) which is closest (in geographical distance) to the location where the user registered the check-in. Thus, for OSN services, each node N_j ∈ N possesses a location attribute N_j.loc. The node locations are represented using the set L, where L_j = N_j.loc for N_j ∈ N, resulting in a total of |L| = l node locations.
Moreover, the data-request pattern R(v) ∈ R (Def. 3) triggered by a check-in from a user v at a node N_j is composed of the data-items corresponding to each of her friends. This is because a user may want to tag/mention some of her friends while registering a check-in. Mathematically, R(v) = {d(u) | u ∈ Adj(v)}, where Adj(v) is the set of all the friends of the user v. Next, we provide a formal description of check-ins, which is stated as follows.
Definition 8 (Check-Ins (C)): A check-in is a tuple C_k = (R(v), L_j), comprised of a data-request pattern R(v) and a node location L_j ∈ L. In other words, the check-in C_k = (R(v), L_j) by a user v at a location L_j signifies a request for the data-items contained in R(v), triggered from the node N_j located at L_j = N_j.loc. Considering the example presented in Fig. 3, if the first check-in was registered by the OSN user v_5 at the node N_3, it would be represented as C_1 = (R(v_5), L_3) with L_3 = N_3.loc.

It should be noted that each check-in, even if it is registered by the same OSN user at the same location, is considered unique, i.e., different from all other check-ins. This is required for modeling data-item–node and data-item–data-item associations appropriately. For instance, suppose the OSN user v_7 visited Frankfurt and Sydney 7 and 2 times, respectively. Intuitively, the association of the data-items contained in the data-request pattern R(v_7) is relatively stronger with the node located in Frankfurt when compared to the one located in Sydney. This behavior is properly modeled by registering 7 different check-ins, numbered C_k, ..., C_{k+6}, for the OSN user v_7 with the data-request pattern R(v_7) at N_3 with L_3 = N_3.loc = Frankfurt. Similarly, 2 different check-ins, C_{k+7} and C_{k+8}, for the same user v_7 are registered at N_2 with L_2 = N_2.loc = Sydney. In the same vein, the data-item–data-item association between two data-items that co-occur more frequently in data-request patterns (say, five times for data-items d(v_3) and d(v_4)) would be stronger when compared to that of data-items that are requested together rarely (say, only once for data-items d(v_4) and d(v_5)). The aforementioned discussion provides substantive evidence in favor of our design choice of not indexing user check-ins uniquely by their data-request patterns R and check-in locations L_j.
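The strengthening effect of repeated check-ins can be sketched as follows; this is a toy illustration with a made-up friend list, not the paper's data model:

```python
# Toy illustration: repeated check-ins accumulate stronger data-item–node
# and data-item–data-item association counts (the v7 example above).
from collections import Counter
from itertools import combinations

friends = {"v7": ["v1", "v4"]}                     # Adj(v7), made up
checkins = [("v7", "Frankfurt")] * 7 + [("v7", "Sydney")] * 2

item_node = Counter()
item_item = Counter()
for user, loc in checkins:
    request = [f"d({u})" for u in friends[user]]   # R(v7) = {d(v1), d(v4)}
    for d in request:
        item_node[(d, loc)] += 1                   # data-item–node count
    for a, b in combinations(sorted(request), 2):
        item_item[(a, b)] += 1                     # data-item–data-item count

print(item_node[("d(v1)", "Frankfurt")])  # 7: stronger association
print(item_node[("d(v1)", "Sydney")])     # 2: weaker association
```

These raw counts are exactly the associations that the hyperedge weights of Section IV are built from.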
Having discussed the concepts specific to OSN services, the CDR-Multi problem is formally defined as follows.

Problem (CDR-Multi): Given a set of n data-items D corresponding to the set of social network users V; ρ user check-ins C_k ∈ C triggered from the nodes in N, representing the system workload, where each check-in comprises a data-request pattern R(v) originating from a node located at L_j; a set N of l nodes, with the per-unit cost of outgoing traffic γ(N_j) for each node N_j ∈ N, the per-unit storage cost σ(N_j) for each node N_j ∈ N, the (directed) inter-node latency κ(N_j, N_j') for each pair of nodes N_j, N_j' ∈ N, and the average number of nodes spanned by the data-items corresponding to each request pattern R(v) being S(R(v)); and the replication factor r; perform combined data and replica placement to minimize the optimization objective O, which is defined as the weighted average of γ(•), σ(•), κ(•, •), and S(•).
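Assuming the four cost terms have been pre-normalized to comparable scales, the objective O can be sketched as a simple weighted average; the weight values below are illustrative, not the paper's:

```python
# Minimal sketch of the CDR-Multi objective O as a weighted average of
# (assumed pre-normalized) traffic, storage, latency, and span terms.

def cdr_multi_objective(traffic, storage, latency, span, omega):
    """Weighted average of the four normalized cost terms."""
    terms = (traffic, storage, latency, span)
    return sum(w * t for w, t in zip(omega, terms)) / sum(omega)

# equal weights: plain average of the four normalized terms
print(cdr_multi_objective(0.4, 0.2, 0.3, 0.5, (1, 1, 1, 1)))  # 0.35
```

Changing the weight vector shifts the trade-off, e.g. up-weighting the traffic term favors placements near the cheapest egress nodes.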

IV. UnifyDR
In this section, we provide a description of the UnifyDR framework, its core components, and the underlying combined data and replica placement algorithm.An architectural overview of the proposed UnifyDR framework is presented in Fig. 4.
We begin by providing a description of the building blocks of the UnifyDR framework.
• Construct Graph. The module responsible for constructing a binary graph adjacency matrix using the information about associations between database tables (data-items) manifested in the OLAP queries submitted by users. More specifically, the data-item–data-item association between two database tables co-occurring in a join query is modeled using an edge between them in the constructed graph.
• Calculate Edge Weights. This module employs the query set characteristics to assign weights to the edges constructed in the aforementioned step. Edge weights capture the strength of data-item–data-item associations, thereby appropriately accounting for the contribution of each edge towards minimizing the objective of CDR-Single.
• Construct Hypergraph. The module responsible for constructing a binary hypergraph incidence matrix using the information about a variety of associations between OSN users (data-items) and nodes manifested in the check-ins registered by these users. More specifically, the data-item–data-item association between OSN users co-occurring in a data-request pattern triggered by a user check-in is modeled using a hyperedge connecting the corresponding data-items in the constructed hypergraph. In the same vein, a data-item–node association is modeled using a hyperedge connecting the data-item with the node location where the data-item was requested, based on the user check-in.
• Calculate Hyperedge Weights. This module is responsible for assigning weights to each hyperedge of the hypergraph constructed in the aforementioned step, based on user check-in behavior and node characteristics. Hyperedge weights capture the strength of data-item–node and data-item–data-item associations, thereby enabling accurate estimation of the contribution of each hyperedge towards the multi-objective optimization of CDR-Multi.
• Construct (Hyper)Graph Similarity Matrix. This module uses the (hyper)graph representation and the (hyper)edge weights to compute the similarity between each pair of vertices in the (hyper)graph, which is required for performing analytical operations such as clustering on (hyper)graphs. Mathematical details of this step for both graphs and hypergraphs are provided in Sec. V.
• Greedy Cluster Refinement. This module partitions the data-items into l nodes while allowing at most r replicas. Specifically, it iteratively refines the cluster assignment of each data-item while the cluster assignments of all the other data-items are kept fixed.
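The refinement loop can be sketched as follows. This is our own simplified rendition of the idea (the actual score function and update rule are given in Sec. V); the toy correlation-clustering score below is a made-up stand-in:

```python
# Simplified sketch of greedy cluster refinement for overlapping placement:
# each data-item holds up to r node labels; each pass re-optimizes one
# item's labels while all other assignments stay fixed.
import itertools

def refine(items, nodes, r, score, labels, passes=10):
    for _ in range(passes):
        changed = False
        for d in items:
            best = max(itertools.combinations(nodes, r),
                       key=lambda combo: score(d, set(combo), labels))
            if set(best) != labels[d]:
                labels[d] = set(best)
                changed = True
        if not changed:        # converged: no item changed its assignment
            break
    return labels

# toy score: +1 for co-locating similar items, -1 for dissimilar ones
sim = {frozenset(p): 1 for p in [("a", "b"), ("c", "d")]}
def score(d, combo, labels):
    return sum(sim.get(frozenset((d, o)), -1)
               for o in labels if o != d and labels[o] & combo)

labels = refine(["a", "b", "c", "d"], [0, 1], 1, score,
                {"a": {0}, "b": {1}, "c": {0}, "d": {1}})
print(labels)  # similar pairs end up co-located: a with b, c with d
```

With r > 1 each item keeps multiple labels, which is precisely how overlapping clustering realizes replicas in a single optimization step.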

A. CDR-SINGLE: GRAPH MODELING
Data placement research has shown that graph partitioning can accurately optimize the objective of minimizing the number of nodes spanned by data-items partitioned across nodes [12], [17], [28]. Since the aim of CDR-Single is to minimize the number of nodes spanned during the execution of OLAP queries, graphs provide a powerful representation for modeling data-item–data-item associations in this case.
To this end, given a set of database tables D(T) corresponding to data-items, and a set of queries Q representing the system workload, we construct a graph G(V_G, E_G). There exists a vertex v ∈ V_G for each data-item d(t) ∈ D; thus, the vertex set V_G consists of |V_G| = n vertices. Furthermore, there exists an edge between each pair of data-items (corresponding to database tables) that co-occur in a join query, thereby capturing data-item–data-item associations. Recall that for each query Q_k ∈ Q, QS(Q_k) denotes the set of data-items (tables) that should co-exist for execution of the query. In other words, there should be an edge between each pair of vertices (corresponding to data-items) in QS(Q_k). With this, we formally define the edge set as E_G = {(d(t_i), d(t_j)) | d(t_i), d(t_j) ∈ QS(Q_k) for some Q_k ∈ Q, i ≠ j}. The graph G(V_G, E_G), with its n vertices and m edges, is represented using an n × n dimensional binary adjacency matrix A. An entry A_{i,j} = 1 indicates that there is an edge between the i-th and j-th vertex in the graph vertex set, while A_{i,j} = 0 indicates otherwise.
While each edge (u, v) ∈ E_G captures the association between two vertices u and v, not all associations are equally important; some are relatively more important than others. For example, two data-items d(t_1) and d(t_2) that co-occur in 10 join queries possess a stronger association than another pair of data-items d(t_1) and d(t_3) that co-occur only twice. Thus, cutting the edge between d(t_1) and d(t_2) should incur a higher cost than cutting the edge between d(t_1) and d(t_3). To capture this, we construct an edge weight matrix W_A of dimensionality n × n that encodes the relative importance of the edges. Each entry W_{A,i,j} records the number of times the data-items corresponding to the i-th and j-th vertices co-occurred in a join query Q_k ∈ Q.
In sum, the graph modeling step captures the interaction between data-items in the form of a graph adjacency matrix A, together with the edge weight matrix W_A representing the relative importance of the constructed edges.

B. CDR-MULTI: HYPERGRAPH MODELING
The suitability of hypergraphs for modeling the associations emanating from data-item–node and data-item–data-item interactions has been substantiated in a plethora of works [3], [4], [43] on geo-distributed data placement. As opposed to edges (in traditional graphs) that can only model pairwise relationships, hyperedges possess the capability of modeling multi-way relationships by connecting several vertices together using a single hyperedge. Thus, hypergraphs serve as a generalization of graphs, and being a more sophisticated construct, provide a solid representation for modeling data-item–node and data-item–data-item associations.
The set E_H of hyperedges, constructed based on the information manifested in the check-ins registered by the OSN users, consists of two types of hyperedges: (1) data-request pattern hyperedges (R), where for each data-request triggered corresponding to a user check-in, a hyperedge captures data-item–data-item associations by connecting all of its constituent data-items; and (2) data-item–node hyperedges (R_N), which capture data-item–node associations. We use a binary hypergraph incidence matrix He with n′ rows and m′ columns to represent the constructed hypergraph H(V_H, E_H). Moreover, each hyperedge is represented via an n′ × 1 binary column vector, and the hypergraph contains a total of m′ hyperedges. Mathematically, an entry He_{j,i} = 1 indicates that the j-th vertex in the hypergraph vertex set participates in the i-th hyperedge, while He_{j,i} = 0 indicates otherwise.
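The incidence matrix construction can be sketched as below (a simplified illustration; we encode each data-item–node hyperedge as a single (item, node) pair, whereas the paper builds one such hyperedge per data-item and check-in node, so the function name and encoding are our assumptions):

```python
def build_incidence(n_items, n_nodes, request_patterns, item_node_pairs):
    # Vertices 0..n_items-1 are data-items; vertices n_items..n'-1 are nodes.
    # Each hyperedge becomes one binary column of the incidence matrix He.
    n_prime = n_items + n_nodes
    columns = []
    for pattern in request_patterns:           # data-request hyperedges (R)
        col = [0] * n_prime
        for item in pattern:
            col[item] = 1
        columns.append(col)
    for item, node in item_node_pairs:         # data-item-node hyperedges (R_N)
        col = [0] * n_prime
        col[item] = 1
        col[n_items + node] = 1
        columns.append(col)
    # transpose the column list into an n' x m' incidence matrix
    return [list(row) for row in zip(*columns)]

He = build_incidence(3, 2, [[0, 1, 2]], [(0, 0), (2, 1)])
```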
Moving further, we discuss the hyperedge weight assignment mechanisms, which fall into two broad categories corresponding to the aforementioned hyperedge types. These hyperedge weights guide the optimization by focusing on different objectives of the underlying optimization problem. For instance, to minimize the node span S(R(v)), which is computed as the average number of node accesses for fetching the data-items requested in a data-request pattern R(v), the data-request pattern hyperedge weights W_R enforce the co-occurring data-items in R(v) to be placed in the same node by giving a higher weight to data-request pattern hyperedges. In the same vein, to minimize the outgoing traffic cost and storage cost of a node N_j, and the inter-node latency κ(N_j, N_j′), a higher preference is given towards placing data-items at the nodes from where they were requested more often, by employing data-item–node hyperedge weights corresponding to traffic, storage, and latency, respectively. Eventually, we perform a weighted sum of the four weighting mechanisms presented above, where the vector W controls the relative importance of the aforementioned hyperedge weight assignment strategies, to obtain the final diagonal hyperedge weight matrix W of size m′ × m′.
To summarize, the hypergraph modeling step produces a hypergraph incidence matrix He and a hyperedge weight matrix W, where the former models the higher-order interaction between data-items and nodes, while the latter controls the relative importance of the different hyperedges.

C. OVERLAPPING CORRELATION CLUSTERING
Overlapping clustering has been shown to possess applications in a wide variety of research areas: community detection [29], bioinformatics [6], and information retrieval [6].
The key feature of overlapping clustering is that it allows each data point to be assigned to more than one cluster, which is a natural requirement for multiple real-world applications. For example, it is highly likely for a user to be a part of more than one community. Having motivated the importance of overlapping clustering, we next provide a brief summary of the steps required to perform overlapping correlation clustering on (hyper)graphs.
The first step requires the construction of a (hyper)graph similarity matrix, where each entry corresponds to a similarity score between two vertices. Moving ahead, each vertex is randomly assigned to at most r different clusters. The next step is to greedily refine the cluster assignments of each vertex (given that the assignments of all other vertices are fixed), with the objective that the similarity between any two vertices agrees as much as possible with the similarity computed based on their cluster assignments. This two-step process, (1) (hyper)graph similarity computation, followed by (2) greedy cluster refinement, constitutes the proposed combined data and replica placement algorithm.

V. COMBINED DATA AND REPLICA PLACEMENT ALGORITHM
In this section, we first describe the overlapping correlation clustering algorithm in a detailed and formal manner. Next, we describe OverlapG and OverlapH, the combined data and replica placement algorithms proposed in this work.
Given the set of user queries Q and user check-ins C representing the system workload for the OLAP (CDR-Single) and the OSN (CDR-Multi) use-case, respectively, and the set of data-items D as input, the first step is to construct a graph (line 2 in OverlapG) or a hypergraph (line 8 in OverlapH). This is followed by the construction of the normalized graph (lines 3-4 in OverlapG) or hypergraph (lines 9-10 in OverlapH) similarity matrix, depending upon the use-case. The last step employs the proposed overlapping clustering algorithm (line 5 in OverlapG and line 11 in OverlapH) to assign each data-item d(t) or d(v) ∈ D to r < l nodes, thereby obtaining a partitioning of D while allowing for at most r replicas per data-item.

A. PRELIMINARIES
As a first step, we provide a description of correlation clustering. Given a normalized similarity matrix M_sim representing the pair-wise similarity between data-items, and a set N of l labels representing nodes as input, the task of correlation clustering is to find a mapping F : V → N for partitioning the set of data-items into l nodes that minimizes the following loss function:

L_Correlate(V, F) = Σ_{u,v ∈ V} |M_sim(u, v) − 1[F(u) = F(v)]|,

where 1[·] is the indicator function. Moving ahead, overlapping correlation clustering was introduced to relax the requirement of correlation clustering that each data-item be assigned to exactly one partition. This design choice is well-motivated. For instance, in social networks a user might be a part of multiple communities. In the context of data placement, replication of data-items is often required for ensuring fault-tolerance and obtaining a lower overall communication cost.
Having said that, we leverage this feature of overlapping clustering to obtain a partitioning of D into l nodes by assigning each data-item to multiple nodes, thereby allowing for replication. In overlapping clustering, this is achieved by mapping each data-item to a label set as opposed to a single label, where each label corresponds to a node. Given the label set definition as the set of all subsets of nodes N except the empty set, N^+ = 2^N \ {∅}, and a similarity function S(•) over the data-item label sets, the underlying optimization objective reduces to identifying a mapping F : V → N^+ under which the similarity between any pair of data-items u, v ∈ V, M_sim(u, v), agrees as much as possible with the similarity between their corresponding label sets S(F(u), F(v)).
Similar to the loss function for correlation clustering L_Correlate, the loss function for overlapping correlation clustering is defined as

L_Overlap(V, F) = Σ_{u,v ∈ V} |M_sim(u, v) − S(F(u), F(v))|,   (5)

where S(•) is defined as the set-intersection indicator function: S(A, B) = 1 if A ∩ B ≠ ∅, and S(A, B) = 0 otherwise. Formally, the goal of overlapping correlation clustering is to find a mapping F* that minimizes L_Overlap(V, F), i.e., F* = argmin_F L_Overlap(V, F). Note that for the OLAP use-case the underlying representation is a graph, hence the set of vertices is represented using V_G, while the normalized similarity matrix M_sim is represented using G_sim. Similarly, for the OSN use-case the set of vertices is represented using V_H, while the normalized hypergraph similarity matrix is denoted as H_sim.
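The set-intersection indicator and the resulting loss translate directly into code (a sketch with our own function names; M_sim is given as a symmetric matrix and F maps each vertex to its label set):

```python
def set_intersection_indicator(a, b):
    # S(A, B) = 1 if the two label sets share at least one node, else 0
    return 1.0 if set(a) & set(b) else 0.0

def overlap_loss(vertices, M_sim, F):
    # L_Overlap(V, F): total disagreement between pairwise similarities
    # and the similarities implied by the label-set assignment F
    loss = 0.0
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            loss += abs(M_sim[u][v] - set_intersection_indicator(F[u], F[v]))
    return loss

M = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
F = {0: {0}, 1: {0, 1}, 2: {2}}   # vertex 1 is replicated on nodes 0 and 1
loss = overlap_loss([0, 1, 2], M, F)
```

This assignment incurs zero loss: the similar pair (0, 1) shares node 0, while vertex 2 shares no node with either.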
As stated in Sec. IV, overlapping correlation clustering requires as input a similarity matrix denoting the similarity between each pair of vertices in the (hyper)graph. Therefore, we provide a formal description of the graph and hypergraph similarity matrix construction in the following sections.

B. CDR-SINGLE: NORMALIZED GRAPH SIMILARITY MATRIX
To construct the normalized graph similarity matrix G_sim, we first compute the vertex degree matrix D_vA from the graph adjacency matrix A. D_vA is a diagonal matrix of size n × n, which captures the number of adjacent vertices for each vertex in the graph. Mathematically, D_vA = diag(A · 1), i.e., the row-wise sums of A placed on the diagonal.
Note that methods for normalizing a similarity matrix were described in [32], [40] for constructing the graph Laplacian. To this end, we mathematically define the normalized graph similarity matrix G_sim as follows.
G_sim = D_vA^(−1/2) W_A D_vA^(−1/2),   (9)

where W_A is the n × n edge weight matrix. Thus, G_sim is an n × n matrix.
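A sketch of this normalization, assuming the symmetric Laplacian-style scaling D^(−1/2) W_A D^(−1/2) (the exact form of the paper's Eq. (9) may differ, and we take degrees as weighted row sums rather than neighbor counts):

```python
def normalized_graph_similarity(W_A):
    # Degrees taken as weighted row sums (an assumption; the paper's D_vA
    # counts adjacent vertices). Isolated vertices get scale 0.
    n = len(W_A)
    deg = [sum(row) for row in W_A]
    inv_sqrt = [d ** -0.5 if d > 0 else 0.0 for d in deg]
    # symmetric normalization: entry (i, j) = W[i][j] / sqrt(deg_i * deg_j)
    return [[inv_sqrt[i] * W_A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

G_sim = normalized_graph_similarity([[0, 2, 1], [2, 0, 0], [1, 0, 0]])
```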

C. CDR-MULTI: NORMALIZED HYPERGRAPH SIMILARITY MATRIX
A main requirement for constructing the normalized hypergraph similarity matrix H_sim is the computation of two diagonal degree matrices from the hypergraph incidence matrix He: (1) the vertex degree matrix D_v of size n′ × n′, which measures the number of hyperedges that are connected to each vertex, and (2) the hyperedge degree matrix D_he of size m′ × m′, which captures the number of vertices that are connected together by each hyperedge. The two degree matrices are mathematically defined as follows.
D_v = diag(He · 1) and D_he = diag(He^T · 1), where X^T represents the transpose of the matrix X and X · 1 denotes the vector of row-wise sums of X. Similar to the normalization procedure for a similarity matrix [32], [40], the normalization procedure for hypergraphs was defined in [50]. To this end, we mathematically define the normalized hypergraph similarity matrix H_sim as follows.
H_sim = D_v^(−1/2) He W D_he^(−1) He^T D_v^(−1/2),   (12)

where D_v and D_he are diagonal matrices of size n′ × n′ and m′ × m′ storing the vertex and hyperedge degrees, respectively. Moreover, W is an m′ × m′ diagonal matrix representing the hyperedge weights. With this, the dimensionality of the normalized hypergraph similarity matrix H_sim becomes n′ × n′.
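Following the hypergraph normalization of [50], H_sim can be computed entrywise without forming the intermediate matrices; a plain-Python sketch (function names ours):

```python
def normalized_hypergraph_similarity(He, w):
    # He: n' x m' binary incidence matrix; w: list of m' hyperedge weights.
    # Computes H_sim = D_v^(-1/2) He W D_he^(-1) He^T D_v^(-1/2) entrywise.
    n, m = len(He), len(He[0])
    d_v = [sum(He[i][e] * w[e] for e in range(m)) for i in range(n)]  # vertex degrees
    d_he = [sum(He[i][e] for i in range(n)) for e in range(m)]        # hyperedge degrees
    dv_is = [d ** -0.5 if d > 0 else 0.0 for d in d_v]
    return [[dv_is[i] * dv_is[j] * sum(
                 He[i][e] * He[j][e] * w[e] / d_he[e] for e in range(m))
             for j in range(n)] for i in range(n)]

# 3 vertices, 2 unit-weight hyperedges: {0, 1} and {0, 2}
H_sim = normalized_hypergraph_similarity([[1, 1], [1, 0], [0, 1]], [1, 1])
```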
Having described the (hyper)graph similarity matrix construction, we next discuss the optimization approach for performing overlapping clustering.

D. GREEDY CLUSTER REFINEMENT
Overlapping correlation clustering cannot be solved optimally in polynomial time, as it is an NP-hard problem [6]. Thus, we employ an iterative greedy algorithm focused on improving the label set quality of one vertex at a time. More specifically, keeping the label sets of all the other vertices in the (hyper)graph fixed, the greedy algorithm applies a local optimization (on one vertex) to improve the cost of the overall solution until convergence. The steps for performing overlapping correlation clustering are portrayed in Algorithm 2.
The first step requires initializing each vertex u ∈ V with a set of cardinality r consisting of randomly assigned labels (line 1). This initialization allows each data-item to be simultaneously assigned to r different nodes. Next, the aforementioned greedy local optimization approach is applied to iteratively refine the label sets of each vertex (lines 2-7). More specifically, the label set of each vertex u ∈ V is iteratively improved while keeping the label sets of all the other vertices fixed, until the overall loss L_Overlap(V, F) converges. Note that Eq. (5) is reformulated to clearly depict the loss with respect to each vertex u, and is stated as

L^u_Overlap(F(u)) = Σ_{v ∈ V, v ≠ u} |M_sim(u, v) − S(F(u), F(v))|.
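A compact sketch of this refinement loop, using an exhaustive local step over all size-r label sets (feasible only for small l; the paper's greedy step may be more sophisticated, and all names here are our own):

```python
import random
from itertools import combinations

def local_loss(u, labels_u, vertices, M_sim, F):
    # L^u_Overlap: disagreement of vertex u with all others, F fixed elsewhere
    return sum(abs(M_sim[u][v] - (1.0 if labels_u & F[v] else 0.0))
               for v in vertices if v != u)

def greedy_refine(vertices, M_sim, nodes, r, max_iters=50):
    F = {u: set(random.sample(nodes, r)) for u in vertices}   # line 1: random init
    for _ in range(max_iters):                                # line 2: until convergence
        changed = False
        for u in vertices:                                    # line 3: sweep vertices
            best = min((set(c) for c in combinations(nodes, r)),
                       key=lambda c: local_loss(u, c, vertices, M_sim, F))
            if best != F[u]:                                  # lines 4-5: local update
                F[u], changed = best, True
        if not changed:
            break
    return F

random.seed(0)
M = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
F = greedy_refine([0, 1, 2], M, [0, 1], 1)
```

On this toy input the similar pair (0, 1) ends up sharing a node while vertex 2 is placed elsewhere, regardless of the random initialization.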

VI. EXPERIMENTS
In this section, we evaluate the effectiveness of the proposed UnifyDR framework in solving the CDR-Single and CDR-Multi problems through experiments on data simulated using a decision support benchmark, and data extracted from a large-scale location-based OSN, respectively. We perform experiments with algorithms implemented in C++ on an Intel(R) Xeon(R) E5-2680 v3 24-core machine with 2.5 GHz CPU and 256 GB RAM running Linux Ubuntu 18.04. Results corresponding to all the methods (barring Nearest) are averaged over 10 runs.

A. CDR-SINGLE 1) DATASET
In accordance with the state-of-the-art for data placement research in scientific workflows [17], we use data simulated based on the TPC decision support (TPC-DS) benchmark [11]. The TPC-DS benchmark provides an appropriate medium to study the OLAP use-case, as it facilitates the modeling of various performance aspects of a decision support system, such as query execution and data maintenance. The benchmark contains a total of 24 database tables (7 fact and 17 dimension tables) corresponding to data-items. Additionally, it contains 99 queries that represent the system workload.

2) SETUP
A real-world distributed execution environment is simulated by partitioning the 24 data-items into l nodes. The number of nodes l is varied from 2 to 8. Since the fact tables are usually significantly larger than the dimension tables, the size of the data-items corresponding to the fact tables is chosen from the range [50, 100] units, while that of those corresponding to the dimension tables is chosen from the range [1, 10]. Furthermore, let the total size of all the data-items be S = Σ_{t∈T} size(d(t)) and the size of the largest data-item be S_max; then, the storage capacity of each node N_j ∈ N is set as max(S/l, S_max). This is because the storage capacity of a node depends upon both the total size of all the data-items and the size of the largest data-item. In summary, the CDR placement task for the OLAP use-case reduces to partitioning the 24 data-items corresponding to the database tables into {2, 4, 8} nodes, respectively, based on the data-request patterns triggered by the 99 user queries.
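The capacity rule max(S/l, S_max) is easy to state in code (a trivial sketch; sizes are in the abstract units above):

```python
def node_capacity(sizes, l):
    # capacity = max(S / l, S_max): an even share of the total data size,
    # but never smaller than the largest single data-item
    return max(sum(sizes) / l, max(sizes))

# e.g., two large fact tables and two small dimension tables over l = 4 nodes
cap = node_capacity([80, 90, 5, 3], 4)
```

Here S/l = 44.5 but S_max = 90, so each node needs capacity 90 to be able to host the largest table.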
Baselines: We consider the following two baselines.
• Random: obtains a placement by randomly partitioning the set of data-items D into |N | nodes.
• Bipartite graph partitioning (Metis): is the placement algorithm proposed in [17], which constructs a bipartite graph of the set of queries Q and the set of tables D(T), and employs Metis [25] to perform graph partitioning. The authors tackle replica placement using a heuristic termed H2. As discussed in Sec. II, Metis (with the replication heuristic H2) [17] serves as the representative state-of-the-art method for data placement and replication of data-intensive scientific workflows.
Parameters: The parameters for Metis are set based on the recommendations by the authors of [17], where k-way partitioning with 1000 cuts and 1000 iterations is used. Lastly, we fix the replication factor r to 3 based on best practices prescribed in the field of data storage management [20].
Evaluation Metrics: The metrics used to evaluate the effectiveness of OverlapG in solving the CDR-Single problem are stated as follows:
• Efficiency: The execution time, measured as the time required to obtain the CDR placement, is used to evaluate the efficiency of the methods benchmarked in this study.
• Efficacy: is measured using the node span S(•) of data-request patterns, which is defined as the average number of node accesses required to fetch the data-items requested in a query Q_k corresponding to a request pattern R(Q_k). The average of the node spans over all data-request patterns R(Q_k) ∈ R represents the span of the entire system workload. Note that we normalize the node spans obtained from different techniques to the range [0, 1], as this provides an intuitive way to analyze their relative performance. Furthermore, since the optimization problem underneath CDR-Single is concerned with the minimization of the node span, smaller values imply better performance.
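For the replication-free setting (r = 0), the span metric reduces to counting distinct nodes per pattern (a sketch with our own names; with replicas the count depends on which replica is read, which we omit):

```python
def average_node_span(request_patterns, placement):
    # placement maps each data-item to its (single) node; a pattern's span
    # is the number of distinct nodes touched when fetching its items
    spans = [len({placement[item] for item in pattern})
             for pattern in request_patterns]
    return sum(spans) / len(spans)

# items 0 and 1 are co-located, so only the patterns touching item 2 span 2 nodes
span = average_node_span([[0, 1], [0, 2], [1, 2]], {0: 0, 1: 0, 2: 1})
```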

3) RESULTS
Figs. 5a-5d present the results for the CDR-Single problem on the considered evaluation metrics. We compare OverlapG with the considered baselines in two settings: (1) no replication (r = 0), and (2) with replication (r = 3). This analysis is performed to showcase the backward compatibility of overlapping clustering; in other words, it indicates the ability of overlapping clustering to assign each data-item to one and only one node. Fig. 5a shows that OverlapG produces a node span similar to Metis, which is the current state-of-the-art. Furthermore, both OverlapG and Metis are significantly better than the Random baseline, thereby portraying their ability to effectively capture data-item–data-item associations. Having said that, we also compare the execution time of OverlapG with Metis under this setting. Fig. 5c shows that both techniques require a similar amount of time, 2.3-4.6 seconds for Metis and 2.6-4.9 seconds for OverlapG, to perform data placement. Note that the execution time of OverlapG is slightly higher owing to the relatively higher complexity of its optimization.
Having analyzed the ability of OverlapG to perform data placement without replication (r = 0), we next evaluate its ability to effectively solve the CDR-Single problem. As discussed previously, for this analysis we consider the setting where each data-item can possess at most 3 replicas (r = 3). It is evident from Fig. 5b that OverlapG significantly outperforms the Random baseline, while also achieving a reduction of around 30% in node span when compared to Metis+H2. This reduction is attributed to the ability of OverlapG to jointly optimize data and replica placement in a unified manner. On the contrary, the two-step approach of performing data placement using Metis followed by replica placement using a heuristic (H2) falls short of finding a high-quality solution to the CDR-Single problem. Furthermore, as shown in Fig. 5d, it is interesting to note that OverlapG achieves a speed-up of around 20% over Metis+H2 as well. While OverlapG possessed a slightly higher execution time than Metis for r = 0, it is faster than Metis+H2 for r = 3. The main reason behind this is the requirement of executing two algorithms (Metis and H2) on the entire workload for the latter.
In summary, our experimental evaluation portrays the effectiveness of OverlapG in solving the CDR-Single problem. Additionally, its ability to perform data and replica placement in a single step allows for a better and unified system design.

B. CDR-MULTI 1) DATASET
We use the publicly available Gowalla social network dataset (http://snap.stanford.edu/data/loc-gowalla.html), as it is a popular choice in the data placement literature on geo-distributed cloud services [3], [43]. The dataset consists of 196591 social network users (represented via vertices in the social network) and 950327 friend relationships (represented via edges). Additionally, the dataset provides information pertaining to user behavior: it contains 6442890 check-ins registered by social network users between February 2009 and October 2010, triggering 102314 unique data-request patterns in total.

2) SETUP
The AWS global infrastructure [2] is used as a basis for simulating a geographically distributed cloud execution environment. Following convention in the literature [3], [41], the l = 9 oldest AWS data center (node) regions (California, Frankfurt, Ireland, Oregon, Sao Paulo, Singapore, Sydney, Tokyo, and Virginia) are used to set up our experiments. The traffic and storage costs for each node are set as advertised by Amazon. Furthermore, we use the Linux ping command [38] to obtain the packet transfer latency between the chosen node regions, which provides a good estimate of their inter-node latency. The aforementioned steps enable us to closely mimic the real AWS execution environment. Table 1 presents the node characteristics.
Based on our analyses, we identified a disparity in user check-in behavior. More specifically, while there exist nodes that receive a huge number of check-ins (e.g., Frankfurt and Virginia), there are others where only a few check-ins are registered (e.g., Sydney and Sao Paulo). Having said that, both the number and the size (measured as the size of the triggered data-request pattern) of the check-ins registered in a region dictate the storage capacity required at each node; the storage capacity of each node is therefore set in proportion to the check-in load of its region. In summary, the CDR placement task for the OSN use-case reduces to distributing 196591 data-items corresponding to OSN users into 9 nodes based on 102314 data-request patterns triggered from user check-ins.
Baselines: We consider the following four baselines.
• Random: obtains a placement by randomly partitioning the set of data-items D into |N | nodes.
• Nearest: produces a placement by assigning each data-item to the node with the highest number of requests for that data-item.
• Hypergraph Partitioning (Hyper): partitions the hypergraph induced on data-items and nodes using algorithms from the PaToH toolkit [7] to produce the data placement output. Hyper was proposed in [41], [43].
• Spectral Clustering (Spectral): obtains a placement using spectral clustering on hypergraphs and achieves superior efficiency by leveraging randomized eigendecomposition methods. Spectral was proposed in [5]. Recall that, as described in Sec. II, the representative state-of-the-art for data placement and replication of data-intensive services into geographically distributed clouds comprises the techniques Spectral [5] and Hyper [43].
Parameters: The weight vector W (Eq. (3)) is one of the most important parameters for tuning the optimization underlying the Spectral, Hyper, and OverlapH algorithms. It enables preferential optimization of certain chosen evaluation metrics over others by controlling the relative importance of the different hyperedge weights described in Sec. IV-B. For our experiments, we employ 4 different settings of the parameter W, namely W_1: {100, 1, 1, 1}, W_2: {1, 100, 1, 1}, W_3: {1, 1, 100, 1}, and W_4: {1, 1, 1, 100} for minimizing the node span S(•), inter-node traffic, inter-node latency κ(•), and storage cost, respectively. Note that the value 100 used to obtain the different weight-vector settings W_1-W_4 is only indicative of the higher relative importance provided to the metric under consideration. Having said that, any sufficiently large value should facilitate reproducibility of the main findings in the presented results, as the observed trends do not depend on the chosen value 100. Unless stated otherwise, spectral clustering is performed using the 100 smallest eigenvectors of the hypergraph Laplacian. Similar to the parameter setting for CDR-Single (Sec. VI-A2), we fix the replication factor r to 3.
Evaluation Metrics: The metrics used to evaluate the effectiveness of OverlapH in solving the CDR-Multi problem are stated as follows.
• Efficiency: Similar to the CDR-Single case (Sec. VI-A2), the efficiency of the methods is measured using their execution time.
• Efficacy: is measured using multiple metrics, namely the node span S(•) (Span), the outgoing traffic cost (Traffic), the inter-node latency κ(•) (Latency), the storage cost (Storage), and a weighted sum (Obj) of the four aforementioned metrics using the weights prescribed by W. Similar to the CDR-Single case (Sec. VI-A2), we normalize the results obtained from different techniques corresponding to each evaluation metric to the range [0, 1], as this provides an intuitive way to analyze their relative performance. Furthermore, normalization ensures an equal and fair contribution of each evaluation metric towards Obj, as all values lie in the common range [0, 1]. Lastly, it is important to note that since the optimization problem underneath CDR-Multi is concerned with the minimization of the aforementioned evaluation metrics, smaller values imply better performance.
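The Obj metric can be sketched as a per-metric normalization followed by a weighted sum (the min-max normalization scheme is our assumption; the paper only states that values are scaled to [0, 1]):

```python
def weighted_objective(metrics, weights):
    # metrics: {method: {metric_name: raw_value}}; weights: {metric_name: w}.
    # Each metric is min-max normalized across methods, then combined.
    names = list(weights)
    lo = {k: min(m[k] for m in metrics.values()) for k in names}
    hi = {k: max(m[k] for m in metrics.values()) for k in names}

    def norm(k, v):
        return 0.0 if hi[k] == lo[k] else (v - lo[k]) / (hi[k] - lo[k])

    return {method: sum(weights[k] * norm(k, m[k]) for k in names)
            for method, m in metrics.items()}

# weight vector favoring span (analogous in spirit to W_1)
obj = weighted_objective(
    {"A": {"span": 2.0, "traffic": 10.0},
     "B": {"span": 4.0, "traffic": 5.0}},
    {"span": 100, "traffic": 1})
```

Method A wins under this weighting despite its higher traffic, since smaller Obj is better.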

3) RESULTS: EFFICACY
It is evident from the results portrayed in Figs. 6a-6d that OverlapH is the best performing technique in terms of the weighted sum of the considered metrics, i.e., Obj; this is observed across each of the four weight vector settings W_1-W_4 considered in the evaluation. Specifically, OverlapH substantially outperforms Random and Nearest, and is around 30-40% and 20-30% better when compared to Hyper and Spectral, respectively.
Redirecting our focus to the other evaluation metrics, it can be noticed that Nearest outperforms Hyper, Spectral, and OverlapH in some cases; however, the latter are still significantly better than the Random method. For instance, consider Fig. 6a: it can be observed that Nearest is better on the inter-node traffic and latency metrics. This is because, according to the weight vector setting W_1, minimizing the node span holds the highest priority, while the traffic and latency metrics have lower weights in the optimization objective. A similar behavior is observed for the other three weight vector settings W_2, W_3, and W_4 as well (Figs. 6b-6d). To understand this observed behavior better, let us analyze the results presented in Fig. 6d. It is not hard to infer that storage cost might be inversely related to other parameters such as inter-node latency and traffic. Therefore, preferentially optimizing to achieve lower storage costs (W_4), thereby also obtaining better performance on Obj, might lead a technique to suffer on other metrics, i.e., a lower storage cost might lead to higher latencies or traffic costs. Despite this behavior, most importantly, OverlapH significantly outperforms all the considered baselines on the corresponding evaluation metric that the weight-vector setting is tuned to optimize. More fundamentally, in addition to being better on Obj, OverlapH outperforms the other methods in minimizing the node span S(•), inter-node traffic cost, inter-node latency κ(•), and storage cost, when a higher preference is given to these metrics under the weight-vector settings W_1, W_2, W_3, and W_4, respectively.
Moving ahead, we analyze the reason behind the sub-optimal performance of the Nearest method. The main limitation is that Nearest is inclined to assign each data-item to the node that receives the highest number of access requests for that data-item, which consequently minimizes (on average) the geographical distance between the data-item and the source location of the data-request. Note that this optimization strategy is oblivious to the fact that the storage or traffic costs might not be correlated with the distance, thereby leading to sub-optimal performance in real-world settings that require multi-objective optimization. We also refer the reader to Table 2, which presents a quantitative summary of the performance of all the considered baselines, indicating how much worse each baseline is relative to OverlapH.
TABLE 2. Quantifying the performance of the considered baselines relative to OverlapH on the evaluation metrics.

Based on the above analysis, it is clear that Hyper, Spectral, and OverlapH possess the capability to adapt the optimization based on the input weight vector setting. This is because of their higher-order modeling capabilities, courtesy of hypergraphs, which render them better suited for performing multi-objective optimizations. Further, since OverlapH models data placement and replication as a joint optimization problem (CDR-Multi), it achieves better performance on the evaluation metrics than both Hyper and Spectral, which solve each problem independently.

4) RESULTS: EFFICIENCY AND SCALABILITY
In this section, we analyze the execution time performance of Hyper, Spectral, and OverlapH (the three techniques that stand out in terms of performance on the efficacy-related evaluation metrics, Sec. VI-B3) on the Gowalla dataset. It is evident from Fig. 7 that OverlapH substantially outperforms Hyper and Spectral in terms of execution time efficiency. Specifically, OverlapH achieves an average speed-up of ≈ 4-5 and ≈ 2-3 times over Hyper and Spectral, respectively. The capability of scaling to large datasets is a desired property in any CDR placement algorithm. OverlapH, with its ability to gracefully scale to large social networks, stands strong on this requirement, thereby being highly advantageous in real-world scenarios when compared to Hyper and Spectral.
To summarize, our extensive experimental evaluation portrays the efficiency, scalability, and effectiveness of OverlapH in solving the CDR-Multi problem. Additionally, its ability to perform data and replica placement in a single step allows for a better and unified system design.

VII. CONCLUSION AND FUTURE WORK
The problem of combined data and replica placement has been addressed in this article. Although replication is an integral part of data placement, we identified that most of the techniques in the literature do not address the two placement steps as a single joint optimization problem, but rather treat them as two independent problems. Hence, existing techniques employ a two-stage approach: performing data placement followed by replica placement. Consequently, with the objective of combining data and replica placement, we proposed a unified paradigm, called CDR, which facilitates the two problems being studied jointly. We proposed two variants of the CDR problem, CDR-Single and CDR-Multi, with applicability in addressing use-cases in two interesting real-world application domains: OLAP and OSN, respectively. To effectively solve the CDR problem (and its variants), we proposed a generic framework, called UnifyDR, which possesses the capability to unify data and replica placement. We also proposed two algorithms, namely OverlapG and OverlapH, capable of partitioning a set of data-items while allowing each data-item to be assigned to multiple nodes, thereby facilitating the joint optimization of data and replica placement. While OverlapG performs overlapping correlation clustering on a graph to address the CDR-Single problem, the OverlapH algorithm performs overlapping clustering on hypergraphs to solve the CDR-Multi problem. To evaluate the effectiveness and efficiency of the proposed algorithms OverlapG and OverlapH, we performed experiments on data simulated using a decision support benchmark and a trace-based social network dataset, respectively. We found that the proposed algorithms are approximately 20-30% better on the evaluated metrics while being 2-8 times faster.
Currently, UnifyDR and its algorithms (OverlapG and OverlapH) perform a static analysis of the considered workload to learn a data and replica placement strategy.In other words, they lack the ability to manage dynamic workloads.In the future, the focus will be to make UnifyDR and the underlying algorithms capable of handling updates in the data in an online manner, and dynamically updating the CDR placement output.

FIGURE 1. The standard two-phase data placement process (in green), where the data-items (black dots) are first placed in nodes and then replicated (red dots); and the proposed CDR paradigm (in magenta), where data and replica placement are jointly performed in a single step.

Definition 1 (Data-Items (D)): A data-item is defined as an atomic unit of data storage and transfer. D denotes the set of data-items, where |D| = n.

Definition 2 (Nodes (N)): A node constitutes a set of resources to store the data-items and perform different computational tasks on the stored data-items. Nodes are denoted using the set N, where |N| = l.

FIGURE 2. An example of an OLAP join query.

FIGURE 4. Overview of the proposed UnifyDR framework for combined data and replica placement.
(2) Data-item–node hyperedges (R_N): this hyperedge captures data-item–node associations by connecting each data-item contained in the data-request triggered in response to a user check-in to the node location of the registered check-in. The set of data-items D and nodes N constitute the hypergraph vertex set V_H, resulting in a total of |V_H| = n′ = n + l vertices. Similarly, the hypergraph edge set E_H is composed of the data-request pattern hyperedges R and the data-item–node hyperedges R_N, totaling |E_H| = m′ = |R| + nl hyperedges.

Algorithm 1 CDR Placement Algorithm
Input: D, R, N, r, l, Q, C
Output: Partitioning of the set of data-items P(D) into l nodes allowing r replicas
1: procedure OverlapG(D, R, N, r, l, Q)
2: A, W_A ← ConstructGraph(D, R, N, Q)
3-4: Compute normalized graph similarity matrix G_sim as described in Eq. (9)
5: P(D) ← OverlappingCorrelationClustering(V_G, G_sim, l, r)
6: end procedure
7: procedure OverlapH(D, R, N, r, l, C)
8: He, W ← ConstructHypergraph(D, R, N, C)
9-10: Compute normalized hypergraph similarity matrix H_sim as described in Eq. (12)
11: P(D) ← OverlappingCorrelationClustering(V_H, H_sim, l, r)
12: end procedure
13: return P(D)

Algorithm 2 Overlapping Clustering Algorithm
Input: V, M_sim, l, r
Output: Partitioning of the (hyper)graph vertex set P(V) into l clusters allowing r replicas
1: Randomly initialize the label sets of size r for each data-item u ∈ V
2: while L_Overlap(V, F) decreases do
3: for each u ∈ V do
4: find the label set F(u) that minimizes L^u_Overlap(F(u))
5: update F with the new label set of u
6: end for
7: end while
8: return P(V) defined by F

FIGURE 5. Analyzing the performance of OverlapG on the considered evaluation metrics. OverlapG achieves (a) a similar node span S(•) as Metis when r = 0; while (b) it results in a reduction of ≈ 31% when compared to Metis+H2 with r = 3. The execution time of OverlapG is (c) similar to Metis for r = 0, while it is (d) ≈ 1.3 times faster than Metis+H2 for r = 3.
TABLE 1. (a) Traffic and storage costs, and (b) inter-node latency based on geo-distributed Amazon clouds.