Targeted Influence Maximization Based on Cloud Computing Over Big Data in Social Networks

This paper focuses on targeted influence maximization based on cloud computing in social networks. Most existing influence maximization work assumes that the influence diffusion probabilities on edges are fixed, and identifies the Top-$k$ users that maximize the spread of influence assuming knowledge of the entire network graph. However, in real-world scenarios, edge probabilities typically differ across topics and may be affected by the information received. Meanwhile, obtaining complete network data is difficult due to privacy and computational considerations. Moreover, existing influence maximization algorithms that consider target users do not discuss cloud computing, which leads to low computational efficiency when dealing with big datasets in social networks. To this end, this paper proposes a targeted influence maximization solution based on cloud computing. First, a new topic-aware model called the tag-aware IC model is presented, which takes into account users' interests, the characteristics of the item being propagated, and the similarity between users and the related information. Then, efficient algorithms with an approximation guarantee are provided using a bounded number of queries to the graph structure. These methods aim to find a seed set that maximizes the expected influence spread over target users who are relevant to given topics. Finally, empirical studies of the proposed algorithms are designed and performed on real datasets. The experimental results show that the techniques in this paper achieve speedup and savings in storage compared with the state-of-the-art methods.


I. INTRODUCTION
Online social networks play a critical role in the spread of information, ideas, and influence among people in modern life [1]-[6]. The new generation of social networks contains billions of nodes and edges, and these networks have become one of the most representative sources of big data. Managing and mining big social data is a challenge for both academia and industry. Social networks have been actively used as a propagation and marketing platform. For instance, in viral marketing, a marketer tries to select a set of customers (called the seed set) with great influence on a promoted product. With a fixed budget on the number of selections, the marketer aims to maximize the number of customers who finally adopt the product. This is the classical influence maximization problem [7]-[13].
The associate editor coordinating the review of this manuscript and approving it for publication was Francesco Piccialli.
Since the influence maximization problem is NP-hard [14], Kempe et al. [15] first proposed a greedy algorithm that returns a seed set with a $(1 - 1/e - \epsilon)$ approximation ratio to the optimal solution. However, the greedy solution still takes a prohibitively long time to finish. To address the efficiency issue, many studies propose scalable influence maximization algorithms [16]-[22]. Most scalable approaches require complete knowledge of the network and focus on finding globally influential nodes over the entire social network to maximize the expected influence spread. However, in practice, it is difficult and costly to collect the entire network information in social networks with big data. Therefore, this paper combines influence maximization with cloud computing to improve performance.
In recent years, topic-aware influence maximization has emerged in response to the demands of practical applications. It extends the classical influence maximization problem. Topic-aware influence maximization algorithms assume that the influence diffusion probabilities between two users differ across topics. However, important challenges remain unsolved in real applications of topic-aware influence maximization. One of them is that the influence strength between two users in the real world may depend on the topic and may be affected by the information received. In an online social network, e.g., Weibo, each user is associated with several tags, which represent the user's interests. These tags can be obtained from hashtags and representative keywords in the user's content. Moreover, each piece of information is associated with tags that can be obtained by language processing algorithms. Consider an example. User x of Weibo has a friend y, and both are college teachers of computer science. One day, x buys a programming book and posts a message about the book with pictures and comments. Meanwhile, y receives advertisements for two books: this programming book and a book on literature. y is interested in the programming book rather than the literature one because it is related to his career. Finally, y buys the programming book because he is influenced by his friend x.
Motivated by the above observations, this paper focuses on targeted influence maximization with partial network information based on cloud computing. The goal is to find the Top-k influential nodes who are relevant to given topics for maximizing the influence spread within target customers. First, a new topic-aware model, which is extended from the IC model, is proposed. Then, as the optimization problem is NP-hard, efficient algorithms with approximation guarantee are provided. The main idea is to adopt the topic aggregation strategy and a bounded number of queries for constructing a targeted sketch under the proposed model. Based on this, the greedy algorithm is used on the maximum coverage problem to find the Top-k seed users. Finally, experimental evaluations are conducted on several real-world social networks. The experimental results verify the effectiveness and efficiency of proposed algorithms.
The rest of the paper is organized as follows. Section II presents background and related work. Section III introduces the problem studied in this paper and proposes a new topic-aware model. Section IV designs the algorithms for targeted influence maximization using a targeted sketching technique. Section V demonstrates the efficiency and effectiveness of the proposed algorithms on several real social networks. Section VI summarizes the paper and gives relevant conclusions.

II. RELATED WORK
The influence maximization problem is NP-hard and has been extensively studied. Kempe et al. [15] were the first to formulate influence maximization as a combinatorial optimization problem. They present a greedy approach that yields $(1 - 1/e - \epsilon)$-approximate solutions. This method evaluates the influence spread of any seed set by Monte Carlo (MC) simulations, which is computationally expensive. To improve computational efficiency, many studies propose scalable influence maximization algorithms. These algorithms can be classified into three categories: the greedy-based approach, the heuristic-based approach and the sketch-based approach.

A. GREEDY-BASED APPROACH
The first category of approaches focuses on optimizing the MC simulations. Methods such as CELF [30], CELF++ [31] and UBLF [32] reduce the number of influence function evaluations through a lazy evaluation technique. The technique of [33] utilizes community structure to reduce the number of nodes to be estimated. Although they improve the efficiency of the greedy algorithm [15], these approaches still incur significant computational overhead when selecting the seed set on graphs with billions of edges. Therefore, the study of influence maximization has shifted its focus to the heuristic-based and the sketch-based approaches.
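The lazy evaluation idea behind CELF-style methods can be illustrated with a minimal Python sketch. It is not the implementation from [30]-[32]: the function name `celf` and the `spread` callable, a stand-in for an MC influence estimate that is assumed to be monotone and submodular, are hypothetical names chosen for illustration.

```python
import heapq

def celf(candidates, k, spread):
    """Lazy greedy selection for a monotone submodular `spread` (assumed).

    `spread` maps a frozenset of nodes to a number; here it replaces the
    Monte Carlo influence estimate used in practice.
    """
    seeds, current = [], spread(frozenset())
    # Max-heap entries: (negated marginal gain, node, round the gain was computed in).
    heap = [(-(spread(frozenset([v])) - current), v, 0) for v in candidates]
    heapq.heapify(heap)
    while heap and len(seeds) < k:
        neg_gain, v, computed_at = heapq.heappop(heap)
        if computed_at == len(seeds):
            # The gain is up to date, so by submodularity v is the best choice.
            seeds.append(v)
            current += -neg_gain
        else:
            # Stale gain: re-evaluate against the current seeds and push back.
            gain = spread(frozenset(seeds + [v])) - current
            heapq.heappush(heap, (-gain, v, len(seeds)))
    return seeds
```

Only the node at the top of the heap is re-evaluated in each round, which is where the savings over the plain greedy algorithm would come from.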

B. HEURISTIC-BASED APPROACH
The second category, i.e., the heuristic-based approach, relies on approximate scoring mechanisms to estimate the influence spread of the seed set instead of running heavy MC simulations. This makes influence maximization algorithms more scalable on larger graphs. Many methods generate the seed set according to a ranking metric [34]-[37]. For instance, degree discount [34] adopts a simple discount metric for influence estimation. Group-PageRank [35] and IRIE [36] use the PageRank metric. Methods such as SP1M [39], PMIA [40], IPA [41], LDAG [40], SIMPATH [42] and EASYIM [43] leverage the idea that the influence of a node can be estimated using a function of the number of simple paths starting at that node. Although it improves practical efficiency, the heuristic-based approach loses the theoretical guarantee of the greedy-based approach.

C. SKETCH-BASED APPROACH
The third category of approaches overcomes the drawbacks of the greedy-based and the heuristic-based approaches. It achieves a balance between theoretical guarantee and practical efficiency. The sketch-based approach first precomputes a large number of sketches and then uses the sketches to evaluate the influence spread, which avoids rerunning the MC simulations. In this paper, a sketching technique is proposed to solve targeted influence maximization. Methods such as NewGreedyIC [34], StaticGreedy [44], PrunedMC [45] and SKIM [44] construct sketches by examining the entire graph to estimate the influence spread. However, their time complexity is still too high for graphs with millions of nodes and billions of edges. To reduce the time consumption, Borgs et al. [23] first propose the concept of the Reverse Reachable (RR) set. An RR set consists of the nodes that can activate a randomly sampled node. Moreover, Borgs et al. [23] propose a method called Reverse Influence Sampling (RIS). This method keeps generating RR sets until the total number of edges examined during the generation process reaches a pre-defined threshold. TIM [24] and IMM [25] are techniques that further improve on RIS.
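As an illustration of the RR set concept, the following Python sketch samples one RR set under the IC model by a reverse graph search. It is a simplified sketch rather than the procedure of [23]: the function name `random_rr_set` and the `in_edges` adjacency format are assumptions made here.

```python
import random
from collections import deque

def random_rr_set(nodes, in_edges, rng):
    """Sample one Reverse Reachable set under the IC model.

    `in_edges[v]` is a list of (u, p) pairs, one per edge u -> v with
    influence probability p (a hypothetical input format).
    """
    root = rng.choice(nodes)              # the randomly sampled node
    rr, queue = {root}, deque([root])
    while queue:
        v = queue.popleft()
        for u, p in in_edges.get(v, []):
            # Each incoming edge is kept independently with probability p.
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr
```

A node that appears in many sampled RR sets is likely to activate many other nodes, which is why seed selection over RR sets reduces to a maximum coverage problem.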

D. TOPIC-AWARE INFLUENCE MAXIMIZATION
In the real world, the influence diffusion probabilities between two users are related to various topics. However, none of the above algorithms is topic-aware. To meet the needs of specific applications, topic-aware influence maximization studies are emerging [46]-[48]. The existing topic-aware influence maximization studies can be classified into two categories. The first category treats the users as topic-aware, and the second category treats the edges as topic-aware. Aslay et al. [46] are the first to study this problem. They design an index-based approach, INFLEX, with pre-computation and similarity search schemes. INFLEX first samples a set of topic distributions and pre-computes the seed set under each distribution by CELF++ [31]. At query time, given an online query, the method finds a sufficient set of pre-computed distributions similar to the online query and combines the corresponding seed sets by using a rank aggregation technique. In [46], Aslay et al. apply maximum-likelihood Dirichlet estimation for sampling topic distributions, a Bregman-ball tree for fast similarity search, and Kendall-τ distance-based schemes for seed set aggregation. Chen et al. [47] adopt a similar framework and develop optimization techniques that are suitable for some special graphs with properties such as being topically separable and sub-additive. As [46] and [47] have no theoretical guarantee on the influence spread, Chen et al. [48] propose an improved method. They develop algorithms with a bounded approximation ratio under PMIA [40]. Specifically, these algorithms use the samples to estimate upper and lower bounds for pruning instead of directly answering the query as in [46], [47].
Compared with existing topic-aware influence maximization solutions, all users and user-to-user influence strength are both topic-aware in this paper. Moreover, seed nodes are selected depending on multiple factors.

III. PROBLEM FORMULATION
A social network is generally modeled as a graph G = (V, E, P), where V is the set of nodes (i.e., users) in G, E ⊆ V × V is the set of directed edges in G, and P assigns an influence probability to each edge. To simulate the information propagation process in social networks, a number of models such as the Independent Cascade (IC) model and the Linear Threshold (LT) model [15] have been proposed.
In this section, IC model is introduced first, and then a new topic-aware model is proposed, which is extended from IC model.

A. IC MODEL
Under the classical IC model [15], each directed edge e = (u, v) has an independent influence probability P(e), which measures the social impact of user u on user v. Each user is either active or inactive, and an active user can activate an inactive neighbor with probability P(e). The information propagation process under the IC model unfolds in discrete steps. Initially, at time step 0, a selected set S of seed users is activated, while all other users are inactive. If a user u is first activated at time step t, then for each currently inactive outgoing neighbor v, u activates v at time step t + 1 with probability P(e). After time step t + 1, u cannot activate any user. Once a user is activated, the user remains active. The influence diffusion process runs until no more nodes can be activated.
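The discrete-step process described above can be sketched in Python as a single randomized cascade. This is a minimal illustration, not code from the paper; the function name `simulate_ic` and the `out_edges` adjacency format are assumptions.

```python
import random

def simulate_ic(out_edges, seeds, rng):
    """Run one IC cascade and return the set of active users.

    `out_edges[u]` lists (v, p) for each edge u -> v with probability
    P(e) = p (a hypothetical input format).
    """
    active = set(seeds)           # time step 0: only the seeds are active
    frontier = list(seeds)        # users first activated in the previous step
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                # u gets a single chance to activate each inactive neighbor v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active   # earlier users never try again
    return active
```

Averaging the size of the returned set over many runs gives the MC estimate of the influence spread used by the greedy-based approaches.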
The IC model is generally considered equivalent to the possible world semantics. Let $G = (V, E_G)$ denote a possible world, where $E_G \subseteq E$. It is one deterministic instance of the input uncertain graph, obtained by sampling each edge independently. Therefore, the probability of each possible world $G$ is as follows.
$$\mathrm{Prob}[G] = \prod_{e \in E_G} P(e) \prod_{e \in E \setminus E_G} (1 - P(e))$$
Given a set $T \subseteq V$ of target customers, the targeted influence spread in the possible world $G$ is defined as the number $\sigma_G(S, T)$ of target nodes that are reachable from the seed set $S$ in $G$, i.e.,
$$\sigma_G(S, T) = \sum_{t \in T} I_G(S, t),$$
where the indicator function $I_G(S, t) = 1$ if $t$ is reachable from at least one node of $S$ in $G$, and $I_G(S, t) = 0$ otherwise. The expected influence spread $\sigma(S, T)$ on the target set $T$ from the seed set $S$ is the expectation of the number of target nodes reachable from $S$, i.e., the sum over all possible worlds $G$, each weighted by its probability $\mathrm{Prob}[G]$:
$$\sigma(S, T) = \sum_{G} \mathrm{Prob}[G] \cdot \sigma_G(S, T).$$
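Because the number of possible worlds is exponential in the number of edges, the expectation is usually approximated by sampling. The Python sketch below is one hedged way to do this: it samples possible worlds edge by edge and averages the number of reachable target nodes. All function names and the `(u, v, p)` edge format are assumptions, and a seed that is itself a target is counted as reachable.

```python
import random

def sample_possible_world(edges, rng):
    """Keep each edge (u, v, p) independently with probability p."""
    world = {}
    for u, v, p in edges:
        if rng.random() < p:
            world.setdefault(u, []).append(v)
    return world

def targeted_spread(world, seeds, targets):
    """Count the target nodes reachable from the seeds in one possible world."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in world.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen & set(targets))

def estimate_expected_spread(edges, seeds, targets, samples, rng):
    """Average the targeted spread over independently sampled possible worlds."""
    total = sum(
        targeted_spread(sample_possible_world(edges, rng), seeds, targets)
        for _ in range(samples)
    )
    return total / samples
```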

B. TAG-AWARE IC MODEL
In the classical models, e.g., the IC and LT models, the influence probability P(e) is normally set to a constant or to $1/d_v$, where $d_v$ is the in-degree of node v. However, in real applications of social networks, the influence probability may depend on various topics and may be affected by the information received. Therefore, a new tag-aware IC model is proposed in this section, which extends the IC model to consider multiple factors. Let C denote the set of all tags present in the social network. These tags capture not only users' interests but also the characteristics of the information being propagated.
Appearance of specific tags in the social network affects the corresponding influence diffusion probabilities between two users. Therefore, the influence probability in the classic IC model changes to be a function P : E × C → (0, 1]. It assigns a conditional probability to every edge (u, v) ∈ E given a specific tag c ∈ C. In other words, P ((u, v) | c) is the probability that a user u will influence his neighbor v, given the tag c.
Similarly, in the tag-aware IC model, a social network is represented by a graph G = (V, E, P), and each node is in either an active or an inactive state. In the new model, both the nodes and the influence strength on the edges are topic-aware. In other words, each node v ∈ V is associated with a set of tags $C_v \subseteq C$, and each message M is specified with a set of tags $C_M \subseteq C$. Hence, the information diffusion procedure under the tag-aware IC model can be explained as follows.
Given a social network G, a target set T , a message M with a set of tags C M ⊆ C, and a seed set S, the information diffusion process unfolds in discrete steps.
• At time step 0, all nodes in S become active, and all other nodes are set to inactive.
• At time step t, each node u that has just been activated tries to activate every inactive neighbor v. For each tag $c \in C_M$, u influences its neighbor v with probability $P(e \mid c)$. If v is influenced by u for all tags in $C_M$, then v calculates the similarity $\mathrm{Sim}(v_{tag}, M_{tag})$ between message M and itself, which is the probability that v continues with the next step; otherwise, v remains inactive. Next, a threshold is computed from the set $N_v$ of neighbors of v and the set of active nodes in $N_v$. If node v is activated, it continues to activate its neighbors.
• The procedures above run until no more new nodes can be activated.
The tag-aware IC model assumes that tags are independent of one another. Hence, given a set of tags $C_M \subseteq C$, the information diffusion probability via an edge $e$ can be derived as $P(e \mid C_M) = 1 - \prod_{c \in C_M} (1 - P(e \mid c))$. According to the information diffusion process of the tag-aware IC model, all edge probabilities depend on $P(e \mid c)$ and $\mathrm{Sim}(v_{tag}, M_{tag})$. Hence, the probability of each possible world $G = (V_G, E_G)$ in the tag-aware IC model can be calculated as follows.

Prob [G|C
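The aggregation rule $P(e \mid C_M) = 1 - \prod_{c \in C_M} (1 - P(e \mid c))$ stated for independent tags can be checked with a small Python sketch. The function name `aggregate_tag_probability` and the dictionary input are hypothetical, and treating a tag with no recorded probability as contributing 0 is an assumption of this sketch rather than part of the model.

```python
from math import prod

def aggregate_tag_probability(per_tag_prob, message_tags):
    """Return P(e | C_M) = 1 - prod over c in C_M of (1 - P(e | c)).

    `per_tag_prob` maps each tag c to P(e | c) for one fixed edge e.
    Tags missing from the map are treated as probability 0 (an assumption).
    """
    return 1.0 - prod(1.0 - per_tag_prob.get(c, 0.0) for c in message_tags)
```

For example, with $P(e \mid c_1) = 0.5$ and $P(e \mid c_2) = 0.2$, the aggregated probability is $1 - 0.5 \cdot 0.8 = 0.6$, which is larger than either per-tag probability.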
Analogously, the expected influence spread $\sigma(S, T, C_M)$ of the seed set $S$, given the target set $T$ and the tag set $C_M$, is computed as:
$$\sigma(S, T, C_M) = \sum_{G} \mathrm{Prob}[G \mid C_M] \cdot \sigma_G(S, T),$$
where the sum ranges over all possible worlds $G$.

C. PROBLEM DEFINITION
Given the social network G = (V, E, P), the target set T, and the size of the seed set k, targeted influence maximization aims at finding the Top-k seed nodes such that the influence spread within the target users is maximized.
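Stated as an optimization problem, and assuming the tag-aware expected spread $\sigma(S, T, C_M)$ for a message with tag set $C_M$, the definition can be written as below. The symbol $S^{*}$ for the returned seed set follows the notation used later in the seed selection steps; the constraint $|S| = k$ is one reading of "Top-k".

```latex
S^{*} = \operatorname*{arg\,max}_{S \subseteq V,\; |S| = k} \; \sigma(S, T, C_M)
```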

IV. ALGORITHMS FOR TARGETED INFLUENCE MAXIMIZATION
The goal of this paper is to find a seed set that maximizes the influence inside a target set of users considering various topics. This section presents efficient algorithms for targeted influence maximization. For addressing targeted influence maximization using partial information of the network, some ideas are adopted from [49], [50]. At a high level, the proposed solution consists of two phases: limiting for targeted sketching and seed selection.

A. TARGETED SKETCHING
This section proposes a targeted sketching technique with the approximation guarantee $(1 - 1/e - \epsilon)$. Let G denote a deterministic sub-graph of the input uncertain graph G. It is generated by removing each edge $e \in E$ independently with probability $(1 - P(e \mid c)) \cdot (1 - \mathrm{Sim}(v_{tag}, M_{tag}))$ based on the tag-aware IC model. Since a sub-graph G is a combination (i.e., graph union) of each chosen tag's individual uncertain graph, random samples are first built on the uncertain graphs by considering each tag separately. First, only the edges e associated with a tag c in G are kept. Then, for each node, its neighbors are probed based on the tag-aware IC model, keeping each edge with probability $P(e \mid c) \cdot \mathrm{Sim}(v_{tag}, M_{tag})$. No node is probed more than once, and each edge is sampled at most once. Any remaining edge $e \in E$ is removed with probability $(1 - P(e \mid c)) \cdot (1 - \mathrm{Sim}(v_{tag}, M_{tag}))$. Next, when a message associated with a set of tags $C_M$ is received, the targeted sketching method randomly selects one sample from the random sample set of each tag in $C_M$ and combines them into the sub-graph G.
VOLUME 8, 2020
Let τ denote a threshold that limits the number of nodes in each connected component. In other words, if there are more than τ nodes in a connected component, the process of probing the neighbors stops. Note that the probing may stop even before reaching τ nodes if no new nodes can be activated. In the targeted sketching algorithm, the value of each connected component is the number of target nodes in that component. The influence spread of a node is computed by adding the values of all of the components containing that node.
The above procedures are repeated θ times and θ subgraphs of G are obtained.
In [49], [50], it was proved that when θ and τ are sufficiently large, the proposed targeted sketching algorithm returns near-optimal results with high probability.
The set L is constructed as follows: Starting from (θ ) , remove k nodes randomly, and replace them with k nodes chosen uniformly at random from V . To show that The inequality, It only remains to verify the truth of the former First, E ∅ (θ ) (v, L) represents the probability of node v being connected to one of the nodes in the random set L averaged over the θ copies $G^{(1)}, \ldots, G^{(\theta)}$. Consider each of the θ copies in the uncut sketch, $G^{(1)}, \ldots, G^{(\theta)}$, and the connections between node v and the optimal set (θ ) in these uncut copies. If these connections remain unchanged in the -cut copies $G^{(1)}, \ldots, G^{(\theta)}$, then with probability at least (1 − ) they remain unchanged after k nodes in (θ) are randomly replaced. However, if any of these connections are affected by the -cutting, then this is an indication that v belongs to a connected component of size τ . This connected component is large enough to contain one of the k nodes of L with probability at least . Indeed, the probability that one of the τ nodes is chosen is at most :
Based on the above analysis, the algorithm of targeted sketching is as follows:
Step 1: Input a social network G, a set C M of tags.
Step 2: Keep only the edges associated with a tag c ∈ C from G.
Step 3: Probe every node by trying to activate its neighbors with probability $P(e \mid c) \cdot \mathrm{Sim}(v_{tag}, M_{tag})$.
Step 4: Stop probing if no more new nodes can be activated or if the size of the connected component exceeds τ.
Step 6: Retain all nodes of G.
Step 7: Repeat Steps 2-6 and obtain random samples of every tag c.
Step 8: Randomly select one sample from the random sample set of each tag $c \in C_M$, and combine them together.
Step 9: Repeat Step 8 θ times and obtain θ sub-graphs of G.
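The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure under several assumptions, not the authors' implementation: the similarity values $\mathrm{Sim}(v_{tag}, M_{tag})$ are assumed to be given in a dictionary `sim`, probing is done by a breadth-first search, and the names `sample_tag_graph`, `build_sketches` and `samples_per_tag` are hypothetical.

```python
import random
from collections import deque

def sample_tag_graph(nodes, tag_edges, sim, tau, rng):
    """Build one random sample (a list of kept edges) for a single tag c.

    tag_edges: dict u -> list of (v, P(e | c)) for edges associated with c.
    sim:       dict v -> Sim(v_tag, M_tag), assumed to be precomputed.
    tau:       probing stops once a component holds more than tau nodes.
    """
    kept, probed = [], set()
    for start in nodes:
        if start in probed:
            continue
        probed.add(start)
        component, queue = {start}, deque([start])
        while queue and len(component) <= tau:
            u = queue.popleft()
            for v, p in tag_edges.get(u, []):
                if len(component) > tau:
                    break
                # Keep the edge with probability P(e | c) * Sim(v_tag, M_tag).
                if v not in probed and rng.random() < p * sim.get(v, 0.0):
                    kept.append((u, v))
                    probed.add(v)
                    component.add(v)
                    queue.append(v)
    return kept

def build_sketches(nodes, edges_by_tag, message_tags, sim, tau, theta,
                   samples_per_tag, rng):
    """Return theta sub-graphs, each the union of one sample per message tag."""
    pool = {
        c: [sample_tag_graph(nodes, edges_by_tag.get(c, {}), sim, tau, rng)
            for _ in range(samples_per_tag)]
        for c in message_tags
    }
    sketches = []
    for _ in range(theta):
        union = set()
        for c in message_tags:
            union.update(rng.choice(pool[c]))   # one random sample per tag
        sketches.append(union)
    return sketches
```

Building the per-tag sample pools once and reusing them across messages is what allows the combination step to be cheap at query time.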

B. SEED SELECTION
This section provides the algorithm for targeted influence maximization on the sampled graphs. The implementation of this algorithm is based on the structure of the θ sub-graphs of G for a specific set of information tags. Since the targeted influence spread on the sub-graphs is submodular, the seed selection phase applies a greedy algorithm to find a seed set with an approximation guarantee of $(1 - 1/e - \epsilon)$ to the optimal solution. First, the connected components of each of the θ sub-graphs are found by using a graph search (e.g., BFS). Then, the number of target nodes in each connected component is counted as the value of that component. The main idea of the proposed method is thus to find a seed set S that maximizes the total value of all connected components containing at least one seed node. If a connected component already contains some node in S, then the value of that component is set to zero. The details of the algorithm are described as follows.
Step 1: Input the θ sub-graphs of G, a positive integer k, a set $C_M$ of tags, and a target set T.
Step 2: Find the connected components of the θ sub-graphs.
Step 3: For every connected component in each of the θ sub-graphs, initialize the current value of the component as the number of target nodes in that component.
Step 4: Initialize a node set $S^* = \emptyset$.
Step 5: Select the node v that maximizes the total current value of the connected components containing it.
Step 6: Insert v into $S^*$ and set the current value of the connected components containing v to zero.
Step 7: Repeat Steps 5-6 until k seed nodes are found.
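The seed selection steps can be sketched in Python as a greedy maximum coverage over component values. This is a minimal sketch under assumptions: each sub-graph is given as a collection of edges treated as undirected when forming connected components, every target is a member of `nodes`, and the function names are hypothetical.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Map each node to a component label using a graph search."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    label = {}
    for s in nodes:
        if s in label:
            continue
        label[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in label:
                    label[w] = s
                    stack.append(w)
    return label

def select_seeds(nodes, sketches, targets, k):
    """Greedily pick k seeds that maximize the total covered component value."""
    labels = [connected_components(nodes, edges) for edges in sketches]
    # Current value of a component = number of target nodes it contains.
    values = []
    for label in labels:
        value = defaultdict(int)
        for t in targets:
            value[label[t]] += 1
        values.append(value)
    seeds = []
    for _ in range(min(k, len(nodes))):
        def gain(v):
            return sum(values[i][labels[i][v]] for i in range(len(labels)))
        best = max((v for v in nodes if v not in seeds), key=gain)
        seeds.append(best)
        # Components that now contain a seed contribute nothing further.
        for i in range(len(labels)):
            values[i][labels[i][best]] = 0
    return seeds
```

Setting covered component values to zero is what makes the marginal gain of later candidates shrink, so the greedy choice never counts the same target twice within one sub-graph.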

V. EXPERIMENTAL ANALYSIS
This section experimentally evaluates the targeted influence maximization algorithms against state-of-the-art techniques. All the experiments are conducted on a machine with an Intel Core 2.2GHz CPU. All algorithms are implemented in C++.

A. EXPERIMENTAL SETTINGS
1) DATASETS
To demonstrate the effectiveness and efficiency of the proposed algorithm, five real-world networks are employed in the experiments. The details of the datasets are given in Table 1.

3) ALGORITHMS COMPARED
In this paper, the proposed algorithm is compared with BKRIS [56], D-SSA [57] and IRIE [36]. BKRIS and D-SSA are the best current algorithms using a sketching technique, and IRIE is one of the fastest heuristic methods. For BKRIS, when estimating the sample size, the two parameters are set to 0.5 and 1, respectively; bk is set to 16 by default for all the datasets. For D-SSA, the parameter $\epsilon$ is set to 0.1 and $\delta$ is set to $1/n$, which are the default settings suggested by the authors. For IRIE, the internal parameters are set as recommended in the original paper. For the proposed algorithm, the parameters ρ and σ are set to 0.02 and 100, respectively.

4) PARAMETERS SETTINGS
In the experiments, the seed set size k is varied from 10 to 5000. To demonstrate the scalability of the algorithms, only the seed size k = 5000 is considered.
The results are shown in Fig. 1. In most cases, IRIE returns a result whose influence is close to that of BKRIS and the new algorithm, so the influence spread is not shown here. The experimental results in Fig. 1 show that the new algorithm developed in this paper consistently outperforms IRIE, and that BKRIS always runs faster than IRIE. Since BKRIS consistently outperforms IRIE, the following experiments compare the proposed algorithm only with BKRIS. Next, an experiment is conducted on all the datasets to compare the performance of BKRIS and the new algorithm, with k set to 5000. Fig. 2 shows that the new algorithm outperforms BKRIS on running time by a wide margin, with a speedup of up to two orders of magnitude across the datasets. Fig. 3 reports the running time on all the datasets as k varies from 10 to 5000. As shown in Fig. 3, the new algorithm significantly outperforms BKRIS under all settings, achieving a speedup of two orders of magnitude.

C. COMPARING IN INFLUENCE SPREAD
In this section, the accuracy of BKRIS and the proposed algorithm is compared through the influence spread of the obtained seed sets. In Figure 4, the effectiveness ratio is evaluated on all the datasets with k set to 5,000. The effectiveness ratio of BKRIS is close to 1 on all the datasets, which shows that the two methods perform similarly in terms of influence spread. Considered alongside the running times, the results in Figure 4 show that the new algorithm is superior to BKRIS, because it achieves a similar influence spread in much less running time. For instance, on the largest dataset, Friendster, BKRIS takes more than 10 times as long as the new algorithm to achieve the same result. Fig. 5 reports the influence spread on all the datasets as k varies from 10 to 5000. As shown in Fig. 5, the two algorithms achieve similar influence spread.

VI. CONCLUSION
This paper considers targeted influence maximization based on cloud computing. The goal is to find the optimal seed set that maximizes the topic-aware influence within target customers. A new topic-aware model is proposed in which both the users and the user-to-user influence strength are topic-aware. Since targeted influence maximization is an NP-hard problem, algorithms with an approximation guarantee are provided. As obtaining full knowledge of the network structure is very costly in practice, a targeted sketching technique based on partial network information is presented. First, random samples are built for each topic under the proposed tag-aware IC model. Then, a targeted sketch is obtained by combining the corresponding samples for the given topics using the topic aggregation technique. Based on this, the greedy algorithm is employed to find the Top-k seed users. Performance is evaluated on several real complex social networks with billions of edges and hundreds of topics. The experimental results confirm the effectiveness and efficiency of the proposed algorithms.