Multi-Topic Misinformation Blocking With Budget Constraint on Online Social Networks

With the development of information technology, Online Social Networks (OSN) have grown steadily and become popular media worldwide. Besides their benefits for communication, OSN also enable the rapid spread of false information such as rumors, fake news, and contradictory news. The spread of false information is collectively referred to as misinformation, which has a significant impact on social communities. The more sources and topics of misinformation there are, the greater the number of affected users. Therefore, it is necessary to prevent the spread of misinformation with multiple topics within a given period of time. In this paper, we propose a Multiple Topics Linear Threshold model for misinformation diffusion, and define a misinformation blocking problem based on this model that takes into account multiple topics and a budget constraint. The problem is to find a set of nodes whose blocking from the network minimizes the impact of misinformation at an allowed cost. We prove that the problem is NP-hard and that calculating the objective function is $\#P$-hard. We also prove that the objective function is monotone and submodular. Based on these properties, we propose an approximation algorithm with approximation ratio $(1-1/\sqrt{e})$. For large networks, we propose an extended algorithm that uses a tree data structure for quickly updating and calculating the objective function. Experiments conducted on real-world datasets show the efficiency and effectiveness of our proposed algorithms in comparison with other state-of-the-art algorithms.

In OSN, information can spread very quickly through network connections. Topics discussed and diffused on OSN can be everything from political comments, business marketing, personal concerns, to entertainment gossip. Besides many positive benefits, OSN can also bring risks to users by spreading fake news and wrong information [1], [2]. Being interested in mitigating the misinformation risks, in this paper we study the problem of modelling misinformation diffusion and propose effective methods to detect misinformation sources and limit its spread.
There have been previous studies on minimizing the impact of misinformation diffusion in OSN [3]-[6]. A commonly used method in these studies is to disable users and connections that are considered to play major roles in spreading misinformation [7]. Finding the users and connections to be disabled is addressed by solving a combinatorial optimization problem. Most of these studies, however, consider only a single source of misinformation belonging to only one topic. In this paper we consider a more realistic scenario where multi-topic misinformation can reach and affect users at the same time. This problem setting poses significant challenges. First, the impacts of multi-topic misinformation have been shown to be heterogeneous [8], [9], and the outcomes of the model must be re-defined. Second, when a node can adopt multiple topics, it has been shown that the overall influence function that counts activated nodes is no longer submodular, which is a key property for devising a good approximation of the optimal solution.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We develop a new model of misinformation diffusion blocking with multiple topics and a budget constraint. An important characteristic of the model is that a node in the network can be activated multiple times by multiple topics. The objective function, which is the influence function, is defined based on this setting and is proved to be monotone and submodular. Effective approximation methods for minimizing misinformation spread are then derived from these monotone and submodular properties of the objective function.
The features and main contributions of this paper are as follows:
• A Multiple Topics Linear Threshold (MT-LT) model is developed by extending the Linear Threshold model [10]. Multi-topic misinformation diffusion is modelled based on different degrees of influence and activation thresholds for each topic. Then, a misinformation blocking problem, also called the Multiple Topics and Budget Constraint (MMTB) problem, is formulated under the MT-LT setting.
• We show that the MMTB problem is NP-hard and the calculation of the objective function is #P-hard. We also show that the objective function is monotone and submodular.
• Based on the monotone and submodular properties of the objective function, we propose efficient and effective algorithms for solving the MMTB problem. The first algorithm, called IGA, is a greedy approximation algorithm with approximation ratio $(1-1/\sqrt{e})$. The second algorithm, called GEA, is capable of running on large OSN by using a tree data structure for quickly updating and calculating the objective function. The proposed algorithms are tested on real-world datasets including Gnutella, NetHepPh and Epinions. Experimental results show that our proposed algorithms outperform other algorithms in terms of both efficiency and scalability. In particular, IGA is more effective in preventing the spread of misinformation by blocking super-influencing nodes, and GEA can be applied to medium and large networks.
Organization: The structure of the paper is organized as follows. Section I introduces an overview of the proposed work. Section II presents related works. Section III introduces the network model and research problem. Section IV presents our proposed algorithms and section V provides experimental results on some selected datasets. Section VI concludes the paper.

II. RELATED WORKS
Kempe et al. [10] formulated a stochastic discrete optimization problem under the Independent Cascade model (IC model) and the Linear Threshold model (LT model). Kempe's research was inspired by Domingos and Richardson's research on information spreading using data mining techniques [11]. The problem in [10] is formulated as follows: given a network, a diffusion model, a set of influence weights for edges, a random threshold function, and a budget, select nodes so that the final number of infected nodes is maximized. For this problem, Kempe et al. proposed a greedy algorithm with an approximation guarantee of $(1 - 1/e)$. Later, many studies on information diffusion and misinformation spreading prevention problems on online social networks have been undertaken [12]-[14]. The authors in [15] studied the problem of eliminating a set of k edges so that the influence of the source set S is minimal, and introduced an algorithm with approximation ratio $(1 - 1/e - \epsilon)$, where e is the base of the natural logarithm and $\epsilon \in [0, 1]$. From the epidemiological perspective, some authors injected immunization vaccines into sets of vertices to make them immune to bad information [4], [5], [16]. The authors in [17], [18] studied the DAVA (Data-Aware Vaccination) problem, which requires injecting the vaccine into k vertices of the user set. In [19], the authors extended the DAVA problem by adding time to the spreading of the disease.
On the other hand, many studies followed an approach of spreading good information to prevent the impact of bad information, called the information purification method [3], [20], [21]. The authors in [22] proposed the MCIC (Multi-Campaign Independent Cascade) information diffusion model, which allows multiple sources of information to be spread simultaneously on the same network. For the same purpose, in [23] the authors studied how to prevent the influence of misinformation on the linear competition model. In addition, the authors in [6] studied the TIB (Temporal Influence Blocking) problem to limit misinformation by time delay. The authors in [24] studied the $\beta_T^I$ problem, with the goal of selecting the smallest seed set from which to start spreading good information to eliminate bad information.
Recently, the author in [25] studied misinformation containment with multiple cascades. In [26] the authors investigated rumor blocking within a given community. In [27] the authors proposed a scalable algorithm which guarantees an approximation ratio of $(1 - 1/e - \epsilon)$ for the epidemic blocking problem by blocking edges and nodes. In [28] the authors studied influence blocking that considers the location of competitors. The authors in [29], [30] proposed a method for misinformation prevention by eliminating nodes in multiple contexts. Furthermore, several studies have focused on identifying and detecting misinformation, which is an important step for misinformation prevention. The authors in [31], [32] relied on structure and language characteristics to identify false information. Some studies used data mining and machine learning methods to detect misinformation by analyzing user behavior such as shares, comments, and likes [33]-[35].

III. MODEL AND PROBLEM FORMULATION
The IC model and the LT model are two of the most widely used models in research on information diffusion in online social networks. In the IC model, an active node u may attempt to activate a neighboring inactive node v only once, succeeding with probability p(u, v). The IC model can be seen as a sender-centric model. In the LT model, every node contributes to the activation of its neighbors, so the LT model can be considered a receiver-centric model. For the problem of preventing the spread of misinformation, the LT model is better suited than the IC model. The collective contribution of active nodes in activating their neighbors in the LT model can be seen as a herding effect, which is very close to the mechanism of spreading false rumors, where a decision is more likely to be made by mimicking the decisions of others. In this section, we formulate a Multiple Topics Linear Threshold (MT-LT) model by extending the LT model. This MT-LT model considers multi-topic misinformation diffusion with a budget constraint. We first present the traditional LT model. All the symbols and notations used in the paper are given in Table 1.

A. INFORMATION DIFFUSION MODEL
1) LINEAR THRESHOLD MODEL
In the LT model, an online social network is represented by a graph G = (V, E, w) in which V is the node set, E is the directed edge set, |V| = n, |E| = m, and $N_{in}(v)$ and $N_{out}(v)$ are the sets of incoming and outgoing neighbors of node v, respectively. Each edge (u, v) ∈ E is assigned a weight w(u, v) ∈ [0, 1] representing the influence of node u on node v; if (u, v) ∉ E then w(u, v) = 0. Weights are distributed such that the sum of the weights from the incoming neighbors of a node v satisfies the following condition:

$$\sum_{u \in N_{in}(v)} w(u, v) \le 1.$$

Suppose that $S_0 \subseteq V$ is the set of nodes which spread misinformation; it is called the seed set. In the LT model, each node is in one of two states: active or inactive.
Each node v ∈ V has an activation threshold $\gamma_v \in [0, 1]$. If $\gamma_v$ is large, many neighbor nodes are required to activate v; if $\gamma_v$ is small, node v can be easily activated by its neighbors. In many related works, threshold values are drawn randomly from the interval [0, 1]. In practice, threshold values can be learned via data mining techniques based on past user actions. Thus, threshold values can be viewed as an input to the model instead of assuming a random threshold function. Let $D_t(G, S)$ be the set of nodes activated by S at time step t in graph G(V, E, w). The LT model operates in discrete time steps as follows:
• At time step t = 0, the set of nodes in the active state is the source of the original information diffusion $S_0$ (the seed set).
• At time step t ≥ 1, all nodes activated by S in time step t − 1 are still active. A node v currently not activated by S becomes activated if the total weight from its active incoming neighbors reaches its threshold:

$$\sum_{u \in N_{in}(v),\ u\ \text{active at step}\ t-1} w(u, v) \ge \gamma_v.$$

• The diffusion process ends when no node is activated in the next step.
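The time-step rules above can be sketched in a few lines of Python. This is only an illustrative sketch with fixed thresholds given as input; the function name `simulate_lt`, the dictionary layout, and the toy graph are assumptions made for the example and do not come from the paper.

```python
def simulate_lt(in_weights, thresholds, seeds):
    """Run the LT process until no new node is activated.

    in_weights[v] maps each incoming neighbor u of v to w(u, v).
    thresholds[v] is the activation threshold of v (assumed > 0).
    Returns a list whose t-th entry is the set of nodes newly
    activated at time step t (entry 0 is the seed set).
    """
    active = set(seeds)
    steps = [set(seeds)]
    while True:
        newly = set()
        for v, nbrs in in_weights.items():
            if v in active:
                continue
            # Total weight coming from neighbors active at the previous step.
            total = sum(w for u, w in nbrs.items() if u in active)
            if total >= thresholds[v]:
                newly.add(v)
        if not newly:
            break
        active |= newly
        steps.append(newly)
    return steps


# Hypothetical toy network: s -> a, s -> b, a -> b, b -> c.
in_weights = {
    "a": {"s": 0.6},
    "b": {"s": 0.3, "a": 0.4},
    "c": {"b": 0.5},
}
thresholds = {"a": 0.5, "b": 0.6, "c": 0.9}
steps = simulate_lt(in_weights, thresholds, {"s"})
print(steps)
```

In this toy run node a is activated at step 1, node b at step 2 (once both s and a are active), and node c never reaches its threshold.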

2) MULTIPLE TOPICS LINEAR THRESHOLD
The LT model considers the diffusion of a single topic, or a single information cascade. Motivated by the LT model, this paper studies a more realistic scenario in which multiple topics are being diffused. Topics may have different characteristics, such as their content and impressiveness. When there are multiple topics, we need to redefine the outcomes of the model when information on two or more topics reaches one user at the same time. The LT model cannot be applied directly to multi-topic information diffusion because it is hard to capture the complex correlations between topical cascades. Earlier researchers have worked on scenarios where more than one topic is being diffused. When multiple topics exist, the influence maximization problem can be elusive, as it may not even be monotone [25]. When a node can adopt multiple cascades, it has been shown that the overall influence function that counts activated nodes is no longer submodular. In this paper, we deal with this problem by developing a new model of misinformation diffusion blocking with multiple topics and by defining an overall influence function that counts activated turns instead of activated nodes.
In the MT-LT model, a social network is also represented by a graph G = (V, E, w) where V is the set of nodes, E is the set of edges, |V| = n, |E| = m, and $N_{in}(v)$ and $N_{out}(v)$ are the sets of incoming and outgoing neighbors of node v, respectively. Weights are distributed such that the sum of the weights from nodes u to node v satisfies $\sum_{u \in N_{in}(v)} w(u, v) \le 1$. Suppose that there are q misinformation topics, such as Economics, Politics, Sports, and so on, and that the set of misinformation-spreading nodes is $S = \{S_1, S_2, \ldots, S_q\}$, where $S_i$ contains the nodes spreading information on topic i (referred to as source nodes). We can assume that the social network administrator knows where the sources of misinformation are. The set of all source nodes spreading misinformation on the q topics is $\bigcup_{i=1}^{q} S_i$. Each node v ∈ V may be activated multiple times by multiple topics. That is, node v may have multiple statuses from the set of q + 1 statuses Q = {inactive, active-1, active-2, ..., active-q}, which describes the behavior and activity of v. If node v is inactive, then v has not been activated by any topic; if v is active-i, then it has been activated by topic i. If node v has the statuses active-1, active-2, ..., active-k, 1 ≤ k ≤ q, then it has been activated by k topics.
In practice, the impact weight among nodes depends on the topic. For example, a spreading topic about a plague may have a greater impact on a user than a topic about a sports game. Therefore, a node v is assigned a vector of activation thresholds $\gamma_v = (\gamma_v^1, \gamma_v^2, \ldots, \gamma_v^q)$, where $\gamma_v^i \in [0, 1]$ is the threshold of v for topic i, and a vector of influence degrees $p_v = (p_v^1, p_v^2, \ldots, p_v^q)$, where $p_v^i \in [0, 1]$ represents the effect of v on its neighboring nodes for topic i.
The process of information spread in the MT-LT model occurs in separate time steps t = 1, 2, ..., d, where $d \in \mathbb{Z}^{+}$. We consider the same allowed period for each step of information spread, because the neighbors of a node might not all influence it simultaneously, but rather within a certain time window. Let $D_t^i(G, S)$ be the set of nodes activated by $S_i$ at time step t in graph G.
• At time step t = 0, all nodes in $S_i$ have the status active-i.
• At time step t ≥ 1, all nodes activated by $S_i$ in time step t − 1 are still active. A node v currently not activated by $S_i$ becomes active-i if the following condition is satisfied:

$$\sum_{u \in N_{in}(v),\ u\ \text{is active-}i} w(u, v) \cdot p_u^i \ge \gamma_v^i.$$

The spread process ends when no node is activated in the next step. The LT model is the special case of the MT-LT model with q = 1 and $p_u^1 = 1$ for every node u. Let $D_i(G, S)$ be the total number of nodes activated by topic i after the spreading process ends. $D_i(G, S)$ is calculated by summing $D_t^i(G, S)$ over all time steps:

$$D_i(G, S) = \sum_{t=0}^{d} |D_t^i(G, S)|.$$

The total number of activated turns by all topics after the spreading process ends, denoted by D(G, S), is given by:

$$D(G, S) = \sum_{i=1}^{q} D_i(G, S).$$

In this setting a node can be activated multiple times by multiple topics, not just once as in previous works. With this setting we can prove the monotone and submodular properties of the overall influence function. These properties are important because they help to devise effective approximation algorithms.
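To make the notion of activated turns concrete, the following Python sketch runs one threshold process per topic and adds up the per-topic counts. It assumes, as an interpretation of the model, that topic i uses the scaled weight w(u, v) times a per-topic influence degree of u (written `influence[u][i]` here) and a per-topic threshold of v (`thresholds[v][i]`); the function name and the toy data are hypothetical, and source nodes are counted as activated at step 0.

```python
def simulate_mt_lt(in_weights, influence, thresholds, sources):
    """Return ([D_1, ..., D_q], D) for a multi-topic threshold process.

    in_weights[v][u] = w(u, v); influence[u][i] = influence degree of u
    on topic i; thresholds[v][i] = threshold of v for topic i;
    sources[i] = set of source nodes of topic i.
    """
    per_topic = []
    for i, seeds in enumerate(sources):
        active = set(seeds)  # nodes with status active-i
        while True:
            newly = set()
            for v, nbrs in in_weights.items():
                if v in active:
                    continue
                total = sum(
                    w * influence[u][i] for u, w in nbrs.items() if u in active
                )
                if total >= thresholds[v][i]:
                    newly.add(v)
            if not newly:
                break
            active |= newly
        per_topic.append(len(active))
    # A node activated by k topics contributes k activated turns.
    return per_topic, sum(per_topic)


# Hypothetical example with q = 2 topics.
in_weights = {"a": {"s1": 0.5, "s2": 0.5}, "b": {"a": 1.0}}
influence = {
    "s1": [1.0, 0.2],
    "s2": [0.4, 1.0],
    "a": [0.8, 0.5],
    "b": [1.0, 1.0],
}
thresholds = {"a": [0.4, 0.5], "b": [0.9, 0.5]}
sources = [{"s1"}, {"s2"}]
per_topic, total = simulate_mt_lt(in_weights, influence, thresholds, sources)
print(per_topic, total)
```

Here node a is activated by both topics and therefore counts as two turns, while node b is only activated by the second topic.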

B. PROBLEM DEFINITION
In this paper, we aim to block a set of nodes in the graph G so that the final number of infected turns is minimized. A similar optimization objective can be found in other works [5], [16], [36]. A blocked node cannot be infected by any other node, and it cannot infect any other node either.
To block a node v in a graph, we simply set the weights for all incoming edges to v and outgoing edges from v to zero.
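The blocking operation, and the resulting reduction in the number of activations that the problem seeks to maximize, can be illustrated as follows. This is a single-topic sketch under stated assumptions: thresholds are fixed and strictly positive, edges with zero weight are simply dropped, and the names `count_activated`, `block_nodes`, and `reduction` are hypothetical.

```python
def count_activated(in_weights, thresholds, seeds):
    """Number of active nodes (seeds included) when the process stops.

    Nodes are updated in place; since activation is monotone, the final
    active set is the same as with synchronous time steps.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in in_weights.items():
            incoming = sum(w for u, w in nbrs.items() if u in active)
            if v not in active and incoming >= thresholds[v]:
                active.add(v)
                changed = True
    return len(active)


def block_nodes(in_weights, blocked):
    """Drop (i.e. set to zero) every edge entering or leaving a blocked node."""
    return {
        v: {} if v in blocked
        else {u: w for u, w in nbrs.items() if u not in blocked}
        for v, nbrs in in_weights.items()
    }


def reduction(in_weights, thresholds, seeds, blocked):
    """Activations before blocking minus activations after blocking."""
    before = count_activated(in_weights, thresholds, seeds)
    after = count_activated(block_nodes(in_weights, blocked), thresholds, seeds)
    return before - after


# Hypothetical path s -> a -> b -> c with positive thresholds.
in_weights = {"a": {"s": 1.0}, "b": {"a": 1.0}, "c": {"b": 1.0}}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5}
print(reduction(in_weights, thresholds, {"s"}, {"a"}))
print(reduction(in_weights, thresholds, {"s"}, {"c"}))
```

Blocking the node closest to the source protects the whole path behind it, whereas blocking the last node only saves that node itself.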
Given a graph G, we denote by $G_A$ the graph after blocking a set of nodes A. The number of all activated turns by all topics after blocking the set of nodes A is given by

$$D(G_A, S) = \sum_{i=1}^{q} D_i(G_A, S).$$

The objective here is to minimize $D(G_A, S)$. This is equivalent to maximizing the following quantity:

$$\sigma(G, S, A) = D(G, S) - D(G_A, S). \qquad (2)$$

Suppose that blocking a node u costs c(u) and that the total cost cannot exceed the budget limit B. The MMTB problem can now be formulated as follows.
Definition 1: (MMTB). Given a social network represented by a weighted graph G(V, E, w) and a source of misinformation with q topics given by $S = \{S_1, S_2, \ldots, S_q\}$, where $S_i$ contains the source nodes of topic i, the task is to find a set of nodes A to block so that the quantity σ(G, S, A) is maximized under the budget limit B, that is, $c(A) = \sum_{u \in A} c(u) \le B$. We prove that MMTB is NP-hard under the MT-LT setting.
Theorem 1: The MMTB is an NP-hard problem.
Proof: To prove that MMTB is NP-hard, we give a reduction from the well-known Knapsack problem, which is NP-complete.
Knapsack problem: Given a set Q of n items, where each item i has a weight $w_i$ and a value $c_i$ ($c_i$ and $w_i$ are integers), and two positive integers W and C, the problem is to find a vector $x = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ such that $\sum_{i=1}^{n} w_i x_i \le W$ and $\sum_{i=1}^{n} c_i x_i \ge C$.
Let $I_1 = (Q, W, C)$ be an instance of the Knapsack problem, and $I_2 = (G, S, B)$ be an instance of the MMTB problem, where S is the set of misinformation source nodes and B is the budget limit. We construct a reduction from $I_1$ to $I_2$ as shown in Fig. 1.
Reduction: To construct the reduction, we build a graph G(V, E, w) satisfying the MT-LT model as follows. The source set consists of a single node, S = {s}. For each $c_i$ (the value of the i-th item) we create a path of $c_i + 1$ nodes (Fig. 1); we write $u_{i,j}$ for the j-th node on the i-th path.
We prove that $I_1$ has a solution $x = (x_1, x_2, \ldots, x_n)$ if and only if $I_2$ has a corresponding solution A.
(→) According to the MT-LT model, when the node $u_{i,1}$ is blocked, all subsequent nodes on the i-th path can no longer be activated. Therefore, if x is a solution of $I_1$, then the set $A = \{u_{i,1} \mid x_i = 1\}$ is a solution of $I_2$.
(←) Conversely, if A is a solution of $I_2$ then A cannot contain a node $u_{i,j}$ with j ≥ 2, because the cost of blocking such nodes would exceed B. For $I_1$ we select the vector $x = (x_1, x_2, \ldots, x_n)$ with $x_i = 1$ if $u_{i,1} \in A$ and $x_i = 0$ otherwise; then x is a solution of $I_1$. In other words, if we can find an optimal solution of the MMTB problem, we can find an optimal solution of the Knapsack problem. Thus the MMTB problem is NP-hard.
We now prove that the problem of calculating the objective function in Eq. (2) is #P-hard.
Theorem 2: The problem of calculating the function σ(·) is #P-hard in MT-LT even if the set A has only one node.
Proof: We prove that the calculation of the objective function is #P-hard even in the case where the set S has only one node. Let P(G, s) be the set of all simple paths of G starting from s (simple paths are paths that visit each node at most once), and let $P(G_A, s)$ be the set of all simple paths of G starting from s when A is blocked. Then $\sigma(G, S, A) = D(G, S) - D(G_A, S)$ is exactly the number of nodes in P(G, s) minus the number of nodes in $P(G_A, s)$. If we can calculate the number of nodes in P(G, s), then we can also count the number of simple paths in P(G, s). Counting all such simple paths is exactly the s-t paths problem, which was proved #P-hard by Valiant [37]. Therefore, our problem is also #P-hard.

IV. PROPOSED ALGORITHMS
In this section we propose two algorithms to solve the MMTB problem. Both algorithms follow a greedy approach. The first algorithm, called the Improved Greedy Algorithm (IGA), is based on the ratio between the increase of the target function and the cost of blocking a node, and ensures an approximation ratio of $(1-1/\sqrt{e})$. The second algorithm, called the Greedy Extension Algorithm (GEA), is based on the idea of quickly updating the target function and on an approximate average denominator calculation method.
A. IMPROVED GREEDY ALGORITHM-IGA
First, we show that the target function σ(G, S, A) is monotone and submodular. Based on these features, and by adopting the greedy strategy proposed in [19], we are able to obtain an algorithm with approximation ratio $(1-1/\sqrt{e})$. The proposed algorithm is called IGA (Improved Greedy Algorithm).
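From each original graph G = (V, E, w) under the MT-LT model, we construct q graphs $G_i = (V, E, w_i)$, $i = 1, 2, \ldots, q$, in which each edge (u, v) carries the topic-specific weight $w_i(u, v) = w(u, v) \cdot p_u^i$. We show that the number of turns activated by topic i on graph G in the MT-LT model with source S is equal to the number of nodes activated on graph $G_i$ in the LT model with source $S_i$, for any i = 1, 2, ..., q. This result is proved in the following lemma.
Lemma 1: For every i = 1, 2, ..., q, the number of nodes activated by topic i on G under the MT-LT model with source S equals the number of nodes activated on $G_i$ under the LT model with source $S_i$.
Proof: Because $p_u^i \le 1$, for each node v of $G_i$ we have $\sum_{u \in N_{in}(v)} w_i(u, v) = \sum_{u \in N_{in}(v)} w(u, v) \cdot p_u^i \le \sum_{u \in N_{in}(v)} w(u, v) \le 1$. Hence $G_i$ is a valid instance of the LT model, and the activation condition of topic i on G coincides with the LT activation condition on $G_i$.
Lemma 2: Under the LT model, the reduction in the number of activated nodes obtained by blocking a set of nodes A is monotone and submodular in A.
Proof: Let E(A) be the set of edges which have at least one endpoint in the node set A. For node sets A ⊆ T and a node v, let $E_{T,v}$ be the set of edges connecting to v but not to any node in the set T, and let $E_{A,v}$ be the set of edges connecting to v but not to any node in the set A. We have $E_{T,v} \subseteq E_{A,v}$ for A ⊆ T. We easily see that $E(A) \cup E_{T,v} \subseteq E(A + \{v\})$. Consider two sets of edges X, Y with X ⊆ Y ⊂ E, and an edge e ∈ Y \ X. Theorem 6 in [16] gives an inequality on the effect of removing e in addition to X and in addition to Y. Applying this inequality to the edge sets above completes the proof.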
Theorem 3: The function σ (·) is submodular and monotone on the MT-LT model.
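Proof: From the definition of σ(G, S, A) in Eq. (2), we have $\sigma(G, S, A) = \sum_{i=1}^{q} \sigma_i(G, S_i, A)$, where $\sigma_i(G, S_i, A) = D_i(G, S) - D_i(G_A, S)$ is the reduction in the number of nodes activated by topic i. By Lemma 1, this quantity equals the corresponding reduction on the graph $G_i$ under the LT model, which is monotone and submodular. Therefore, $\sigma_i(G, S_i, A)$ is a monotone and submodular function. σ(G, S, A) is a sum of monotone and submodular functions, so it is also a monotone and submodular function.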

Algorithm 1 Improved Greedy Algorithm (IGA)
Input: G = (V, E, w), source set S, budget B > 0
Output: set of nodes A

Based on the results of Theorem 3 and using the greedy strategy proposed in [38], we propose an improved greedy algorithm called IGA that has approximation ratio $(1-1/\sqrt{e})$ (Algorithm 1). The algorithm consists of two phases. The first phase uses a greedy strategy to find the set of nodes A to block. In each step, we choose the node v for which δ(v) is the largest, where δ(v) is the increase of the target function per unit of blocking cost:

$$\delta(v) = \frac{\sigma(G, S, A \cup \{v\}) - \sigma(G, S, A)}{c(v)}.$$

The process ends when the cost of blocking nodes exceeds the allowed budget B. In the second phase, we consider the node $v_{max}$ with the largest $\sigma(G, S, \{v_{max}\})$ among the nodes whose blocking cost is at most B. The set A obtained in the first phase is then compared with $v_{max}$ to obtain the best answer.
It is easy to see that in the worst case, algorithm IGA can take up to $k^2$ iterations to re-calculate σ(G, S, A), where k is the number of activated turns on the q topics. However, calculating the exact number of activated turns is #P-hard. To address this, we use the Monte Carlo (MC) simulation method to estimate the target function (Algorithm 2). For each set $S_i$, i = 1, 2, ..., q, we run the MC simulation T times to simulate the random information spreading process. Each time, the number of turns activated by topic i is calculated, and then the average over the T simulations is taken. Finally, we obtain the average number of activated turns on the q topics. The larger the number of simulations T, the higher the accuracy of the estimate.
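The two-phase selection described above can be sketched generically in Python. This is not the paper's Algorithm 1 (whose listing is not reproduced here) but a minimal illustration of the cost-benefit greedy rule followed by the best-single-node check; the objective is passed in as an arbitrary callable `sigma`, and the function name, the toy objective, and the toy costs are all assumptions made for the example.

```python
def budgeted_greedy(nodes, cost, budget, sigma):
    """Phase 1: greedy by marginal gain per unit cost.
    Phase 2: compare with the best affordable single node."""
    chosen, spent = set(), 0.0
    remaining = set(nodes)
    while remaining:
        base = sigma(chosen)
        # Node with the largest gain-to-cost ratio (ties broken by name).
        best = max(
            sorted(remaining),
            key=lambda v: (sigma(chosen | {v}) - base) / cost[v],
        )
        remaining.discard(best)
        if spent + cost[best] <= budget:
            chosen.add(best)
            spent += cost[best]
    affordable = [v for v in nodes if cost[v] <= budget]
    if affordable:
        v_max = max(sorted(affordable), key=lambda v: sigma({v}))
        if sigma({v_max}) > sigma(chosen):
            return {v_max}
    return chosen


# Hypothetical monotone submodular objective: number of "saved" items.
covers = {"x": {1, 2, 3}, "y": {4, 5, 6, 7, 8, 9}}


def toy_sigma(blocked):
    saved = set()
    for v in blocked:
        saved |= covers[v]
    return len(saved)


cost = {"x": 1.0, "y": 3.0}
tight = budgeted_greedy(["x", "y"], cost, 3.0, toy_sigma)
loose = budgeted_greedy(["x", "y"], cost, 4.0, toy_sigma)
print(tight, loose)
```

With the tight budget the ratio rule alone would settle for the cheap node x, and the second phase is what recovers the better single node y; this is the reason the comparison step is part of the scheme.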

Algorithm 2 Algorithm to Estimate the Value of the Function
...
3. Simulate the misinformation propagation process from the source $S_i$ on graph $G_i$;
4. $N_i$ ← the number of nodes activated after the propagation has finished;
5. count ← count + $N_i$;
6. end
7. return count/T.

However, because calculating σ(G, S, A) is #P-hard, it is difficult to determine the required number of simulations. When we perform T MC simulations, the time complexity of IGA is $O(TRn^2)$, where R is the time complexity of one MC simulation. This means that IGA becomes impractical even on relatively small networks. For this reason, in the following subsection, we develop a more practical algorithm called GEA that can work on large networks.
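A Monte Carlo estimate of this kind might look as follows in Python. The sketch assumes that the randomness of each run comes from drawing every node's threshold uniformly at random, as in the classical LT model; the paper does not spell out the source of randomness at this point, so this is an assumption, as are the function name and the one-edge test graph.

```python
import random


def estimate_spread(in_weights, seeds, runs, rng):
    """Average number of activated nodes (seeds included) over `runs`
    simulations with thresholds drawn uniformly from (0, 1]."""
    count = 0
    for _ in range(runs):
        # 1 - random() lies in (0, 1], so no node has a zero threshold.
        thresholds = {v: 1.0 - rng.random() for v in in_weights}
        active = set(seeds)
        changed = True
        while changed:
            changed = False
            for v, nbrs in in_weights.items():
                if v in active:
                    continue
                total = sum(w for u, w in nbrs.items() if u in active)
                if total >= thresholds[v]:
                    active.add(v)
                    changed = True
        count += len(active)
    return count / runs


# Hypothetical graph with one edge s -> a of weight 0.5: the expected
# spread is 1 (the seed) + 0.5 (probability that a is activated).
estimate = estimate_spread({"a": {"s": 0.5}}, {"s"}, 20000, random.Random(7))
print(estimate)
```

The estimate tightens only slowly as the number of runs grows, which is why repeated re-estimation inside a greedy loop is expensive.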

B. GREEDY EXTENSION ALGORITHM-GEA
In this subsection we propose an extended version of the greedy algorithm IGA, called the Greedy Extension Algorithm (GEA). The algorithm GEA is based on the idea of calculating the average value of the denominator and quickly updating the target function σ(·). To do so, a tree structure is used to estimate and update σ(·) in each loop of the algorithm. We construct q graphs $G_i$, i = 1, 2, ..., q, according to the MT-LT model and the result of Lemma 1. Because the source set $S_i$ may contain more than one node, neighboring nodes of $S_i$ may be infected by several nodes of the same topic. In order to update the target function conveniently and preserve the spreading properties of the LT model, we merge the source nodes $S_i$ on graph $G_i$ into a single node $H_i$ and obtain the merged graph $G_i$ (Algorithm 3). Lemma 3 shows that the two expressions before and after the conversion are equivalent.

Algorithm 3 Algorithm of Merging Vertices
...
4. if there exists an edge (x, v) then
5. if $H_i \in G_i$ then
6. Add edge ($H_i$, v);
...
10. Blocking (x, v) from $S_i$;
11. end
12. end
13. Blocking all nodes $S_i$ from $G_i$;
14. Return $G_i$, $H_i$.

Lemma 3: For any graph $G_i$, the value of the function σ(·) is the same before and after merging by Algorithm 3, where $H_i$ is the unified source node of the nodes in $S_i$.
Proof: To prove this lemma, we prove that the function σ(G, S, A) on the two expressions is the same. Assume that v is a node adjacent to the set S. When S has a single node u, it is obvious that the influence from S to v is $w(u, v) \cdot p_u^i = w(H_i, v)$. When the set S has k source nodes $u_1, u_2, \ldots, u_k$, the effect of the k nodes on node v is $\sum_{j=1}^{k} w(u_j, v) \cdot p_{u_j}^i = w(H_i, v)$; this satisfies the spreading property of the LT model. When S has k + 1 source nodes, $w(H_i, v)$ is the effect after merging k nodes. Merging in the (k + 1)-th source node gives the total effect $w(H_i, v) + w(u_{k+1}, v) \cdot p_{u_{k+1}}^i$, which is the effect from $H_i$ to v according to Algorithm 3. By induction, the lemma follows.
For each graph $G_i$ after the source nodes have been merged, we use Monte Carlo simulation to create $n_i$ sample graphs g from $G_i$ using the live-edge model [10]. Because only nodes reachable from the root $H_i$ matter, we retain for the sample graphs only those trees in which other nodes can be reached from the root node. This removes a significant number of pointless sample graphs and helps the updated values reach closer approximations. From the $n_i$ samples, we form the set $\mathcal{T}_i$ (i = 1, ..., q) containing the $n_i$ trees rooted at $H_i$.
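The merging of sources and the sampling of one tree can be sketched as follows. The sketch assumes that the edge weights passed in are already the topic-specific weights of one graph (so the influence degree is folded into each weight), and it uses the standard live-edge construction for the LT model, in which every node keeps at most one incoming edge, chosen with probability equal to its weight. The function names and the root label "H" are hypothetical.

```python
import random


def merge_sources(in_weights, sources, root="H"):
    """Replace all source nodes by one root whose edge to v carries the
    summed weight of the source edges into v."""
    merged = {}
    for v, nbrs in in_weights.items():
        if v in sources:
            continue
        kept = {u: w for u, w in nbrs.items() if u not in sources}
        from_sources = sum(w for u, w in nbrs.items() if u in sources)
        if from_sources > 0:
            kept[root] = from_sources
        merged[v] = kept
    return merged


def sample_live_edge_tree(in_weights, root, rng):
    """Draw one live-edge sample and return the tree reachable from
    `root` as a dict mapping each node to its list of children."""
    parent = {}
    for v, nbrs in in_weights.items():
        r, acc = rng.random(), 0.0
        for u, w in nbrs.items():
            acc += w
            if r < acc:  # edge (u, v) is the single live edge into v
                parent[v] = u
                break
    children = {}
    for v, u in parent.items():
        children.setdefault(u, []).append(v)
    tree, stack = {root: []}, [root]
    while stack:  # keep only the part reachable from the root
        u = stack.pop()
        for v in children.get(u, []):
            tree[u].append(v)
            tree[v] = []
            stack.append(v)
    return tree


# Hypothetical graph: two sources pointing at a, and a pointing at b.
in_weights = {"a": {"s1": 0.5, "s2": 0.5}, "b": {"a": 1.0}}
merged = merge_sources(in_weights, {"s1", "s2"})
tree = sample_live_edge_tree(merged, "H", random.Random(0))
print(merged, tree)
```

Because every node keeps at most one incoming live edge, the part of a sample reachable from the root is always a tree, which is what makes subtree sizes a convenient way to count the nodes a blocked node would cut off.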
We observe that $f(T_i, u) = |\{v \mid v \in subtree(u)\}|$, and we can compute $f(T_i, u)$ for all nodes $u \in T_i$ using the depth-first search in Algorithm 4. Since the budget B is limited and a node may be present on many different trees, we apply sample average approximation to calculate σ(G, S, A) on the q topics, averaging over the trees in each set $\mathcal{T}_i$.
In Algorithm 5, for the first selection, we select the set $A_1$ in the loop (from line 9 to line 22) by gradually adding a node u to the set $A_1$ in a greedy manner, such that δ(u) reaches the maximum value (line 12). For the second selection, we select the node $u_{max}$ so that $c(u_{max}) \le B$ and the sample average approximation $\hat{\sigma}(u_{max})$ is maximized (line 23). Let $A_1$ be the current solution in loop t; we estimate the gradual increase of the objective function obtained by blocking a node v through $\delta(A_1, v)$ (Eq. 6). After adding u to $A_1$, we go through all trees $T_i \in \mathcal{T}_i$ on all topics to block node u in the trees and update $f(T_i, v)$ on each $T_i \in \mathcal{T}_i$ (lines 17-19) as follows: 1) if v is a descendant of u, we can block it because it is no longer reachable from the root $H_i$; 2) if v is an ancestor of u, $f(T_i, v)$ is reduced by $f(T_i, u)$.
The updating process is illustrated in Fig. 2. We can calculate $f(T_i, H_i) = 10$ and $f(T_i, v_3) = 6$; after blocking $v_3$, we update $f(T_i \setminus \{u\}, H_i) = 10 - 6 = 4$. Finally, the algorithm returns the better of the two candidate solutions $u_{max}$ and $A_1$ by comparing $\hat{\sigma}(u_{max})$ and $\hat{\sigma}(A_1)$.

Algorithm 5 Greedy Extension Algorithm (GEA)
...
5. Generate $n_i$ sample graphs by the live-edge model [10] and create a set $\mathcal{T}_i$ containing $n_i$ trees;
6. For each $T_i \in \mathcal{T}_i$, calculate $f(T_i, u)$ for all $u \in T_i$ by Algorithm 4;
7. end
8. $u_{max} \leftarrow \arg\max_{v \in V,\ c(v) \le B} \hat{\sigma}(v)$;
9. repeat
10. $c_{min} \leftarrow \min_{v \in V} c(v)$;
11. If $c_{min} + c(A_1) > B$ then break;
12. $u \leftarrow \arg\max_{v \in V} \delta(A_1, v)$ (Eq. 6);
13. U ← U \ {u};
14. if $c(A_1) + c(u) \le B$ then
15. ...
16. for i = 1 to q do
17. foreach $T_i \in \mathcal{T}_i$ do
18. If $u \in T_i$ then block node u and update
...
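The subtree-size computation and the update after blocking a node can be sketched as below. The tree is a hypothetical one, chosen only so that the root has 10 nodes in its subtree and v3 has 6, matching the numbers quoted for Fig. 2; the actual shape of the tree in the figure is not known here, and the function names are assumptions.

```python
def subtree_sizes(children, root):
    """f(T, u): number of nodes in the subtree rooted at u, for every u,
    computed with an iterative depth-first traversal."""
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    size = {}
    for u in reversed(order):  # descendants are handled before ancestors
        size[u] = 1 + sum(size[c] for c in children.get(u, []))
    return size


def block_in_tree(children, parent, size, u):
    """Block u: subtract f(T, u) from every ancestor and drop u's subtree."""
    removed = size[u]
    v = parent.get(u)
    while v is not None:
        size[v] -= removed
        v = parent.get(v)
    stack = [u]
    while stack:
        x = stack.pop()
        stack.extend(children.get(x, []))
        size.pop(x, None)
    if u in parent:
        children[parent[u]].remove(u)


# Hypothetical tree with 10 nodes in total and 6 nodes below (and
# including) v3.
children = {
    "H": ["v1", "v3"],
    "v1": ["v2", "v9"],
    "v3": ["v4", "v5"],
    "v4": ["v6", "v7"],
    "v5": ["v8"],
}
parent = {c: p for p, cs in children.items() for c in cs}
size = subtree_sizes(children, "H")
root_before, v3_before = size["H"], size["v3"]
block_in_tree(children, parent, size, "v3")
print(root_before, v3_before, size["H"])
```

Only the ancestors of the blocked node need to be touched, so one update costs time proportional to the depth of the node plus the size of the removed subtree, rather than a fresh simulation.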

V. EXPERIMENTAL RESULTS
In this section, we conduct experiments to show the efficiency of the proposed algorithms IGA and GEA. The proposed algorithms are compared with Degree and Random algorithms on the same setting of the MT-LT model.

A. EXPERIMENT SETTINGS
Datasets and parameter settings: The experiments are performed on three datasets, Gnutella [39], Epinions [40] and NetHepPh [41], which are real networks with sizes of up to tens of thousands of nodes and hundreds of thousands of edges, collected from the source [http://snap.stanford.edu/data/]. Some statistics of the datasets are provided in Table 2.
All the algorithms are implemented in Python. All the experiments are conducted on a computer with an Intel Core i7-8550U 1.8 GHz CPU and 8 GB of DDR4 2400 MHz RAM, running the Linux operating system.
Because it is hard to determine the exact impact weight of node u on node v, following previous research [5], [16], [36], we set the weight of each edge (u, v) to $w(u, v) = 1/|N_{in}(v)|$. This means that each edge makes the same contribution to the activation of a node v, that is, $\sum_{u \in N_{in}(v)} w(u, v) = 1$. The proposed algorithms IGA and GEA are compared with the Degree and Random algorithms. For algorithm IGA, 10,000 MC simulations are performed to estimate the value of the target function σ(·). The algorithm GEA (Algorithm 5) quickly updates the target function value based on depth-first traversal of the tree structure and the approximate average denominator over all trees $T_i$. The algorithm Degree selects nodes in decreasing order of degree and adds them to the set of blocked nodes until the total cost of the selected nodes would exceed B, and the algorithm Random selects random nodes within the limited budget B.
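The weight assignment and the Degree baseline described above can be sketched as follows. The text does not say which notion of degree the baseline uses, so the sketch assumes total degree (incoming plus outgoing edges) with ties broken by node name; both function names and the small edge list are hypothetical.

```python
def uniform_in_weights(edges):
    """Set w(u, v) = 1 / |N_in(v)| for every directed edge (u, v)."""
    in_nbrs = {}
    for u, v in edges:
        in_nbrs.setdefault(v, set()).add(u)
    return {v: {u: 1.0 / len(us) for u in us} for v, us in in_nbrs.items()}


def degree_baseline(edges, cost, budget):
    """Pick nodes in decreasing order of total degree until the next
    node would push the total cost above the budget."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    chosen, spent = [], 0.0
    for v in sorted(degree, key=lambda x: (-degree[x], x)):
        if spent + cost[v] > budget:
            break
        chosen.append(v)
        spent += cost[v]
    return chosen


# Hypothetical edge list and unit costs.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
weights = uniform_in_weights(edges)
unit_cost = {v: 1.0 for v in "abcd"}
picked = degree_baseline(edges, unit_cost, 2.0)
print(weights, picked)
```

The baseline looks only at the network topology and ignores where the misinformation sources are, which is the limitation the comparison in this section is meant to expose.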

B. RESULT
1) EVALUATING ALGORITHMS' EFFICIENCY IN UNIT-COST SETTING
To assess the efficiency of the proposed algorithms, we first conduct experiments under the unit-cost setting. That is, the cost c(u) of blocking a node is 1 for all nodes in all datasets. The efficiency is measured by the average value of the diffusion function σ(G, S, A) of formula 4. Fig. 3a, 3b, 3c show the results of all algorithms. When the budget increases, the average number of activated turns increases as well. As we can see, under the unit-cost setting, GEA has the best efficiency, followed by IGA, and both algorithms outperform Random and Degree by a large margin. In Fig. 3c, we must stop IGA early at budgets larger than 40 because this algorithm takes a lot of time.

2) EVALUATING ALGORITHMS' EFFICIENCY IN GENERAL-COST SETTING
In this experiment, we compare the algorithms with the budget B ranging from 0 to 100 and the node costs c(u) uniformly distributed within the range [1.0, 3.0]. As can be seen in Fig. 3d, 3e, 3f, both GEA and IGA outperform the Random and Degree algorithms. Algorithm GEA is 1.1 to 2.24 times more efficient than algorithm IGA and up to 121 times more efficient than algorithm Degree in terms of the average number of activated turns. The reason is that Degree only uses the topology of the social network and cannot take the impact process of the source nodes into account. We stop IGA early at budgets larger than 40 on the Epinions network dataset because this algorithm takes a lot of time (longer than 72 hours).

3) COMPARING RUNNING TIME
Finally, we compare the running time of the algorithms. Fig. 4a, 4b, 4c and Fig. 4d, 4e, 4f show the running time of the algorithms on the three datasets. The running time increases as the budget increases. The Random and Degree heuristics are very fast thanks to their simple calculations, even on big networks. However, the GEA algorithm also achieves a very competitive running time, owing to the efficient grouping technique and the tree calculation. In all settings, GEA runs up to 196 times faster than IGA. IGA is the slowest algorithm because of its time-consuming MC simulations.

VI. CONCLUSION
In this paper, we introduce the problem of blocking misinformation with multiple topics spreading on social networks under a limited budget. We model the problem as a combinatorial optimization problem based on the LT model, with the additional requirements of multiple topics and a fixed budget for node selection. We propose the MT-LT model to describe the process of multi-topic information spreading by extending the LT model. In this model, information spread is modelled based on different degrees of influence and activation thresholds for each topic. The MMTB problem is formulated under the MT-LT setting. We show that the MMTB problem is NP-hard, that the calculation of the objective function is #P-hard, and that the objective function is monotone and submodular. Based on the monotone and submodular properties of the objective function, we propose an improved greedy algorithm called IGA with approximation ratio $(1-1/\sqrt{e})$. Next, we propose an extended algorithm called GEA based on the MT-LT setting, which applies a top aggregation method, calculates the sample average, and quickly updates the objective function to speed up the algorithm. For those reasons, the proposed algorithm GEA can be applied to medium and large online social networks.

CANH V. PHAM received the Ph.D. degree in computer science from the Vietnam National University, Hanoi (VNU). He is currently a Postdoctoral Researcher with the ORLab, Faculty of Computer Science, Phenikaa University. His research interests include information diffusion problems in social networks, combinatorial optimization, and approximation algorithms.

ANH V. NGUYEN received the Ph.D. degree from Kyoto University, in 2012. He has been working as a Senior Researcher with the Institute of Information Technology, Vietnam Academy of Science and Technology. His research interests include machine learning, graph mining, and social network analysis.