Approximation Algorithm for the Minimum Hub Cover Set Problem

A subset S ⊆ V of vertices of an undirected graph G = (V, E) is a hub cover when, for each edge (u, v) ∈ E, at least one of its endpoints belongs to S, or there exists a vertex r ∈ S that is a neighbor of both u and v. The problem of computing a minimum hub cover set in arbitrary graphs is NP-hard. This problem has applications in indexing large databases. This paper proposes Ψ-MHC, to the best of our knowledge the first approximation algorithm for the minimum hub cover set in arbitrary graphs. The approximation ratio of this algorithm is ln µ, where µ is upper bounded by min{½(Δ + 1)², |E|} and Δ is the degree of G. The execution time of Ψ-MHC is O((Δ + 1)|E| + |S||V|). Experimental results show that Ψ-MHC far outperforms the theoretical approximation ratio for the input graph instances.


I. INTRODUCTION
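Let G = (V, E) be an undirected graph. A subset S ⊆ V is a hub cover if, for each edge (u, v) ∈ E, at least one of the following conditions holds: (i) u ∈ S or v ∈ S; (ii) there exists a vertex r ∈ S that is a neighbor of both u and v.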
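The definition can be checked directly on small graphs. The following is a minimal sketch, assuming the graph is given as a dictionary that maps each vertex to the set of its neighbors; the function name `is_hub_cover` and the two example graphs are illustrative and do not come from the paper.

```python
def is_hub_cover(adj, S):
    """Return True if S is a hub cover of the undirected graph adj.

    adj maps each vertex to the set of its neighbors (no self-loops).
    """
    S = set(S)
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                if u in S or v in S:
                    continue  # condition (i): an endpoint is in S
                if adj[u] & adj[v] & S:
                    continue  # condition (ii): a common neighbor is in S
                return False
    return True


# Illustrative graphs: a triangle and a path on three vertices.
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
path = {1: {2}, 2: {1, 3}, 3: {2}}

print(is_hub_cover(triangle, {1}))  # vertex 1 is a hub for edge (2, 3)
print(is_hub_cover(path, {1}))      # edge (2, 3) stays uncovered
```

In the triangle, a single vertex suffices because it is adjacent to both endpoints of the opposite edge; in the path, no such common neighbor exists.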
Computing a hub cover set has important applications in diverse areas, including query processing of large databases [1], [2], and graph isomorphism [3]. The smaller the cardinality of a hub cover set, the faster the query search in large databases [2]. The minimum hub cover set problem consists of computing a hub cover set of the smallest cardinality for a given undirected graph. Given an integer k, the decision version of the hub cover set problem asks whether a hub cover set of cardinality at most k exists. Yelbay et al. proved that the minimum hub cover set problem is NP-hard in triangle-free graphs (graphs containing no cliques of size three). We extend this result to graphs of girth 3 (graphs containing triangles).
The associate editor coordinating the review of this manuscript and approving it for publication was Bilal Alatas.
Notice the similarity between computing a minimum hub cover set and computing a minimum vertex cover set, a well-known NP-hard problem [4]. Recall that for a given undirected graph G = (V, E), a vertex cover is a set C ⊆ V such that for each edge (u, v) ∈ E, at least one of the vertices u and v belongs to C. Computing a hub cover set in triangle-free graphs is equivalent to computing a vertex cover set in such graphs.
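The relation between the two problems can be observed by brute force on tiny graphs. The sketch below is illustrative only (exponential time, hypothetical function names): on a triangle-free 4-cycle both minima coincide, while on a triangle the hub cover is strictly smaller.

```python
from itertools import combinations


def edges_of(adj):
    """List each undirected edge (u, v) once, with u < v."""
    return [(u, v) for u in adj for v in adj[u] if u < v]


def covers(adj, S, hubs_allowed):
    """Check whether S covers every edge; hubs_allowed enables condition (ii)."""
    for u, v in edges_of(adj):
        if u in S or v in S:
            continue
        if hubs_allowed and (adj[u] & adj[v] & S):
            continue
        return False
    return True


def min_cover_size(adj, hubs_allowed):
    """Smallest k such that some k-subset covers all edges (brute force)."""
    vertices = sorted(adj)
    for k in range(len(vertices) + 1):
        for S in combinations(vertices, k):
            if covers(adj, set(S), hubs_allowed):
                return k


cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}  # triangle-free
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}

for name, g in (("cycle4", cycle4), ("triangle", triangle)):
    print(name, min_cover_size(g, False), min_cover_size(g, True))
```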
Several algorithms deal with the minimum vertex cover problem. Recently, Wang et al. [5] proposed an exact algorithm based on branch-and-bound that solves this problem in arbitrary graphs. Although they do not present an analysis of the algorithm's execution time, their algorithm is time-consuming even for graphs of small size (< 1000 vertices). Computing a vertex cover helps to solve some NP-hard problems in graphs. For instance, Cygan and Pilipczuk [6] used the best exact algorithm for the vertex cover to design an FPT exact exponential-time algorithm for a problem similar to the vertex cover.
Approximation algorithms [7] are polynomial-time algorithms that produce solutions provably close to the optimal in a fraction of the time required by an exact algorithm. Several NP-hard graph problems are addressed with approximation algorithms. For the case of the vertex cover, some heuristics exist to compute approximate solutions. Bhattacharya et al. [8] designed a (2 + ε)-approximate vertex cover algorithm running in O(log³ n) time. Karakostas [9] provided a (2 − o(1))-approximate minimum vertex cover for an arbitrary graph. Dinur et al. [10] proved that approximating the minimum vertex cover in polynomial time within a factor smaller than 1.3606 is NP-hard. Later, Khot and Regev [11] showed that approximating the minimum vertex cover within a factor of (2 − ε) is NP-hard for any constant ε > 0, assuming the Unique Games Conjecture.
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There are only a few algorithms related to the minimum hub cover set problem. Yelbay et al. [12] proposed an approximation algorithm for this problem on planar graphs. Their algorithm transforms the planar graph G into a λ-outerplanar graph G′. Then, it decomposes G′ into a set of small outerplanar subgraphs. After that, it uses a solver to compute a minimum hub cover set in each subgraph of G′. Finally, it merges the solutions. The approximation factor of their algorithm is (λ + 1)/λ. Yelbay's Ph.D. dissertation [13] provides further information about the minimum hub cover set. In particular, Yelbay also proved that computing a minimum hub cover set on triangle-free graphs is an NP-hard problem. In the present paper, we show that the decision version of this problem is also NP-complete for graphs of girth 3.
This paper proposes Ψ-MHC, the first approximation algorithm for the minimum hub cover set in arbitrary graphs. This algorithm does not use a solver to execute subroutines. We organized the rest of the paper as follows. Section II presents the notation and some concepts related to Ψ-MHC. Section III describes Ψ-MHC. Section IV proves the correctness of Ψ-MHC and analyzes its execution time. Section V proves that the minimum hub cover set problem on arbitrary graphs is NP-complete. Section VI describes the experimental design. Section VII analyses and discusses the experimental results. Finally, Section VIII presents some concluding remarks.

II. TERMINOLOGY AND NOTATION
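The degree of a vertex v ∈ V, denoted by δ(v), is the cardinality of its neighborhood. The degree of G, denoted by Δ, is the maximum value of δ(v) over all v ∈ V. Let F(v) be the set of edges that vertex v can cover; specifically,

\[ F(v) = \{ (x, y) \in E : v \in \{x, y\} \ \lor \ \bigl( (v, x) \in E \land (v, y) \in E \bigr) \}. \]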
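As an illustration of this notation, the sets F(v) and their counterparts R(e) (the vertices that can cover edge e, used in Section III) can be computed from adjacency sets. This is a minimal sketch with hypothetical names (`coverage_sets`, `triangle_with_tail`); it uses set intersection and is not the INIT procedure of the paper.

```python
def coverage_sets(adj):
    """Return (F, R): F[v] = edges v can cover, R[e] = vertices covering e."""
    F = {v: set() for v in adj}
    R = {}
    for u in adj:
        for w in adj[u]:
            if u < w:  # each undirected edge once
                e = (u, w)
                # endpoints plus common neighbors of u and w
                R[e] = {u, w} | (adj[u] & adj[w])
                for v in R[e]:
                    F[v].add(e)
    return F, R


# Illustrative graph: triangle 1-2-3 with a pendant vertex 4 attached to 3.
triangle_with_tail = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
F, R = coverage_sets(triangle_with_tail)
print(sorted(F[3]))    # vertex 3 can cover every edge of this graph
print(sorted(R[(3, 4)]))  # only the endpoints can cover the pendant edge
```

By construction, e ∈ F(v) holds exactly when v ∈ R(e).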

III. APPROXIMATION ALGORITHM FOR THE MINIMUM HUB COVER SET
This section describes Ψ-MHC. This algorithm receives as input an arbitrary undirected graph G = (V, E) and returns a hub cover set S ⊆ V. Ψ-MHC follows a greedy strategy: in each iteration, it includes in S the vertex that can cover the greatest number of uncovered edges. After including a vertex in S, the algorithm marks the edges covered by that vertex as covered.
The algorithm uses a max-priority queue Q, based on a max-heap, to efficiently manage the number of uncovered edges in each set F(v) for all v. We use the following variables during the description of the algorithm. We assume that each vertex has a unique identifier from the set {1, . . . , |V |}. Some of the variables have additional information (named satellite data). We use a superscript to refer to the satellite data and emphasize this information in some parts of the algorithm.
• The set S stores the vertices of the computed hub cover set.
• The set F(v) contains all the edges vertex v can cover.
• The set R(e) contains all the vertices that can cover edge e. Each vertex v stored in R(e) has edge e as satellite data, which we denote by v^{+e}.
• The Boolean variable cover(e) is true if some vertex covers the edge e; otherwise, this variable is false.
• The list Adj(v) stores the set of vertices adjacent to vertex v.
• The array E_τ stores the number of uncovered edges of the set F(v), for all v ∈ V, at the end of iteration τ.
• The array L_v stores, in non-decreasing order, the vertices adjacent to vertex v.
• The array L stores the output of algorithm MERGE.
• The arrays T and T store vertices. Each vertex is attached to an edge as satellite data.
• The array U stores Boolean values. The value of U (v) is true if during the current iteration, at least one edge in F(v) was covered; otherwise, its value is false.
• The expression +e(v) indicates the satellite-data edge associated to vertex v.
The following functions and algorithms facilitate the description of Ψ-MHC.
• The algorithm MERGE takes as input two sorted arrays and returns a single sorted array in O(|M|) time, where M is the size of the output array.
• The function EXTRACT-MAX extracts and removes the element with the highest priority in Q in O(log |V|) time.
• The function BUILD-MAX-HEAP constructs a max-heap Q with the elements of array E_τ in O(|V|) time.
• The function DECREASE-KEY decreases in Q the priority of an element i to the value E_τ(i) in O(log |V|) time.
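One way MERGE can serve the computation of R(e) is sketched below, under the assumption that the sorted adjacency arrays L_x and L_y of the endpoints of e = (x, y) are merged and that equal adjacent entries in the merged array identify common neighbors. This reading is an assumption based on the surviving pseudocode lines; the function names are hypothetical.

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list in O(len(a) + len(b)) time."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def common_neighbors(lx, ly):
    """Vertices present in both sorted, duplicate-free adjacency lists."""
    merged = merge(lx, ly)
    # A vertex adjacent to both endpoints appears twice in a row after merging.
    return [merged[i] for i in range(len(merged) - 1) if merged[i] == merged[i + 1]]


print(merge([1, 3, 5], [2, 3, 6]))
print(common_neighbors([1, 3, 5], [2, 3, 6]))
```

Because each adjacency array has at most Δ entries, one such scan costs O(Δ) per edge.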

Algorithm 1 Pseudocode of INIT(G)
Require: An undirected graph G = (V, E).
Ensure: Initializes most of the variables of Ψ-MHC, computes the set F(u) for each u ∈ V, and creates the max-priority queue Q.
10: cover(e) ← false
11: R(e) ← {x^{+e} ∪ y^{+e}}
12: L ← MERGE(L_x, L_y)
13: for i ← 1 to |L| − 1 do

Algorithm 1 presents the pseudocode of INIT. Lines 1-8 initialize most of the variables of the algorithm. The cycle of Lines 9-18 is the most time-consuming part of INIT. This cycle computes the set R(e), the vertices covering edge e, for all e ∈ E. Then, Lines 19-23 generate the set F(v), the set of edges that vertex v can cover, for each v ∈ V. Notice that e ∈ F(v) if and only if v ∈ R(e). Finally, Lines 24-27 construct the max-priority queue Q. This queue stores all the vertices of V that are not in S. The vertex with the highest priority in Q is the one that can cover the greatest number of uncovered edges.
Algorithm 2 presents the pseudocode of Ψ-MHC. This algorithm aims to cover all the edges of E by systematically adding vertices to the hub cover set S. Each iteration of the while cycle of Lines 2-21 proceeds as follows.
First, since this algorithm follows a greedy approach, Lines 3-5 select the vertex v that covers the greatest number of uncovered edges and add it to S. After that, the cycle of Lines 10-19 marks as 'covered' each uncovered edge e in the set F(v). Then, for each newly covered edge e, the cycle also marks e in the set F(u), where u is a vertex that, like v, also covers e. Finally, Line 20 incorporates in Q the changes produced during the previous cycle. Ψ-MHC ends when the set E is empty; i.e., when the vertices in S cover all the edges in E.
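The greedy strategy described above can be sketched compactly. The version below is a simplification, not the authors' pseudocode: it replaces the max-priority queue Q with a linear scan over the vertices, computes R(e) by set intersection instead of MERGE, and breaks ties by the smallest vertex identifier (an assumption). The name `greedy_hub_cover` and the example graphs are illustrative.

```python
def greedy_hub_cover(adj):
    """Greedy hub cover: repeatedly pick the vertex covering most uncovered edges."""
    # Build R(e) and F(v) as defined in Sections II and III.
    R, F = {}, {v: set() for v in adj}
    for u in adj:
        for w in adj[u]:
            if u < w:
                e = (u, w)
                R[e] = {u, w} | (adj[u] & adj[w])
                for v in R[e]:
                    F[v].add(e)

    uncovered = {v: len(F[v]) for v in adj}  # plays the role of the array E_tau
    covered = {e: False for e in R}
    remaining = len(R)
    order = sorted(adj)  # fixed scan order; max() keeps the first maximum
    S = []
    while remaining > 0:
        v = max(order, key=lambda x: uncovered[x])
        S.append(v)
        for e in F[v]:
            if not covered[e]:
                covered[e] = True
                remaining -= 1
                for u in R[e]:  # every vertex able to cover e loses one edge
                    uncovered[u] -= 1
    return S


triangle_with_tail = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
path5 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(greedy_hub_cover(triangle_with_tail))
print(greedy_hub_cover(path5))
```

A selected vertex always ends with zero uncovered edges, so the scan never picks it again while uncovered edges remain.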

IV. CORRECTNESS AND ANALYSIS OF Ψ-MHC
This section presents the correctness and complexity proofs of Ψ-MHC. First, Lemma 1 proves that Ψ-MHC correctly computes a hub cover set S. Then, Lemma 3 shows that the approximation ratio of this algorithm is ln µ, where µ represents the cardinality of the largest set F(v) for all v ∈ V. Finally, Lemma 6 establishes that the execution time of Ψ-MHC is O((Δ + 1)|E| + |S||V|).
BUILD-MAX-HEAP(Q, E_τ)
9: else
10: for i ← 1 to |V| do
11: if U(i) = true then
Proof (of Lemma 1): By contradiction. Assume that after executing Ψ-MHC on G, the generated set S is not a hub cover set. This assumption implies that at least one edge (x, y) ∈ E is not covered. Notice that Line 13 of the algorithm only removes an edge from E if some vertex of S has covered it, so Ψ-MHC cannot end while E ≠ ∅. Therefore, there exists a contradiction, and S is a hub cover set.
The cost of the solution given by Ψ-MHC is |S|; i.e., one unit per vertex added to S. When Ψ-MHC adds a vertex v_i to S at iteration i, v_i covers a set of edges A_i ⊆ E for the first time; i.e., A_i is the subset of edges of F(v_i) that are still uncovered just before iteration i. Eq. 1 gives an expression for A_i.

\[ A_i = F(v_i) \setminus \bigcup_{j=1}^{i-1} F(v_j) \tag{1} \]
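We distribute the unit cost of selecting v_i evenly among the edges of A_i, so each edge e ∈ A_i receives the cost c_e given by Eq. 2.

\[ c_e = \frac{1}{|A_i|}, \quad e \in A_i \tag{2} \]

For a vertex v ∈ V, E_{i−1}(v) − E_i(v) is the number of edges in F(v) that Ψ-MHC covered at iteration i. Then, Eq. 3 establishes the cost to cover all the edges in F(v).

\[ \sum_{e \in F(v)} c_e = \sum_{i=1}^{|S|} \bigl( E_{i-1}(v) - E_i(v) \bigr) \frac{1}{|A_i|} \tag{3} \]

Since Ψ-MHC selects at iteration i the vertex with the greatest number of uncovered edges, |A_i| ≥ E_{i−1}(v) for every v ∈ V. Restricting the sum to the iterations with E_{i−1}(v) > 0,

\[ \sum_{e \in F(v)} c_e \le \sum_{i} \frac{E_{i-1}(v) - E_i(v)}{E_{i-1}(v)} \le \sum_{i} \bigl( H_{E_{i-1}(v)} - H_{E_i(v)} \bigr) \tag{4} \]

therefore, since the last sum telescopes, E_0(v) = |F(v)|, and H_0 = 0,

\[ \sum_{e \in F(v)} c_e \le H_{|F(v)|} \tag{5} \]

Lemma 3: Let G = (V, E) be an undirected graph. Let S* be a minimum hub cover set in G, and let S be the hub cover set computed by Ψ-MHC. Then, |S| ≤ H_µ |S*|, where µ is the cardinality of the largest set F(v) for all v ∈ V.
Proof: Let v_i be the vertex added to S at iteration i. Notice that Eq. 6 computes the value of E_{i−1}(v_i) at the end of iteration i − 1.

\[ E_{i-1}(v_i) = \Bigl| F(v_i) \setminus \bigcup_{j=1}^{i-1} F(v_j) \Bigr| = |A_i| \tag{6} \]

Eq. 7 computes the cost to cover all the edges of F(v_i) at iteration i.

\[ \sum_{e \in A_i} c_e = |A_i| \cdot \frac{1}{|A_i|} = 1 \tag{7} \]

consequently,

\[ |S| = \sum_{i=1}^{|S|} \sum_{e \in A_i} c_e = \sum_{e \in E} c_e \tag{8} \]

Since the vertices of S* cover each edge e ∈ E at least once, then

\[ \sum_{e \in E} c_e \le \sum_{v \in S^*} \sum_{e \in F(v)} c_e \tag{9} \]

Obtain Eq. 10 by substituting Eq. 8 in Eq. 9,

\[ |S| \le \sum_{v \in S^*} \sum_{e \in F(v)} c_e \tag{10} \]

obtain Eq. 11 by substituting Eq. 5 in Eq. 10,

\[ |S| \le \sum_{v \in S^*} H_{|F(v)|} \tag{11} \]

finally,

\[ |S| \le H_\mu |S^*| \tag{12} \]

where µ is the cardinality of the largest set F(v) for all v ∈ V. Lemma 4 proves that the value of µ is bounded by min{½(Δ + 1)², |E|}.
Lemma 4: Let µ be the cardinality of the largest set F(v) for all v ∈ V. Then, µ ≤ min{½(Δ + 1)², |E|}.
Proof: Let δ(v) be the degree of an arbitrary vertex v ∈ V. The set F(v) contains the δ(v) edges incident to v and the crossing edges; i.e., the edges whose two endpoints are neighbors of v. Notice that the number of crossing edges is at most δ(v)(δ(v) − 1)/2. Hence,

\[ |F(v)| \le \delta(v) + \frac{\delta(v)(\delta(v) - 1)}{2} = \frac{\delta(v)(\delta(v) + 1)}{2} \le \frac{1}{2} (\Delta + 1)^2 . \]

Moreover, F(v) ⊆ E, so |F(v)| ≤ |E|. Therefore, µ ≤ min{½(Δ + 1)², |E|}.
For the remaining portion of Ψ-MHC, the procedure to cover all edges of E also takes O((Δ + 1)|E|) time. Notice that Line 15 decreases by one unit the value of E_τ(u), and the algorithm can only end when the variables E_τ(u) are equal to zero for all u ∈ V. UPDATE-KEY (Line 20) requires O(|V|) time per iteration, and the algorithm performs |S| iterations, which adds O(|S||V|) time in total.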
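The quantities in Lemmas 3 and 4 can be evaluated numerically for a concrete graph. The following sketch computes µ as the largest |F(v)|, the degree Δ, the bound min{½(Δ + 1)², |E|}, and the harmonic number H_µ, which satisfies H_µ ≤ ln µ + 1. The function names and the example graph are illustrative assumptions, not part of the paper.

```python
import math


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def mu_delta_bound(adj):
    """Return (mu, Delta, min{(Delta + 1)^2 / 2, |E|}) for adjacency sets adj."""
    num_edges = sum(len(nb) for nb in adj.values()) // 2
    delta = max(len(nb) for nb in adj.values())
    mu = 0
    for v, nb in adj.items():
        # crossing edges: both endpoints are neighbors of v
        crossing = sum(1 for x in nb for y in adj[x] if x < y and y in nb)
        mu = max(mu, len(nb) + crossing)
    return mu, delta, min((delta + 1) ** 2 / 2, num_edges)


triangle_with_tail = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
mu, delta, bound = mu_delta_bound(triangle_with_tail)
print(mu, delta, bound)
print(harmonic(mu), math.log(mu) + 1)
```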

V. NP-HARDNESS OF THE MINIMUM HUB COVER SET PROBLEM
Yelbay [13] proved that the minimum hub cover set problem is NP-hard for triangle-free graphs. This section proves that the decision version of this problem is NP-complete, even if the input graph has triangles. The idea of the proof is to reduce the 3-SAT problem to the decision version of the hub cover set problem. Let φ be a Boolean formula in conjunctive normal form with n Boolean variables and m clauses (each containing exactly three different Boolean variables). The 3-satisfiability problem (3-SAT) asks whether an assignment exists that makes φ true. The 3-SAT problem is NP-complete [14].
Lemma 7: The decision version of the hub cover set problem is in the NP class.
Proof: Let G = (V, E) be an undirected graph and let k be a positive integer. Let the pair ⟨G, k⟩ be an instance of the decision hub cover set problem. Assume that S ⊆ V is a certificate for such an instance. It is possible to use a modification of Ψ-MHC to verify this certificate. In this algorithm, instead of extracting vertices from Q, extract them from S and verify that |S| ≤ k and that every edge is covered. Such an algorithm runs in polynomial time, as shown in Section IV. Therefore, the decision hub cover set problem is in the NP class.
Each clause c_j has the form (b_{j,1} ∨ b_{j,2} ∨ b_{j,3}), with b_{j,i} ∈ U for 1 ≤ i ≤ 3 and 1 ≤ j ≤ m. Each variable b_{j,i} can be non-negated or negated (denoted by b̄_{j,i}) in c_j. We present an algorithm to reduce φ to an instance of a hub cover set of size k = n + 2m.
Notice that the time required to transform the 3-SAT instance φ into an instance G_φ of a hub cover set of size k is polynomial in n and m since |V_U| = 2n, |V_D| = 4m, |E_U| = n, |E_D| = 3m, and |E_UD| = 4m. Now, we prove that a hub cover set of size exactly k = n + 2m exists in G_φ if and only if φ is satisfiable.
First, assume that there exists a satisfying assignment for φ. Let S_U ⊆ V_U be such an assignment, where |S_U| = n. We need to show that for each graph C_{c_j} ⊆ G_φ, there exists a hub cover set containing exactly two vertices from the set {a_{j,1}, a_{j,2}, a_{j,2}, a_{j,3}} and at least one vertex from the set {b_{j,1}, b_{j,2}, b_{j,3}} (otherwise, φ is not satisfiable). There are three cases to consider:
• Case 1. Only one vertex from the set {b_{j,1}, b_{j,2}, b_{j,3}} belongs to S_U. There are two subcases:
  - Subcase 1. Only one vertex from the set {b_{j,1}, b_{j,3}} belongs to S_U. Without loss of generality, assume that b_{j,1} ∈ S_U. Then, both vertices {a_{j,2}, a_{j,3}} must also be in S_U to cover all the edges in C_{c_j}.
  - Subcase 2. The vertex b_{j,2} ∈ S_U. Then, both vertices {a_{j,1}, a_{j,3}} must also be in S_U to cover all the edges in C_{c_j}.
• Case 2. Exactly two vertices from the set {b_{j,1}, b_{j,2}, b_{j,3}} belong to S_U. There are two subcases:
  - Subcase 1. Both vertices {b_{j,1}, b_{j,3}} belong to S_U. Then, only one of the following three sets {a_{j,1}, a_{j,2}}, {a_{j,2}, a_{j,2}}, or {a_{j,2}, a_{j,3}} must also be in S_U to cover all the edges in C_{c_j}.
  - Subcase 2. Vertex b_{j,2} ∈ S_U and only one vertex from the set {b_{j,1}, b_{j,3}} is in S_U. Without loss of generality, assume that b_{j,1} ∈ S_U. Then, vertex a_{j,3} and one vertex from the set {a_{j,1}, a_{j,2}} must also be in S_U to cover all the edges in C_{c_j}.
• Case 3. The three vertices from the set {b_{j,1}, b_{j,2}, b_{j,3}} belong to S_U. Then, only one of the following three sets {a_{j,1}, a_{j,2}}, {a_{j,2}, a_{j,2}}, or {a_{j,2}, a_{j,3}} must also be in S_U to cover all the edges in C_{c_j}.
In Cases 1 to 3, each clause c_j ∈ C contains exactly two vertices from the set {a_{j,1}, a_{j,2}, a_{j,2}, a_{j,3}}. Therefore, there exists a hub cover set of size exactly k = n + 2m in G_φ.
Now, assume that there exists a hub cover set S in G_φ of size exactly k = n + 2m. We need to show that it is possible to obtain a satisfying assignment from S. Assume by contradiction that it is not possible to obtain a satisfying assignment from S.
Notice that the set S contains at least n vertices from V_U; otherwise, there exists one edge from E_U that is not covered. By the first part of this proof, each subgraph C_{c_j} contains exactly two vertices from the set {a_{j,1}, a_{j,2}, a_{j,2}, a_{j,3}} in S.
Since φ is unsatisfiable, there exists at least one clause c_g ∈ C that is unsatisfiable. Thus, no vertex from the set {b_{g,1}, b_{g,2}, b_{g,3}} belongs to S. Then, the hub cover set S must include at least three vertices from the set {a_{g,1}, a_{g,2}, a_{g,2}, a_{g,3}} to cover all the edges in C_{c_g}, generating a hub cover set of size n + 2m + 1, which contradicts the initial assumption. Therefore, the hub cover set S represents a satisfying assignment to φ, and 3-SAT can be polynomially reduced to the hub cover set of size k = n + 2m.
Theorem 1: Let G = (V, E) be an undirected graph and let k be a positive integer. The problem of deciding whether G has a hub cover set of size at most k is NP-complete.
Proof: The proof is a consequence of Lemmas 7 and 8.

VI. EXPERIMENTAL DESIGN
This section presents a set of numerical experiments that evaluates the quality of the hub cover sets generated by Ψ-MHC. Section VI-A presents a linear programming formulation of the MHC problem. Section VI-B describes the graph instances used by the experiments. Section VI-C presents the metrics to compare the quality of Ψ-MHC against Gurobi. Section VI-D describes the computing environment.

A. MHC LINEAR PROGRAMMING MODEL
Given an undirected graph G = (V, E), Eqs. 13-15 [12] represent the integer linear programming formulation to compute the minimum hub cover set, ILP-MHC.

\[ \min \sum_{v \in V} x_v \tag{13} \]

subject to

\[ x_u + x_v + \sum_{w \in K(u, v)} x_w \ge 1, \quad \forall (u, v) \in E \tag{14} \]

\[ x_v \in \{0, 1\}, \quad \forall v \in V \tag{15} \]

The function in Eq. 13 minimizes the number of selected vertices in the hub cover set. In this equation, the binary variable x_v is equal to one only if vertex v belongs to the hub cover set. The constraint in Eq. 14 ensures that every edge (u, v) ∈ E is covered by at least one vertex in the hub cover set. For (u, v) ∈ E, the set K(u, v) denotes all the vertices w ∈ V such that (u, w) ∈ E and (v, w) ∈ E; i.e., K(u, v) = R(u, v) \ {u, v}. Finally, Eq. 15 restricts each variable x_v to be binary.
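To make the formulation concrete, the sketch below writes ILP-MHC for a given graph as text in the CPLEX LP file format, which solvers such as Gurobi can read. It is an illustration under stated assumptions: the function name `hub_cover_lp`, the variable names `x<v>`, and the constraint names `c<i>` are hypothetical, and the paper does not state how the model was passed to the solver.

```python
def hub_cover_lp(adj):
    """Return ILP-MHC (Eqs. 13-15) for adjacency sets adj as LP-format text."""
    vertices = sorted(adj)
    lines = ["Minimize"]
    # Eq. 13: minimize the number of selected vertices.
    lines.append(" obj: " + " + ".join(f"x{v}" for v in vertices))
    lines.append("Subject To")
    idx = 0
    for u in vertices:
        for v in sorted(adj[u]):
            if u < v:
                idx += 1
                # Eq. 14: endpoints plus K(u, v), the common neighbors.
                members = sorted({u, v} | (adj[u] & adj[v]))
                terms = " + ".join(f"x{w}" for w in members)
                lines.append(f" c{idx}: {terms} >= 1")
    # Eq. 15: every variable is binary.
    lines.append("Binary")
    lines.append(" " + " ".join(f"x{v}" for v in vertices))
    lines.append("End")
    return "\n".join(lines)


path = {1: {2}, 2: {1, 3}, 3: {2}}
print(hub_cover_lp(path))
```

For a triangle, all three constraints coincide because every vertex can cover every edge.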
We used Gurobi, a solver widely used in industry and academia, to optimally solve ILP-MHC. The design of Gurobi exploits multicore architectures to parallelize computations. It solves several types of problems, such as linear programming (LP), mixed-integer programming (MIP), and quadratic programming (QP). It uses advanced algorithms and heuristics like simplex, parallel barrier with crossover, concurrent and sifting, and cutting planes. However, as with other NP-hard problems, optimally solving large-scale MHC instances proves very difficult, even for a solver.

B. GRAPH INSTANCES
We generated a set I of 40 graph instances using the Erdős-Rényi model and divided it into four subsets of ten graphs, denoted by I_2k, I_25k, I_50k, and I_100k.
The instances used are diverse enough to allow us to show the performance and quality of our algorithm. In particular, such instances include dense and sparse graphs whose order ranges from 10 to 1,000 nodes and size from 18 to 100,000 edges.
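A random instance with a prescribed order and size can be produced as sketched below. The sketch assumes the G(n, m) variant of the Erdős-Rényi model (a uniformly random set of m edges); the paper does not state which variant or which generator was used, and the function name `gnm_random_graph` is hypothetical.

```python
import random
from itertools import combinations


def gnm_random_graph(n, m, seed=None):
    """Return adjacency sets of a random graph with n vertices and m edges."""
    rng = random.Random(seed)
    pairs = list(combinations(range(1, n + 1), 2))  # all possible edges
    if m > len(pairs):
        raise ValueError("too many edges for a simple graph on n vertices")
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in rng.sample(pairs, m):  # m distinct edges, chosen uniformly
        adj[u].add(v)
        adj[v].add(u)
    return adj


g = gnm_random_graph(10, 18, seed=1)
print(sum(len(nb) for nb in g.values()) // 2)  # number of edges
```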
• Set I_2k. The graph instances in this set contain few vertices. The order varies from 10 to 100 in increments of 10, and the size from 18 to 2000. With these graphs, we aim to verify that the observed performance of Ψ-MHC outperforms the approximation ratio. Gurobi could obtain the minimum hub cover set only for this set of graphs.
• Sets I_25k, I_50k, and I_100k. Each of these sets contains graphs with vertices varying from 100 to 1000 in increments of 100. The number of edges varies in each set: between 18 and 2000 for I_25k, between 251 and 24999 for I_50k, and between 488 and 99810 for I_100k. With these graphs, we aim to evaluate the performance of Ψ-MHC.
TABLE 1. Gap percentage between the result produced by Ψ-MHC and the best result from Gurobi within the time limit. The symbol * indicates that the cardinality of the hub cover set obtained by Ψ-MHC is smaller than the one computed by Gurobi.
Let |S|_{G_i} and |G|_{G_i} be the cardinalities of the hub cover sets obtained in graph G_i ∈ I by Ψ-MHC and by Gurobi within a time limit of 10,000 seconds, respectively. Let |S*|_{G_i} be the cardinality of the minimum hub cover set in G_i. Notice that Gurobi computes the minimum hub cover set only if G_i ∈ I_2k. In such a case, |G|_{G_i} = |S*|_{G_i}; for the remaining instances, |G|_{G_i} ≥ |S*|_{G_i}.

C. METRICS
We use the following metrics to evaluate the performance of Ψ-MHC.
• The observed performance compares the theoretical upper bound of Ψ-MHC (proved in Lemma 3) against its empirical performance. It is not feasible to obtain the cardinality of the minimum hub cover set in most of the graphs in I within the time limit. Therefore, we only compare the observed performance on the instances of I_2k.
• The quality of the solution compares the cardinality of the hub cover set obtained by Ψ-MHC versus the one obtained by Gurobi.
• The execution time compares the time required by Ψ-MHC to compute a hub cover set against the time required by Gurobi.
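The first two metrics reduce to simple ratios. The sketch below is illustrative: the gap formula 100 · (|S| − |G|) / |G| is an assumption, since the text does not define the gap percentage explicitly, and the numbers in the usage lines are hypothetical values, not results from the experiments.

```python
def harmonic(n):
    """n-th harmonic number, the factor of the bound in Lemma 3."""
    return sum(1.0 / k for k in range(1, n + 1))


def observed_ratio(size_alg, size_opt):
    """Empirical ratio |S| / |S*|, to be compared against H_mu."""
    return size_alg / size_opt


def gap_percentage(size_alg, size_ref):
    """Assumed gap definition: relative difference to the solver, in percent.

    A negative value means the algorithm found a smaller hub cover.
    """
    return 100.0 * (size_alg - size_ref) / size_ref


# Hypothetical values for illustration only.
print(observed_ratio(21, 20), harmonic(50))
print(gap_percentage(21, 20))
print(gap_percentage(19, 20))
```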

D. COMPUTING ENVIRONMENT
We performed these experiments on an Ubuntu server with an Intel(R) Xeon(R) Gold 5222 CPU, 20 cores, and 40 GB of RAM. We implemented ILP-MHC in Gurobi and executed it using only one core.

VII. ANALYSIS OF EXPERIMENTAL RESULTS
This section analyses and discusses the hub cover sets generated by Ψ-MHC on the instances described in Section VI-B from three different perspectives.
A. OBSERVED PERFORMANCE
Fig. 3a depicts the cardinalities of the hub cover sets obtained by Gurobi and Ψ-MHC, together with the theoretical upper bound (|G|_{G_i}, |S|_{G_i}, and H_µ |S*|_{G_i}, respectively) for each graph G_i ∈ I_2k. Recall that µ is bounded by min{½(Δ + 1)², |E|}, where Δ is the degree of G_i. Observe that |S|_{G_i} is very close to |S*|_{G_i} for each G_i, and |S|_{G_i} is closer to |S*|_{G_i} than to H_µ |S*|_{G_i}. Therefore, the empirical performance of Ψ-MHC far outperforms the approximation ratio for instances in I_2k. Fig. 3b presents the execution times required by Ψ-MHC and Gurobi to compute their corresponding hub cover sets for each G_i. Notice that for the smallest instances, the execution times of Gurobi and Ψ-MHC are similar; however, when the number of edges surpasses 1000, Gurobi requires several minutes to obtain the minimum hub cover set, whereas the execution time required by Ψ-MHC remains practically constant.

B. QUALITY OF THE SOLUTION
Gurobi could not obtain the minimum hub cover set for large instances. For such instances, we compared the solution generated by Gurobi within a time limit of 10,000 seconds versus the one generated by Ψ-MHC. Figs. 4a, 5a, and 6a plot the cardinality of the solution generated by Ψ-MHC and Gurobi for I_25k, I_50k, and I_100k, respectively. Table 1 details the gap percentage between the cardinality of the hub cover obtained by Ψ-MHC and the one computed by Gurobi within the time limit.
In general, the results show that the solution obtained by Ψ-MHC is quite close to the one of Gurobi, particularly for small and large instances; e.g., in some instances in I_2k and in I_100k, both Gurobi and Ψ-MHC obtained the same cardinality. Moreover, in some large instances, Ψ-MHC obtained a smaller hub cover than the one obtained by Gurobi within the time limit.

C. EXECUTION TIME OF THE ALGORITHMS
Figs. 4b, 5b, and 6b plot the execution times required by both Ψ-MHC and Gurobi to compute hub cover sets. Observe that the cardinality obtained by Gurobi represents the hub cover set obtained after 10,000 seconds. In contrast, Ψ-MHC computes good hub cover sets in a fraction of the time required by Gurobi. The results in these plots show that the execution times of Ψ-MHC are much shorter than those of Gurobi, with solutions of reasonably good quality, or even better ones, as depicted by Fig. 6a.

VIII. CONCLUDING REMARKS
This paper proposes Ψ-MHC, the first approximation algorithm for the minimum hub cover set problem in arbitrary graphs. We show that the approximation ratio of Ψ-MHC is ln µ, where µ is bounded by min{½(Δ + 1)², |E|}. The time complexity of Ψ-MHC is O((Δ + 1)|E| + |S||V|).
We provided experimental results showing the observed performance of Ψ-MHC. The experiments show that Ψ-MHC far outperforms the theoretical approximation ratio, in a fraction of the time required by Gurobi to obtain approximate solutions. In addition, in some instances, the solution of Ψ-MHC is close to the best solution obtained by the solver. As future work, we seek to derive better approximation bounds for computing the minimum hub cover set.