JacSim*: An Effective and Efficient Solution to the Pairwise Normalization Problem in SimRank

Despite the fact that SimRank has been successfully applied to various applications as a link-based similarity measure, it suffers from a counter-intuitive property called a <italic>pairwise normalization problem</italic>; JacSim is a powerful variant of SimRank that alleviates this problem. In this paper, we first point out three existing drawbacks of JacSim and then propose <italic>JacSim*</italic> to effectively <italic>solve</italic> them; JacSim* <italic>exploits</italic> those paths neglected by JacSim in similarity computation, its matrix form provides the <italic>exact</italic> similarity scores while <italic>not</italic> being sensitive to the number of node-pairs with common neighbors, and it has simpler, easier to understand, and easier to implement formulas in <italic>both</italic> iterative and matrix forms than those of JacSim. We conduct extensive experiments with <italic>eight</italic> real-world datasets to evaluate <italic>both</italic> the accuracy and performance of JacSim* in comparison with those of JacSim. Our experimental results demonstrate that JacSim* shows <italic>better</italic> accuracy than JacSim and the JacSim* matrix form is <italic>dramatically</italic> faster than its own iterative form and also than the two forms of JacSim with <italic>all</italic> datasets.


I. INTRODUCTION
In many domains such as social networks, citation networks, bio-medical drug molecules, and the World Wide Web, graphs are widely used to encode relational structures where nodes represent objects and links represent their relationships in the domain [1]-[3]. In a wide range of applications such as recommender systems, spam detection, web page ranking, and social network analysis, computing accurate similarity among nodes based on the graph structure is a fundamental task [1], [3]. Toward this end, various link-based similarity measures (in short, similarity measures) such as SimRank [4] and its variants [1], [5]-[7] have been proposed in the literature. The philosophy behind SimRank is that two objects are similar if they are related to similar objects and any object is most similar to itself [4]. SimRank recursively computes the similarity between two nodes a and b as the average of the similarities between all possible pairs of neighbors pointing to a and b (i.e., in-neighbors), where the similarity between a node and itself is defined as one (i.e., the base case of the recursion); this averaging is called the pairwise normalization [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Sathish Kumar.
It is worth noting that, to compute the similarity between two nodes, some existing similarity measures such as Struct-Sim [8] exploit the roles of nodes in the graph based on the automorphic equivalence property; however, SimRank and its variants compute the similarity score of a pair of nodes by exploiting their neighbors (i.e., in-link paths) regardless of their roles in the graph. Graph similarity learning methods compute the similarity between two graphs by applying learning techniques (e.g., graph embedding methods) [9]-[11], while similarity measures compute the similarity between two nodes in a single graph. Graph embedding methods exploit the graph structure to represent each node in the graph as a low-dimensional vector [12], [13], and then the similarity of two nodes can be computed by applying vector-based measures (e.g., Cosine and Euclidean distance [14]) to their corresponding vectors [8], [15]; in contrast, similarity measures directly exploit the graph structure to compute the similarity of nodes. It has been shown that similarity measures outperform graph embedding methods in computing node similarity [8], [15].
Although SimRank has been successfully applied to many applications such as clustering [16], citation analysis [17], [18], query rewriting [19], k-nearest neighbor search [20]-[22], and link prediction [23], it suffers from a counter-intuitive property caused by the pairwise normalization, where a larger number of common in-neighbors may adversely affect the similarity score of a pair of nodes [1], [5], [6], [24]; this is called the pairwise normalization problem [1]. In the literature, different variants of SimRank have been proposed to alleviate this problem. MatchSim [5] exploits only the pairs of similar (i.e., matched) in-neighbors instead of considering all possible pairs of in-neighbors. PSimRank [24], C-Rank [6], and JacSim [1] employ the Jaccard coefficient (i.e., Jaccard) [14] along with the pairwise normalization to address the problem; PSimRank and C-Rank behave similarly but apply different normalization techniques. In contrast to PSimRank and C-Rank, JacSim avoids redundancy in computation and assigns an importance factor to the two scores computed based on Jaccard and the pairwise normalization; it has been shown that JacSim significantly outperforms SimRank, PSimRank, C-Rank, and MatchSim in terms of both accuracy and performance (i.e., execution time) [1].
As such, JacSim is an excellent variant of SimRank that successfully alleviates the pairwise normalization problem. In this paper, we first point out the following drawbacks of JacSim: 1) JacSim does not exploit some paths in the graph, which limits the accuracy of similarity computation; 2) the JacSim matrix form provides approximate similarity scores, thereby providing lower accuracy than that of its iterative form; 3) although the JacSim matrix form is significantly faster than its iterative form, it is still slow even with small graphs if the graph contains a large number of node-pairs with common in-neighbors. Then, we propose JacSim*, which not only effectively solves the above three drawbacks but also preserves the JacSim philosophy in similarity computation to solve the pairwise normalization problem. JacSim* exploits those paths neglected by JacSim in similarity computation, its matrix form provides the exact similarity scores and identical accuracy to that of the iterative form, and the JacSim* matrix form is composed of only matrix-based operations, thereby not being sensitive to the number of node-pairs with common in-neighbors. We conduct extensive experiments with eight real-world datasets to evaluate both the accuracy and performance of our JacSim* in comparison with those of JacSim. Our experimental results with all datasets demonstrate that JacSim* improves the accuracy of JacSim and that the JacSim* matrix form is dramatically faster than its own iterative form and also than both forms of JacSim.
Our contributions in this paper are summarized as follows:
• We propose JacSim*, which exploits the paths neglected by JacSim in similarity computation while solving the pairwise normalization problem.
• We propose a matrix form for JacSim*, which is dramatically faster than its own iterative form and both forms of JacSim while providing the exact similarity scores and not being sensitive to the number of node-pairs with common in-neighbors in the graph.
• JacSim* has formulas in both iterative and matrix forms that are simpler, easier to understand, and easier to implement than those of JacSim.
• We conduct extensive experiments with eight real-world datasets to validate the accuracy and performance of our JacSim* in comparison with JacSim.
The remainder of this paper is organized as follows. In Section II, we provide some preliminaries about the pairwise normalization problem, JacSim, and its drawbacks. In Section III, we present our proposed similarity measure, JacSim*, in detail. Section IV explains the experimental settings and analyzes the results of our experiments. In Section V, we conclude and summarize the paper.

II. PRELIMINARIES
In this section, we provide brief explanations of the pairwise normalization problem, JacSim, and its drawbacks.

A. PAIRWISE NORMALIZATION PROBLEM
In spite of the current success of SimRank, it suffers from the pairwise normalization problem, which is a counter-intuitive property of the pairwise normalization where a larger number of common in-neighbors may adversely affect the similarity score of a pair of nodes [1], [5], [6], [24]. Consider the sample graph in Fig. 1; nodes h and i have one common direct in-neighbor (i.e., c), while nodes k and l have two of them (i.e., f and g). Therefore, the similarity score of node-pair (k, l) should intuitively be higher than that of node-pair (h, i); however, SimRank assigns a lower similarity score to (k, l) (i.e., 0.16) than to (h, i) (i.e., 0.20). The same situation is observed for node-pairs (m, n) and (p, q), neither of which has any common direct in-neighbors; m and n have one common indirect in-neighbor (i.e., c), while p and q have two of them (i.e., f and g). This means that the similarity score of (p, q) should be higher than that of (m, n); however, the SimRank score of the former node-pair (i.e., 0.042) is lower than that of the latter one (i.e., 0.053).
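The effect described above can be reproduced with a minimal sketch of the standard SimRank iteration. The graph below is a hypothetical toy graph (it is not the graph of Fig. 1), and the function name `simrank`, the node names, and the value C = 0.8 are assumptions chosen for illustration.

```python
from itertools import product

def simrank(in_nbrs, C=0.8, iterations=10):
    """Naive iterative SimRank; in_nbrs maps each node to its direct in-neighbors."""
    nodes = list(in_nbrs)
    # S_0: every node is fully similar to itself, all other pairs start at 0.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                new[(a, b)] = 0.0
            else:
                # Pairwise normalization: average over ALL in-neighbor pairs.
                total = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim

# Hypothetical toy graph: r1..r7 are root nodes without in-neighbors.
in_nbrs = {
    "r1": [], "r2": [], "r3": [], "r4": [], "r5": [], "r6": [], "r7": [],
    "h": ["r1", "r2"], "i": ["r1", "r3"],              # one common in-neighbor
    "k": ["r4", "r5", "r6"], "l": ["r4", "r5", "r7"],  # two common in-neighbors
}
scores = simrank(in_nbrs)
print(scores[("h", "i")], scores[("k", "l")])
```

In this toy graph, (k, l) shares two in-neighbors but receives 0.8 · 2/9 ≈ 0.178, which is lower than the 0.8 · 1/4 = 0.2 assigned to (h, i) with a single common in-neighbor, because the sum is divided by the number of all in-neighbor pairs.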

B. JACSIM
JacSim [1] is a powerful variant of SimRank that alleviates the pairwise normalization problem by employing both Jaccard and the pairwise normalization. Suppose that $G = (V, E)$ is an unweighted and directed² graph where $V$ is a set of nodes, $E \subseteq V \times V$ is a set of links among nodes, and $I_a$ denotes the set of nodes directly pointing to node $a$ (i.e., direct in-neighbors); JacSim computes the similarity score of a node-pair $(a, b)$, $S(a, b)$, as follows. If $a = b$, then $S(a, b) = 1$; if $a \neq b$ and $I_a = \emptyset$ or $I_b = \emptyset$, then $S(a, b) = 0$; otherwise $S(a, b)$ is calculated by the following recursive formula:

$$S(a, b) = C \left( \alpha \cdot \frac{|I_a \cap I_b|}{|I_a \cup I_b|} + (1 - \alpha) \cdot \frac{\sum_{i \in I_a - I_b} \sum_{j \in I_b} S(i, j) + \sum_{i \in I_b - I_a} \sum_{j \in I_a \cap I_b} S(i, j)}{|I_a||I_b| - |I_a \cap I_b|^2} \right) \tag{1}$$

where the left and right sides of the + operator in the main parenthesis, computed by Jaccard and the pairwise normalization, are referred to as the J-score and P-score, respectively. $\alpha \in (0, 1]$ is an importance factor that controls the degree of importance of these two scores in similarity computation and $C \in (0, 1)$ is a damping factor. Consider our sample graph in Fig. 1 again. As explained before, the SimRank score of node-pair (h, i) with one common direct in-neighbor is higher than that of node-pair (k, l) with two common direct in-neighbors due to the pairwise normalization problem; however, JacSim assigns a higher similarity score to (k, l) (i.e., 0.128) than to (h, i) (i.e., 0.080). In contrast to SimRank, JacSim also assigns a higher similarity score to (p, q) (i.e., 0.0204) than to (m, n) (i.e., 0.0128).³

C. JACSIM DRAWBACKS
Now, let us point out the existing drawbacks of JacSim as follows.
D1: As observed in (1), to compute the similarity score of any node-pair $(a, b)$, JacSim neglects all in-neighbor pairs $(i, j)$ where both $i$ and $j$ point to both $a$ and $b$ (i.e., $i, j \in I_a \cap I_b$) in calculating the P-score, since $(I_a - I_b) \cap I_b = \emptyset$ in the $\sum_{i \in I_a - I_b} \sum_{j \in I_b} S(i, j)$ part and $(I_b - I_a) \cap (I_a \cap I_b) = \emptyset$ in the $\sum_{i \in I_b - I_a} \sum_{j \in I_a \cap I_b} S(i, j)$ part. As an example, consider nodes i and j in the sample graph in Fig. 1. To compute S(i, j), all node-pairs (d, d), (d, e), (e, d), and (e, e) are neglected in calculating the P-score (i.e., $I_i \cap I_j = \{d, e\}$). Node-pairs (d, d) and (e, e) are ignored to alleviate the pairwise normalization problem (i.e., the similarity score based on the common direct in-neighbors is computed by Jaccard); however, by ignoring node-pairs (d, e) and (e, d), the contribution of node a, the common indirect in-neighbor pointing to both i and j via nodes d and e, is disregarded in computing the similarity between i and j. In other words, JacSim does not exploit some of the paths in the graph in similarity computation.
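As a short worked check of D1, the number of in-neighbor pairs covered by the two sums of the P-score can be counted directly. Writing $n = |I_a \cap I_b|$ (a shorthand introduced here only for this derivation):

```latex
\begin{align*}
|I_a - I_b|\,|I_b| + |I_b - I_a|\,|I_a \cap I_b|
  &= (|I_a| - n)\,|I_b| + (|I_b| - n)\,n \\
  &= |I_a||I_b| - n^2 .
\end{align*}
```

Out of the $|I_a||I_b|$ possible in-neighbor pairs, exactly the $n^2$ pairs in $(I_a \cap I_b) \times (I_a \cap I_b)$ are therefore left out. For nodes i and j in the example above, $n = 2$, which gives the four neglected pairs (d, d), (d, e), (e, d), and (e, e).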
D2: The JacSim iterative form represented in (1) cannot be directly transformed to a matrix form for the following reason.⁴ To calculate the P-score, JacSim only partially employs the pairwise normalization on $I_a$ and $I_b$ (i.e., as explained in D1, all in-neighbor pairs $(i, j)$ where $i, j \in I_a \cap I_b$ are ignored and normalization is performed by using the value of $|I_a||I_b| - |I_a \cap I_b|^2$). Therefore, in order to transform the JacSim iterative form to a matrix form, (1) is slightly modified such that the P-score is normalized by the value of $|I_a||I_b|$ instead of $|I_a||I_b| - |I_a \cap I_b|^2$. As a result, the JacSim matrix form does not provide the exact JacSim scores (i.e., only approximate scores are computed) and its accuracy is lower than that of the iterative form, as shown in [1].

² For the sake of generality, we regard G as a directed graph since an undirected graph G can be considered as a directed one where each single link in G is represented by two links, one in each direction.
³ In (1), C and α are set as 0.8 and 0.4 by following [1].

D3: The similarity score of a node-pair $(a, b)$ is computed by the JacSim matrix form in (2), where $S, J, Q, E, I \in \mathbb{R}^{|V| \times |V|}$, $S$ is a similarity matrix whose entry $[S]_{a,b}$ denotes the similarity score of node-pair $(a, b)$, entry $[J]_{a,b}$ of matrix $J$ denotes the J-score of $(a, b)$, $Q$ is the column normalized adjacency matrix (i.e., $[Q]_{i,a} = 1/|I_a|$ if $i \in I_a$; otherwise, $[Q]_{i,a} = 0$), $Q^T$ is the transpose matrix of $Q$, $I$ is an identity matrix, and $E$ is a matrix whose entry $[E]_{a,b}$ denotes the summation of JacSim scores of all in-neighbor pairs $(i, j)$ where $i, j \in I_a \cap I_b$, normalized by the value of $|I_a||I_b|$. It has been shown that the JacSim matrix form is significantly faster than its iterative form [1].
To accelerate the matrix multiplications in (2), matrices Q and S are represented by the compressed sparse column (CSC) storage scheme [25]; however, the time complexity of the JacSim matrix form is dominated by computing matrix E. Let $p$ denote the number of node-pairs $(a, b)$ with common in-neighbors (i.e., $p = |\{(a, b) \mid I_a \cap I_b \neq \emptyset\}|$), $d$ denote the average number of nodes in $I_a \cap I_b$ for all node-pairs $(a, b)$, and $k$ be the number of predefined iterations to compute the similarity scores. As observed in (2), matrix E is computed on each iteration; the time complexity for calculating the entries in E is $O(k p d^2)$, which could be $O(k|V|^4)$ in the worst case; this computation is slow when the graph contains a large number of node-pairs with common in-neighbors. More specifically, the execution time of the JacSim matrix form with a small graph having a large number of node-pairs with common in-neighbors is longer than that with a large graph having a small number of such node-pairs. Furthermore, matrix J (i.e., the J-scores) is computed by the conventional approach (i.e., a "for" loop), which is expensive with large graphs; the JacSim matrix form is thus not computed by only matrix-based operations since both matrices E and J are computed by employing "for" loops.

⁴ The complete process of transforming the JacSim iterative form to the matrix form can be found in [1, Section 4].

III. PROPOSED MEASURE: JACSIM*
In this section, we present JacSim* that effectively solves the existing drawbacks of the original JacSim.

A. ITERATIVE FORM
As explained in Section II, JacSim does not exploit some paths in the graph to compute the similarity score of any node-pair $(a, b)$. In order to solve this issue, we propose a novel random walk model as follows: two random walkers $r_a$ and $r_b$ traverse the graph backward via in-links (i.e., incoming links to nodes) by starting at $a$ and $b$ (i.e., $a \neq b$), and the two nodes are regarded as similar if $r_a$ and $r_b$ meet at a common direct or indirect in-neighbor of $a$ and $b$. The random walkers are supposed either to meet at common direct in-neighbors (i.e., $r_a$ visits $i$ and $r_b$ visits $j$, $i = j$) with the highest probability (i.e., 1), where the similarity is computed by Jaccard (i.e., the J-score), or to traverse the graph to meet at common indirect in-neighbors of $a$ and $b$, where the similarity is computed by the pairwise normalization (i.e., the P-score). More specifically, in computing the P-score, we exploit all in-neighbor pairs $(i, j)$ where $i \neq j$ (i.e., instead of neglecting the pairs with $i, j \in I_a \cap I_b$ as JacSim does). JacSim* computes the similarity score of a node-pair $(a, b)$ as follows. If $a = b$, then $S(a, b) = 1$; if $a \neq b$ and $I_a = \emptyset$ or $I_b = \emptyset$, then $S(a, b) = 0$; otherwise $S(a, b) = S'(a, b)$, which is obtained by the following recursive formula:

$$S'(a, b) = C \left( \alpha \cdot \frac{|I_a \cap I_b|}{|I_a \cup I_b|} + (1 - \alpha) \cdot \frac{\sum_{i \in I_a} \sum_{j \in I_b} S'(i, j)}{|I_a||I_b|} \right) \tag{3}$$

where the base case of the recursion is $S'(a, b) = 0$ if $a = b$; this base case guarantees that only those in-neighbor pairs $(i, j)$ where $i \neq j$ are considered in calculating the P-score. Fig. 2 illustrates simplified versions of both JacSim* and JacSim recursive computations to calculate the similarity score of node-pair (i, j) in our sample graph from Fig. 1; in both cases, the circled numbers denote the required recursive calls (i.e., in order of their execution) to compute the P-score of (i, j). JacSim* and JacSim employ seven and five recursive calls to calculate the P-score, respectively; JacSim* exploits the two node-pairs (d, e) and (e, d) neglected by JacSim and considers the contribution of node a (i.e., the common indirect in-neighbor of i and j) in similarity computation, thereby assigning a higher similarity score (i.e., 0.1984) to (i, j) than JacSim does (i.e., 0.1600). In the case of (h, i), (k, l), (m, n), and (p, q), the similarity scores computed by JacSim* are identical to the ones computed by JacSim in our sample graph⁵; JacSim* effectively solves the pairwise normalization problem by preserving the JacSim philosophy in similarity computation.
The recursive computation in (3) can be solved by an iteration to a fixed point for $k = 1, 2, \ldots$ over $S'(a, b)$ as follows. If $a = b$, then $S_k(a, b) = 1$ for any $k$; if $a \neq b$ and $I_a = \emptyset$ or $I_b = \emptyset$, then $S_k(a, b) = 0$ for any $k$; otherwise $S_k(a, b)$ is computed by $S'_k(a, b)$:

$$S'_k(a, b) = C \left( \alpha \cdot \frac{|I_a \cap I_b|}{|I_a \cup I_b|} + (1 - \alpha) \cdot \frac{\sum_{i \in I_a} \sum_{j \in I_b} S'_{k-1}(i, j)}{|I_a||I_b|} \right) \tag{4}$$

where $S'_k(a, b) = 0$ if $a = b$; the iterative computation starts with $S'_0(a, b) = 0$ for all node-pairs $(a, b)$. The JacSim* scores are symmetric, bounded, monotonic, unique, and always existent, as shown in Appendix A. In Section IV-B2, we show that JacSim* outperforms JacSim in terms of accuracy in similarity computation with all datasets.

⁵ In both similarity measures, the values of C and α are set as 0.8 and 0.4, respectively, by following [1].
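The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `jacsim_star`, the dictionary-based graph representation, and the toy graph (which is hypothetical and not claimed to be the graph of Fig. 1) are all assumptions. The P-score averages the previous-iteration scores over all in-neighbor pairs, and pairs with i = j contribute zero.

```python
from itertools import product

def jacsim_star(in_nbrs, C=0.8, alpha=0.4, iterations=10):
    """Iterative JacSim*-style scores; in_nbrs maps node -> direct in-neighbors."""
    nodes = list(in_nbrs)
    sets = {v: set(in_nbrs[v]) for v in nodes}
    # Start from zero for every node-pair, including (a, a).
    prev = {pair: 0.0 for pair in product(nodes, repeat=2)}
    for _ in range(iterations):
        cur = {}
        for a, b in product(nodes, repeat=2):
            if a == b or not sets[a] or not sets[b]:
                # Diagonal entries stay 0 inside the recursion, so only
                # in-neighbor pairs with i != j feed the P-score.
                cur[(a, b)] = 0.0
                continue
            j_score = len(sets[a] & sets[b]) / len(sets[a] | sets[b])
            p_score = sum(prev[(i, j)] for i in sets[a] for j in sets[b])
            p_score /= len(sets[a]) * len(sets[b])
            cur[(a, b)] = C * (alpha * j_score + (1 - alpha) * p_score)
        prev = cur
    # Final scores: a node is fully similar to itself.
    return {(a, b): 1.0 if a == b else s for (a, b), s in prev.items()}

# Hypothetical toy graph: a -> d, a -> e; d, e -> i; d, e, x, y -> j.
in_nbrs = {
    "a": [], "x": [], "y": [],
    "d": ["a"], "e": ["a"],
    "i": ["d", "e"], "j": ["d", "e", "x", "y"],
}
scores = jacsim_star(in_nbrs)
print(round(scores[("d", "e")], 4), round(scores[("i", "j")], 4))
```

In this toy graph, the pairs (d, e) and (e, d) contribute to the score of (i, j) through their common in-neighbor a, which is exactly the kind of path that a measure ignoring all pairs inside the intersection of the two in-neighbor sets would miss.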

B. MATRIX FORM
In this section, we provide a matrix form for our JacSim*. We start by proposing a matrix-based formula for Jaccard, which is employed to compute J-scores; the J-score of a node-pair $(a, b)$ is calculated as follows [1]:

$$[J]_{a,b} = \frac{|I_a \cap I_b|}{|I_a \cup I_b|} \tag{5}$$

We can rewrite (5) as follows:

$$[J]_{a,b} = \frac{|I_a \cap I_b|}{|I_a| + |I_b| - |I_a \cap I_b|} \tag{6}$$

In order to calculate the numerator (i.e., the size of the intersection of $I_a$ and $I_b$) in (6), we provide the following formula:

$$N = A^T \cdot A \tag{7}$$

where $A \in \mathbb{R}^{|V| \times |V|}$ is the adjacency matrix of the graph and $N \in \mathbb{R}^{|V| \times |V|}$ is a matrix whose entry $[N]_{a,b}$ indicates the size of the intersection of $I_a$ and $I_b$.
We obtain the $|I_a| + |I_b|$ part in the denominator of (6) as follows:

$$A^T \cdot \mathbb{J} + (A^T \cdot \mathbb{J})^T \tag{8}$$

where $\mathbb{J} \in \mathbb{R}^{|V| \times |V|}$ is an all-ones matrix (i.e., all elements are set as one; it is distinct from the J-score matrix $J$) and each entry $[A^T \cdot \mathbb{J}]_{*,j}$ in a column $j$ is identical to $|I_*|$ (i.e., corresponding entries in all columns are identical, where $\forall j$, $[A^T \cdot \mathbb{J}]_{*,j} = |I_*|$). As a result, in (8), the entry for node-pair $(a, b)$ is $|I_a| + |I_b|$. Now, we can calculate the denominator of (6) as follows:

$$U = A^T \cdot \mathbb{J} + (A^T \cdot \mathbb{J})^T - N \tag{9}$$

where $U \in \mathbb{R}^{|V| \times |V|}$ is a matrix whose entry $[U]_{a,b}$ indicates the size of the union of $I_a$ and $I_b$ (i.e., $[U]_{a,b} = |I_a \cup I_b|$). Finally, we can calculate the J-scores by the following matrix formula:

$$J = N \oslash U \tag{10}$$

where $\oslash$ denotes the Hadamard division (i.e., $[N \oslash U]_{a,b} = [N]_{a,b} / [U]_{a,b}$) [26] and entry $[J]_{a,b}$ denotes the J-score of $(a, b)$. However, in directed graphs, when $I_a = I_b = \emptyset$, the Hadamard division for node-pair $(a, b)$ is undefined and yields NaN (i.e., not a number) since $[N]_{a,b} = [U]_{a,b} = 0$. In order to solve this problem, we rewrite (10) as (11) by using the Hadamard product $\odot$ (i.e., $[X \odot Y]_{a,b} = [X]_{a,b} \cdot [Y]_{a,b}$).
Now, we can propose a matrix form to compute the JacSim* scores in a directed graph. In (3), instead of considering only nodes $i \in I_a$ and $j \in I_b$, we consider all nodes in the graph to calculate the P-score of $(a, b)$ as follows:

$$\frac{1}{|I_a||I_b|} \sum_{i \in V} \sum_{j \in V} [A]_{i,a} \cdot S'(i, j) \cdot [A]_{j,b} \tag{12}$$

we note that $[A]_{i,a} = 1$ if $i$ directly points to $a$ (i.e., $i \in I_a$); otherwise, $[A]_{i,a} = 0$. Equation (12) can be rewritten as follows:

$$\sum_{i \in V} \sum_{j \in V} [Q^T]_{a,i} \cdot S'(i, j) \cdot [Q]_{j,b} = [Q^T \cdot S' \cdot Q]_{a,b} \tag{13}$$

where $[Q^T]_{a,i} = [A]_{i,a} / |I_a|$ and $[Q]_{j,b} = [A]_{j,b} / |I_b|$ (i.e., $Q$ is the column normalized adjacency matrix), respectively; therefore, we provide the matrix form for JacSim* in (14), where entry $[S]_{a,b}$ in matrix $S \in \mathbb{R}^{|V| \times |V|}$ denotes the JacSim* score of $(a, b)$ and $\wedge$ is the conjunction operator. The recursive formula in (14) can be solved by the iteration in (15) for $k = 1, 2, \ldots$; the computation is initialized by matrix $Z$ where all entries are set as 0.
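The matrix-based J-score computation described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: a dense 0/1 adjacency matrix with A[i, a] = 1 when i points to a, a hypothetical function name `jaccard_matrix`, and a hypothetical toy graph. The guard against 0/0 uses a masked division, which is an assumption made here and not necessarily the rewriting used in (11).

```python
import numpy as np

def jaccard_matrix(A):
    """J-scores of all node-pairs from a 0/1 adjacency matrix (A[i, a] = 1 iff i -> a)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    ones = np.ones((n, n))           # all-ones matrix
    N = A.T @ A                      # N[a, b] = size of the in-neighbor intersection
    D = A.T @ ones                   # D[a, b] = number of in-neighbors of a, for every b
    U = D + D.T - N                  # U[a, b] = size of the in-neighbor union
    J = np.zeros_like(N)
    # Masked Hadamard division: entries with an empty union stay 0 instead of NaN.
    np.divide(N, U, out=J, where=U > 0)
    return J

# Hypothetical toy graph: a -> d, a -> e; d, e -> i; d, e, x, y -> j.
nodes = ["a", "x", "y", "d", "e", "i", "j"]
idx = {v: n for n, v in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for src, dst in [("a", "d"), ("a", "e"), ("d", "i"), ("e", "i"),
                 ("d", "j"), ("e", "j"), ("x", "j"), ("y", "j")]:
    A[idx[src], idx[dst]] = 1.0

J = jaccard_matrix(A)
print(J[idx["i"], idx["j"]], J[idx["d"], idx["e"]], J[idx["a"], idx["x"]])
```

Nodes a and x have no in-neighbors, so their entry would be 0/0 under a plain element-wise division; the masked division leaves it at zero.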
Let us clarify the following points about the JacSim* matrix form: 1) As explained step by step, we employed a straightforward mathematical process to transform the JacSim* iterative form to the matrix form without applying any changes to the original JacSim* formula in (3); our matrix form provides the exact JacSim* scores with no approximation. 2) Contrary to JacSim, our matrix form is represented and calculated only by matrix-based operations, even in the case of matrix J. 3) Matrix J is computed once and reused in all iterations, there are two matrix multiplications in each iteration, and all matrices are represented by the CSC storage scheme [25]; let m be the number of non-zero entries in matrix Q, then the time complexity to compute the JacSim* matrix form is O(km|V|).
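To make the three points above concrete, the following is a minimal dense NumPy sketch of a matrix-style iteration. It is an illustration under assumptions, not the authors' formula in (14) and (15): the diagonal is handled with `np.fill_diagonal` in place of the paper's logical operators, dense arrays are used in place of the CSC representation, and the function name `jacsim_star_matrix` and the toy graph are hypothetical.

```python
import numpy as np

def jacsim_star_matrix(A, C=0.8, alpha=0.4, iterations=10):
    """Matrix-style JacSim* sketch; A[i, a] = 1 iff node i points to node a."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    indeg = A.sum(axis=0)                                  # in-degree of every node
    # Column normalized adjacency matrix; columns of nodes without in-neighbors stay 0.
    Q = np.divide(A, indeg, out=np.zeros_like(A), where=indeg > 0)
    # J-scores, computed once and reused in all iterations.
    N = A.T @ A
    U = indeg[:, None] + indeg[None, :] - N
    J = np.divide(N, U, out=np.zeros_like(N), where=U > 0)
    S = np.zeros((n, n))                                   # all-zero initial matrix
    for _ in range(iterations):
        # Two matrix multiplications per iteration: Q^T @ S and (...) @ Q.
        S = C * (alpha * J + (1 - alpha) * (Q.T @ S @ Q))
        np.fill_diagonal(S, 0.0)       # only pairs with i != j feed the next P-score
    np.fill_diagonal(S, 1.0)           # a node is fully similar to itself
    return S

# Hypothetical toy graph: a -> d, a -> e; d, e -> i; d, e, x, y -> j.
nodes = ["a", "x", "y", "d", "e", "i", "j"]
idx = {v: n for n, v in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for src, dst in [("a", "d"), ("a", "e"), ("d", "i"), ("e", "i"),
                 ("d", "j"), ("e", "j"), ("x", "j"), ("y", "j")]:
    A[idx[src], idx[dst]] = 1.0

S = jacsim_star_matrix(A)
print(round(S[idx["d"], idx["e"]], 4), round(S[idx["i"], idx["j"]], 4))
```

On this toy graph the result agrees with a direct evaluation of the iterative form in (4), which illustrates why no approximation is needed: normalizing every P-score by the product of the two in-degrees is exactly what the column normalized matrix Q does.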
In Section IV-B3, we show that the JacSim* matrix form is dramatically faster than its iterative form and also than both forms of JacSim. In Appendix B, we propose JacSim* formulas (i.e., in both forms) that exploit out-neighbors instead of in-neighbors in similarity computation with directed graphs. We note that both of the JacSim* formulas proposed in Section III and Appendix B can be equivalently applied to undirected graphs.

IV. EXPERIMENTAL EVALUATION
In this section, we evaluate the accuracy and performance of our JacSim* in comparison with those of JacSim.

A. EXPERIMENTAL SETTINGS
We employ the following eight real-world datasets; Table 1 shows some statistics of our datasets. Amazon [27] is a product co-purchasing graph collected by crawling the Amazon website (i.e., if products a and b are frequently co-purchased, there is a link between them in the graph). The node labels denote the product category. To perform a reasonable evaluation, we neglected labels with fewer than ten corresponding nodes; this graph is partially tagged by 71 labels.
BlogCatalog [12] is a graph where nodes represent bloggers and links represent their social relationships. The node labels denote blogger interests inferred through the metadata provided by the bloggers; this graph is fully tagged by 39 labels.
CoraCitation [28] is a citation graph where nodes represent academic papers in the area of computer science and links represent citation relationships among papers. The node labels denote the paper's topic (e.g., reinforcement learning); this graph is fully tagged by 70 labels.
DBLP [1] is a citation graph of papers in the areas of data mining and databases published in 2006 and earlier. The node labels denote the papers' research topics, created based on a famous data mining book [29] where the papers relevant to a chapter's research topic have been grouped together in the bibliographic section of the chapter; the graph is partially tagged by 11 labels corresponding to 11 chapters of the book.
EmailEU [30] is a graph constructed based on the email communication data of a European research institution (i.e., if member a sent at least one email to member b, there is a link from a to b in the graph). The node labels denote the working department of the member; this graph is fully tagged by 42 labels.
LiveJournal [27] is a graph representing social relationships among bloggers. The node labels denote bloggers' interests, which are explicitly stated by the bloggers themselves. To perform a reasonable evaluation, we chose labels with more than ten and fewer than a hundred nodes; this graph is fully tagged by 7,086 labels.
TREC [1] is a hyperlink graph constructed based on TREC 2003 where nodes represent webpages and links represent the hyperlinks among them. The node labels indicate the relevant query topic for the webpages, which are created based on LETOR 3.0 [31], a benchmark collection for research on learning to rank for information retrieval, released by Microsoft Research Asia; this graph is partially tagged by 11 labels.
Wikipedia [32] is a co-occurrence graph of words appearing in the first million bytes of the English Wikipedia dump. The labels represent the inferred Part-of-Speech (POS) tags of words. This graph is fully tagged by 40 labels.
To evaluate the accuracy, we utilize MAP (mean average precision), precision, recall, F-score [14], and PRES [33] as evaluation metrics. In each dataset D, labels are considered as ground truth sets and every single node tagged by a label l is used as a query node q for a similarity-based search to find the top-t (t = 5, 10, 20, 30) nodes similar to q as a result set; if a node in the result set is originally tagged by l, it is labeled as relevant, otherwise as irrelevant. For each value of t, we compute average precision (AP), precision, recall, F-score, and PRES for q; we take their average values over all the queries tagged by l to get the metric values for l. Then, we compute the average values of MAP, precision, recall, F-score, and PRES over all labels in the dataset for t. Finally, the average values of the five metrics over all values of t are regarded as the final accuracy for D.
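The per-query part of this protocol can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data; it assumes the common convention of averaging precision over the relevant results found in the top-t list, which may differ from the exact AP definition the authors use, and it omits PRES.

```python
def precision_recall_f(ranked, relevant, t):
    """Precision, recall, and F-score of the top-t results of one query."""
    hits = sum(1 for node in ranked[:t] if node in relevant)
    precision = hits / t
    recall = hits / len(relevant) if relevant else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score

def average_precision(ranked, relevant, t):
    """Mean of precision@rank over the ranks at which a relevant node appears."""
    hits, total = 0, 0.0
    for rank, node in enumerate(ranked[:t], start=1):
        if node in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical query: ranked result list and the nodes sharing the query's label.
ranked = ["b", "x", "c", "y", "d"]
relevant = {"b", "c", "d", "e"}
p, r, f = precision_recall_f(ranked, relevant, t=5)
ap = average_precision(ranked, relevant, t=5)
print(p, r, round(f, 4), round(ap, 4))
```

Averaging these per-query values over all queries of a label, then over all labels, and finally over t = 5, 10, 20, 30 would follow the aggregation order described above.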
For JacSim, the damping factor C is set as 0.8 and the importance factor α is set as 0.4 by following [1]. In our experiments, we do not consider SimRank, PSimRank, C-Rank, and MatchSim since it has been shown that JacSim significantly outperforms all of those similarity measures in terms of both accuracy and performance [1]. All the experiments were performed on an Intel machine equipped with sixteen 3.60 GHz i9-9900K CPUs, 128 GB RAM, and a 64-bit Fedora Core 33 operating system. All code is implemented in Python 3.8.

B. RESULTS AND ANALYSES
In this section, we first perform parameter tuning for JacSim* and then analyze our experimental results.

1) PARAMETER TUNING
First of all, we note that for the parameter tuning and the accuracy comparison in Section IV-B2, we employ the JacSim* matrix form represented by (15) instead of its iterative form represented by (4), since the matrix form provides the exact JacSim* scores and is dramatically faster than the iterative form, as discussed in Section IV-B3. As explained in Section III, JacSim* has two parameters: C, the damping factor, and α, the importance factor; we aim to figure out how the accuracy of JacSim* changes with different values of C and α. Our experimental results with all datasets show that the differences in the accuracy of JacSim* with different values of C (i.e., C ∈ {0.4, 0.6, 0.8}) are negligible; thus, we set the value of C as 0.8 in accordance with JacSim. To find the best value of α, we conduct extensive experiments as follows. For each dataset D, we set the value of α in (15) from 0.1 to 0.9 in steps of 0.1 and evaluate the accuracy of JacSim* on ten iterations for each case (i.e., we consider 720 = 8 × 9 × 10 different cases in total); the value of α providing the highest accuracy is selected as the best value for D. We do not consider α = 0 and α = 1.0 since, in the former case, the similarity scores of all node-pairs (a, b) would be zero on all iterations and, in the latter case, the similarity scores of all node-pairs (a, b) would be computed based only on J-scores on all iterations, where only direct in-neighbors of a and b are exploited. As an example, Table 2 shows the results of parameter tuning with the DBLP and LiveJournal datasets; the values in parentheses show the iterations on which the highest accuracy is observed (e.g., the highest accuracy with the DBLP dataset when α = 0.5 is observed on iteration 5) and the values in boldface show the highest accuracy.
As observed in Table 2, JacSim* shows the highest accuracy with DBLP and LiveJournal datasets when α = 0.2 and α = 0.3, respectively. Table 3 summarizes the complete results of our parameter tuning where the highest accuracy of JacSim* is observed when α is set as 0.2 or 0.3 with all datasets except one case (i.e., the Wikipedia dataset); it means JacSim* is not too sensitive to the value of α. For the sake of brevity, we set the value of α as 0.2 with all datasets for our experimental evaluations in Sections IV-B2 and IV-B3.

2) ACCURACY COMPARISON
In this section, we evaluate the accuracy of JacSim* in comparison with that of JacSim with our datasets in terms of MAP, precision, PRES, recall, and F-score; in this comparison, we consider the iterative form of JacSim since it shows higher accuracy than its matrix form [1]. Let us start with the undirected graphs in the Amazon, BlogCatalog, LiveJournal, and Wikipedia datasets; with each dataset, we consider the best accuracy of both JacSim* and JacSim observed in ten iterations. Table 4 presents the results of this comparison, where the numbers in parentheses show the iterations on which the best accuracies are observed (e.g., JacSim* and JacSim show their best accuracies on iterations 7 and 6 with the Amazon dataset, respectively). As observed in the table, JacSim* shows better accuracy than JacSim in terms of MAP, precision, PRES, recall, and F-score with all datasets; the reason is that, thanks to our random walk model explained in Section III-A, JacSim* exploits those paths in the graph neglected by JacSim in similarity computation, thereby providing higher accuracy. Now, we investigate the accuracy of JacSim* in comparison with that of JacSim with the directed graphs in the CoraCitation, DBLP, EmailEU, and TREC datasets; first, we exploit in-neighbors in similarity computation. In this comparison, we also consider the best accuracy of both similarity measures observed in ten iterations, and the corresponding iterations are shown in parentheses as before. Table 5 shows the results of this experiment, where our observations are in accordance with those in Table 4; JacSim* outperforms JacSim in terms of all five metrics with all datasets. Table 6 presents the results of the experiments when out-neighbors are exploited in similarity computation with our directed graphs.
As observed in the table, JacSim* shows better accuracy than JacSim in terms of MAP, precision, PRES, recall, and F-score with all datasets except the DBLP dataset, where both similarity measures show identical accuracy in terms of all five metrics. The reason is that JacSim* and JacSim have their best accuracy on the first iteration, where the similarity scores for any node-pair $(a, b)$ (i.e., $a \neq b$ and $O_a, O_b \neq \emptyset$) are computed based only on the J-scores. For simplicity, let us explain this issue based on in-neighbors as follows; it is applicable to out-neighbors as well. In the case of JacSim* (refer to Section III-A), on the first iteration, the P-score is zero since the computation starts from zero for all node-pairs; thus, $S_1(a, b) = C \cdot \alpha \cdot [J]_{a,b}$. In the case of JacSim (refer to Section II), $S_0(i, j) = 0$ for all node-pairs $(i, j)$ with $i \neq j$, and node-pairs $(i, j)$ with $i = j$ are not considered in the computation; thus, the P-score is also zero and $S_1(a, b) = C \cdot \alpha \cdot [J]_{a,b}$. Although the values of α are set as 0.2 and 0.4 in JacSim* and JacSim, as mentioned in Sections IV-B1 and IV-A, respectively, the accuracy in both cases is the same regardless of the value of α, since multiplying the identical value $[J]_{a,b}$ by a positive constant does not change the similarity-based ranking of node-pairs for any query node.
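The ranking argument above can be checked with a tiny sketch. The J-scores below are hypothetical values for one query node, and C = 0.8 together with the two values of α are taken from the experimental settings; scaling all scores by a positive constant leaves the order of the result list unchanged.

```python
import numpy as np

# Hypothetical J-scores between one query node and five candidate nodes.
j_scores = np.array([0.50, 0.10, 0.25, 0.00, 0.75])

C = 0.8
first_iter_alpha_02 = C * 0.2 * j_scores   # first-iteration scores with alpha = 0.2
first_iter_alpha_04 = C * 0.4 * j_scores   # first-iteration scores with alpha = 0.4

# Rank candidates by descending score; a stable sort keeps ties in index order.
rank_02 = np.argsort(-first_iter_alpha_02, kind="stable")
rank_04 = np.argsort(-first_iter_alpha_04, kind="stable")
print(rank_02, rank_04)
```

Since every ranking-based metric depends only on the order of the result list, identical orderings imply identical MAP, precision, recall, F-score, and PRES values.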

3) PERFORMANCE COMPARISON
In this section, we evaluate the performance (i.e., execution time) of JacSim* in comparison with that of JacSim as follows. We apply each of the JacSim* iterative form (JS*-IF), JacSim* matrix form (JS*-MF), JacSim iterative form (JS-IF), and JacSim matrix form (JS-MF), respectively represented by (4), (15), (1), and (2), to our datasets in ten iterations; we do not consider the time required to store the results of similarity computations in files or a database. Fig. 3 illustrates the execution times of the above measures with our four undirected graphs; times are reported in minutes and the execution time of each similarity measure is also written on top of its corresponding bar.
We have the following observations in Fig. 3. 1) Although JS*-IF outperforms JS-IF in terms of accuracy (refer to Section IV-B2), it is slower than JS-IF with all datasets; the reason is that JS*-IF exploits more paths in similarity computation than JS-IF does, as explained in Section III-A. 2) JS*-MF is dramatically faster than the other three similarity measures with all datasets since it employs matrices compressed by the CSC storage scheme and only matrix-based operations for similarity computation, while both JS*-IF and JS-IF employ the conventional approach (i.e., "for" loops) and JS-MF is not computed by matrix-based operations alone (i.e., matrices E and J are computed by employing "for" loops). 3) JS-MF is sensitive to the number of node-pairs with common neighbors in the graph, such that it is slower with a small graph having a large number of node-pairs with common in-neighbors than with a large graph having a small number of such node-pairs. As an example, although the BlogCatalog and Wikipedia datasets have fewer nodes than Amazon and LiveJournal (refer to Table 1), the execution times of JS-MF with the two former datasets are higher than those with the two latter ones, since the numbers of node-pairs with common neighbors in BlogCatalog (i.e., "32,787,165") and Wikipedia (i.e., "11,015,803") are larger than those in Amazon (i.e., "780,43") and LiveJournal (i.e., "1,452,366"). On the contrary, the execution time of JS*-MF depends on $|V|$; it shows the highest execution time with the largest undirected dataset (i.e., Amazon with "30,000" nodes) and the lowest execution time with the smallest one (i.e., Wikipedia with "4,777" nodes). Figures 4 and 5 illustrate the results of the performance comparison with the directed graphs where in-neighbors and out-neighbors are exploited, respectively. In these figures, our observations are in accordance with those in Fig. 3.
The execution time of JS*-IF is higher than that of JS-IF with all datasets, whether in-neighbors or out-neighbors are considered, since JS*-IF exploits more paths than JS-IF does in similarity computation. JS*-MF is dramatically faster than all other similarity measures with all datasets regardless of the exploited neighbor type since it employs only compressed matrices along with matrix-based operations for similarity computation. JS-MF is sensitive to the number of node-pairs with common neighbors in the graph regardless of the neighbor type. As an example, the DBLP dataset has fewer nodes (i.e., "21,177") than CoraCitation (i.e., "23,166"); however, the execution times of JS-MF with DBLP based on both in-neighbors and out-neighbors are higher than those with CoraCitation, since the numbers of node-pairs with common in-neighbors (i.e., "466,990") and out-neighbors (i.e., "2,214,105") in DBLP are larger than those in CoraCitation (i.e., "229,306" and "1,100,051", respectively). The execution time of JS*-MF depends on $|V|$; it shows the highest and lowest execution times with TREC as the largest directed dataset (i.e., with "43,202" nodes) and EmailEU as the smallest one (i.e., with "1,005" nodes), respectively, regardless of the exploited neighbor type.
In addition, it is worth noting that, contrary to JS-MF, JS*-MF shows almost the same execution times when exploiting in-neighbors and out-neighbors with each dataset. For example, in the case of the TREC dataset, the execution times of JS-MF with in-neighbors and out-neighbors are 33.12 and 35.70 minutes, respectively (i.e., a difference of 2.58 minutes), while those of JS*-MF are 8.56 and 8.55 minutes, respectively (this very small difference is caused by fluctuating system resources such as CPU load). This again shows that the performance of JS-MF depends on the number of node-pairs with common neighbors in the graph, while that of JS*-MF depends on the number of nodes.
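To make the quantity that JS-MF is sensitive to more concrete, the following sketch counts node-pairs with at least one common in-neighbor in an edge list. It is an illustration only: the function name `count_pairs_with_common_in_neighbors`, the two toy graphs, and the choice to count unordered pairs of distinct nodes are assumptions, and the counting procedure used in the experiments may differ.

```python
from itertools import combinations


def count_pairs_with_common_in_neighbors(edges):
    """Count unordered node-pairs {a, b}, a != b, that share at least one in-neighbor."""
    out_neighbors = {}
    for src, dst in edges:
        out_neighbors.setdefault(src, set()).add(dst)

    pairs = set()
    for targets in out_neighbors.values():
        # Any two nodes pointed to by the same source share that source as an in-neighbor.
        for a, b in combinations(sorted(targets), 2):
            pairs.add((a, b))
    return len(pairs)


# Hypothetical graphs (assumptions for illustration).
hub_edges = [(0, t) for t in range(1, 6)]      # 6 nodes: node 0 points to 5 others
chain_edges = [(i, i + 1) for i in range(9)]   # 10 nodes: a simple directed path

hub_pairs = count_pairs_with_common_in_neighbors(hub_edges)
chain_pairs = count_pairs_with_common_in_neighbors(chain_edges)
print(hub_pairs, chain_pairs)
```

Here the smaller hub graph has more such node-pairs than the larger path graph, which mirrors the situation described above in which a dataset with fewer nodes can still be more expensive for JS-MF.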

V. CONCLUSION
In this paper, we first pointed out three existing drawbacks of JacSim, a powerful variant of SimRank that alleviates the pairwise normalization problem, as follows: 1) JacSim neglects some paths in the graph in similarity computation, which adversely affects its accuracy; 2) the JacSim matrix form provides approximate similarity scores and thus shows lower accuracy than its iterative form; and 3) the JacSim matrix form still suffers from low performance since it is sensitive to the number of node-pairs with common neighbors in the graph. Then, we proposed JacSim*, which effectively solves the above three issues along with the pairwise normalization problem: it shows higher accuracy than JacSim with eight real-world datasets; its matrix form provides the exact similarity scores and identical accuracy to that of the iterative form while being dramatically faster than its own iterative form and both forms of JacSim; its matrix form is not sensitive to the number of node-pairs with common neighbors; and it has simpler, easier to understand, and easier to implement formulas in both iterative and matrix forms than the original JacSim.
We identified the following interesting directions for future work. We plan to extend JacSim* to compute the similarity of nodes in signed graphs, where two types of links (i.e., positive and negative) exist. It has been shown that negative links contain additional information, which is beneficial to various tasks such as link sign prediction and node classification in signed graphs [34]. Furthermore, to substantially improve the performance of JacSim* with very large graphs (i.e., with millions or billions of nodes), we plan to propose an acceleration technique (e.g., partial sums memoization [35]) for JacSim*; in this case, we need to apply such a technique only to the matrix multiplications in (15), since matrix J is computed once and reused in all iterations.

APPENDIX A
In this section, we show that the JacSim* scores are symmetric, bounded, monotonic, unique, and always existent.
By the bounding and monotonicity properties, for any node-pair $(a, b)$, $S_k(a, b)$ is bounded and non-decreasing as $k$ increases; hence, the sequence $S_k(a, b)$ converges to a limit $\lim_{k \to \infty} S_k(a, b) = S(a, b) \in [0, 1]$, according to the Completeness Axiom of calculus. Moreover, $\lim_{k \to \infty} S_{k+1}(a, b) = \lim_{k \to \infty} S_k(a, b) = S(a, b)$ and the limit of a sum is identical to the sum of the limits; therefore, the limit $S(a, b)$ satisfies the recursive JacSim* formula. Since $0 < \alpha < 1$ and $0 < C < 1$, we have $0 < C \cdot (1 - \alpha) < 1$, which means $M = 0$.
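The convergence behavior discussed above can be illustrated with a scalar analogue. The sketch below is not the JacSim* formula itself: it assumes a one-dimensional recursion $x_{k+1} = \alpha \cdot j + (1 - \alpha) \cdot C \cdot x_k$ with $x_0 = 0$, whose contraction factor is $C \cdot (1 - \alpha)$, and the parameter values and the function name `iterate_scores` are hypothetical choices made only for illustration.

```python
def iterate_scores(j_score, alpha, c, iterations):
    """Return [x_0, ..., x_n] for x_{k+1} = alpha*j_score + (1 - alpha)*c*x_k with x_0 = 0."""
    values = [0.0]
    for _ in range(iterations):
        values.append(alpha * j_score + (1 - alpha) * c * values[-1])
    return values


# Hypothetical parameter values (assumptions, chosen only for illustration).
ALPHA, C, J_SCORE = 0.2, 0.8, 0.5

sequence = iterate_scores(J_SCORE, ALPHA, C, iterations=50)

# Closed-form fixed point of x = alpha*j + (1 - alpha)*c*x.
fixed_point = ALPHA * J_SCORE / (1 - (1 - ALPHA) * C)

print(sequence[1], sequence[10], sequence[-1], fixed_point)
```

In this toy setting the iterates are non-decreasing, stay within $[0, 1]$, and approach the single fixed point geometrically because the factor $C \cdot (1 - \alpha)$ is strictly below one.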

APPENDIX B
In this section, we provide the JacSim* formulas that exploit out-neighbors instead of in-neighbors in similarity computation in directed graphs. Let us define $O_a$ as the set of nodes directly pointed to by node $a$ (i.e., the direct out-neighbors of $a$). Then, JacSim* computes the similarity score of a node-pair $(a, b)$ based on out-neighbors as follows. If $a = b$, then $S(a, b) = 1$; if $a \neq b$ and $O_a = \emptyset$ or $O_b = \emptyset$, then $S(a, b) = 0$; otherwise, $S(a, b)$ is obtained by the recursive formula (16), where the base case of the recursive computation assigns a score of 0 to identical node pairs (i.e., $a = b$); this base case guarantees that only those out-neighbor pairs $(i, j)$ where $i \neq j$ are considered in calculating the P-score. The overall mathematical process to transform the JacSim* iterative form into a matrix form based on out-neighbors is the same as the one presented in Section III-B, except that we exploit out-neighbors instead of in-neighbors. In the resulting matrix form (17), entry $[S]_{a,b}$ of matrix $S \in \mathbb{R}^{|V| \times |V|}$ denotes the similarity score of $(a, b)$, and $Q \in \mathbb{R}^{|V| \times |V|}$ is the row-normalized adjacency matrix. We note that both recursive formulations in (16) and (17) can be solved by iterating to a fixed point, similarly to the ones explained in Section III.