Clustering and Closure Coefficient Based on k-CT Components

Real-world networks contain many cliques since they are usually built from them. The analysis that goes behind the cliques is fundamental because it discovers the real structure of the network. This article proposes new high-order closed trail clustering and closure coefficients for the evaluation of the network structure. These coefficients are able to describe the inner structure of the network concerning its randomized or organized behavior. Moreover, the coefficients can cluster networks with similar structures together. The experiments show that the coefficients are useful in both the local and global context.


I. INTRODUCTION
The networks that are built in real life have many standard features. The most important feature is related to their evolution and the process through which they are created. Co-authorship networks depict the fact that authors co-authored a book or a paper. If the book has three authors, a clique on three vertices is added to the network, similarly for more authors. Social networks suggest a connection between members according to the current relationship, regardless of whether it is on the internet or in real life. Therefore, the analysis that goes behind the cliques is fundamental.
The global clustering coefficient (or transitivity [1]) is a standard approach to characterizing networks and the tendency of their vertices to cluster. The article [2] defines a higher-order clustering coefficient as a natural generalization of the traditional clustering coefficient. Higher-order cliques beyond triangles are crucial to an understanding of complex networks and the clustering behavior of their vertices concerning the standard metric on network structures.
The local closure coefficient [3] is defined in a similar way to the standard local clustering coefficient. It is a metric quantifying head-node-based edge clustering, and it is defined as the fraction of length-2 paths starting from the head node that induce a triangle. This small difference in definition leads to properties that differ from those of the traditional clustering coefficient. Benson et al. [4] developed a generalized framework for clustering networks based on higher-order connectivity patterns. Their results show that networks exhibit rich higher-order organizational structures detected by clustering based on higher-order connectivity patterns. The article [5] continues with the idea of higher-order graph clustering, and the authors present a class of local graph clustering methods that incorporate higher-order network information captured by network motifs. The higher-order structure is also the focus of the article [6]. The authors found that tie strength and edge density are the competing positive indicators of higher-order organization. These trends are consistent across interactions that involve a different number of nodes.
The associate editor coordinating the review of this manuscript and approving it for publication was Tu Ngoc Nguyen.
Measuring the distance between two nodes in a graph is a frequent task. The standard measure for this distance is the length of the shortest path, d_SP(u, v), between two nodes u, v in a graph [7], [8]. Another measure that can be used is the expected commute time distance [9]. Variants of node distances are described in detail in [10]-[13].
The closed trail distance, d_CT(u, v), is a metric in the graph that is based on the definition of a biconnected component. The distance between two vertices in the graph is defined as the length of the shortest closed trail that contains these two vertices. A k-CT component is a maximal subgraph in which the closed trail distance between any two vertices is less than or equal to k.
The detected k-CT components highlight locally and cyclically connected subgraphs. Moreover, these components are not based on the biconnectivity property and, therefore, they can easily partition densely connected biconnected components, in which it is otherwise more difficult to partition the graph and detect the structure of communities. A list of the largest biconnected components in the selected networks was published by Leskovec et al. [14].
Local clustering and closure coefficients measure the tendency of vertices to be in a cluster. Both are based on the expansion of the clique. The higher-order clustering and closure coefficients are based on higher-order (bigger) cliques. In the graph, we can detect a dense subgraph which is not a clique but is very close to a clique. This subgraph can be composed of numerous smaller cliques, and they create the k-CT component. The new approach to the clustering and closure coefficients is based on the expansion of k-CT components to a (k + 1)-CT subgraph. Higher-order clustering and closure coefficients are integrated into the clustering and closure k-CT coefficients because all 3-CT components are cliques. Sparser subgraphs, which can contain structural holes of the graph or chains of k-CT components with a smaller k, are detectable via k-CT components with a higher k.
The organization of the article is as follows. First, the terminology and the notation used in the article are introduced. In the next section, the closed trail distance in connected undirected graphs without bridges is defined. Moreover, the new coefficients based on k-CT components are introduced. These coefficients extend the clustering and closure coefficients and characterize the tendency of vertices to participate in some (k + 1)-CT subgraph. Section IV contains the experimental results for selected real networks and two types of generated networks. In the conclusion, the advantages and limitations of the defined coefficients are discussed.

II. TERMINOLOGY AND NOTATION
In this section, knowledge of graph theory will be required. The definitions of the following terms were taken from [15]: A walk on a graph is an alternating series of vertices and edges v_0, e_1, v_1, ..., e_k, v_k such that for j = 1, ..., k the vertices v_{j-1} and v_j are the endpoints of the edge e_j. A closed walk is a walk in which the initial vertex is also the final vertex. The length of a walk is the number of edges. We will denote the length of a walk as |W(u, v)|. A trail is a walk in which no edge occurs more than once. A closed trail is a closed walk with no repeated edges. We will denote a closed trail which contains the vertices u, v as CT(u, v). A path is a walk in which no edge or internal vertex occurs more than once (a trail in which all the internal vertices are distinct). We will denote a path with an initial vertex u and a final vertex v as P(u, v). A circuit is a closed trail. A cycle is a closed path with a length of at least one, and an induced cycle of length four or more is a hole. A clique is a subgraph in which each vertex is adjacent to every other vertex. We will denote the clique with k vertices as Q_k. The diameter of a graph is the maximum of the distances between any pair of vertices in the graph.
A connected graph is a graph such that between every pair of vertices there exists a walk. A biconnected graph is a connected and "nonseparable" graph, meaning that if any vertex were to be removed, the graph would remain connected. A component of a graph is a maximal connected subgraph. An edge e is a bridge (cut-edge) of the connected graph G if {e} is a disconnecting edge-set of G. An articulation is a vertex of a graph whose removal increases the number of components. Therefore, a biconnected graph has no articulation vertices. A biconnected component is a maximal biconnected subgraph.
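The coefficients in this article are defined on connected graphs without bridges, so in practice the bridges have to be located before anything else is computed. The following is a minimal sketch of one standard way to do this (a depth-first search with low-link values), assuming a simple undirected graph stored as a dictionary that maps each vertex to the set of its neighbors. The function name `find_bridges` and the data layout are illustrative assumptions and are not taken from the article.

```python
def find_bridges(adj):
    """Return the bridges of a simple undirected graph as (parent, child) pairs.

    adj: dict mapping each vertex to the set of its neighbors (assumed layout).
    Uses recursion, so it is only suitable for small graphs.
    """
    disc = {}      # discovery time of each vertex
    low = {}       # lowest discovery time reachable from the DFS subtree
    bridges = []
    counter = 0

    def dfs(x, parent):
        nonlocal counter
        disc[x] = low[x] = counter
        counter += 1
        for y in adj[x]:
            if y == parent:
                continue  # valid because the graph has no parallel edges
            if y in disc:
                low[x] = min(low[x], disc[y])   # back edge
            else:
                dfs(y, x)
                low[x] = min(low[x], low[y])
                if low[y] > disc[x]:
                    bridges.append((x, y))      # no back edge bypasses (x, y)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return bridges


# A triangle 0-1-2 with a pendant vertex 3: only the edge {0, 3} is a bridge.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(find_bridges(example))
```

For the networks with thousands of vertices used later in the article, an iterative version of the same search would be needed to avoid the recursion limit.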

III. COEFFICIENTS BASED ON CLOSED TRAIL DISTANCE
The closed trail distance is a metric between vertices in a connected graph without bridges and loops. It is useful for the detection of subgraphs with a specified CT-distance among the vertices. A maximal k-CT subgraph (a k-CT component) is a k-CT subgraph that cannot be extended by including one more adjacent vertex.
Let G = (V, E) be a connected graph without bridges and loops, and for u, v ∈ V let
$$d_{CT}(u, v) = \begin{cases} 0 & \text{if } u = v, \\ \min |CT(u, v)| & \text{if } u \neq v, \end{cases}$$
where CT(u, v) is a closed trail that contains the vertices u, v. Then the function d_CT is called the closed trail distance (CT-distance).
Theorem 1: The CT-distance is a metric on the set V.
The theorem was proven in the article [16].
The lemma was proven in the article [16].
The CT-distance d_CT is a metric in any connected graph without bridges and defines the distance between two nodes u and v.
We can define the CT-distance for a disconnected graph or a connected graph with bridges in this way:
Definition 3: The CT-distance between the vertices u and v is equal to ∞ (d_CT(u, v) = ∞) if no closed trail containing these vertices exists.

B. HIGHER-ORDER CLUSTERING AND CLOSURE COEFFICIENTS
The local clustering coefficient [17] of a vertex u in the network G = (V, E) is the fraction of wedges centered at the vertex u that are closed. The wedge W^c_2 is a subgraph composed of a clique Q_2 and an edge which are connected in the vertex u (see Table 1, coefficient C_2, 2-wedge). The local higher-order clustering coefficient for the vertex u is defined in [2] as:
$$C_k(u) = \frac{k\,|K_{k+1}(u)|}{|W^c_k(u)|} = \frac{k\,|K_{k+1}(u)|}{(d_u - k + 1)\,|K_k(u)|},$$
where K_k(u) is the set of k-cliques containing u, W^c_k(u) is the set of k-wedges (see Table 1, coefficient C_4, 4-wedge) with their center in the vertex u, and d_u is the degree of the vertex u. If |W^c_k(u)| = 0, then C_k(u) is undefined. The average kth-order clustering coefficient \bar{C}_k is the mean of the local kth-order clustering coefficients,
$$\bar{C}_k = \frac{1}{|V_k|} \sum_{u \in V_k} C_k(u),$$
where V_k is the set of nodes in the network in which the local kth-order clustering coefficient is defined.
The global higher-order clustering coefficient of the network G = (V, E) is defined in [2] as:
$$C_k = \frac{(k^2 + k)\,|K_{k+1}|}{|W^c_k|},$$
where K_{k+1} is the set of (k + 1)-cliques in G and W^c_k is the set of k-wedges. A k-wedge is composed of a clique with k vertices and an edge, which are connected in the vertex u that is common to the clique and the edge.
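The local and average higher-order clustering coefficients of [2] can be illustrated on a small graph. The following is a minimal brute-force sketch, assuming the wedge count |W^c_k(u)| = (d_u - k + 1)|K_k(u)| from [2] and a graph stored as a dictionary of neighbor sets. The function names are illustrative, and the clique enumeration is only practical for small graphs.

```python
from itertools import combinations


def cliques_containing(adj, u, size):
    """Count the cliques on `size` vertices that contain the vertex u."""
    count = 0
    for others in combinations(sorted(adj[u]), size - 1):
        # The chosen neighbors of u must also be pairwise adjacent.
        if all(b in adj[a] for a, b in combinations(others, 2)):
            count += 1
    return count


def local_clustering(adj, u, k=2):
    """C_k(u) = k |K_{k+1}(u)| / ((d_u - k + 1) |K_k(u)|), or None if undefined."""
    wedges = cliques_containing(adj, u, k) * (len(adj[u]) - k + 1)
    if wedges <= 0:
        return None  # no k-wedge is centered at u
    return k * cliques_containing(adj, u, k + 1) / wedges


def average_clustering(adj, k=2):
    """Mean of C_k(u) over the vertices where it is defined."""
    values = [local_clustering(adj, u, k) for u in adj]
    values = [c for c in values if c is not None]
    return sum(values) / len(values) if values else None


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 0.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(example, 0), average_clustering(example))
```

For k = 2 the function reduces to the classical local clustering coefficient, which is a convenient sanity check for the indexing convention.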
A closure coefficient [3] is defined in a similar way. The local closure coefficient of a vertex u in the network G = (V, E) is the fraction of the wedges headed at the vertex u that are closed. The wedge W^h_2 is a subgraph composed of a clique Q_2 and an edge which are connected in the vertex v, and the vertex u is the head of the edge (see Table 1, coefficient H_2, 2-wedge). The local higher-order closure coefficient H_k(u) for the vertex u is defined analogously as the fraction of the k-wedges in W^h_k(u) that are closed. Here K_k(u) is the set of k-cliques containing u and W^h_k(u) is the set of k-wedges (see Table 1) headed at the vertex u. The average kth-order closure coefficient \bar{H}_k is the mean of the local kth-order closure coefficients over V_k, where V_k is the set of nodes in the network in which the local kth-order closure coefficient is defined.
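For the basic case k = 2, the local closure coefficient of [3] can be computed directly from degrees and triangle counts. The sketch below assumes the standard form H_2(u) = 2 T(u) / W^h(u), where T(u) is the number of triangles containing u and W^h(u) is the number of length-2 paths that start at u, i.e. the sum of (d_v - 1) over the neighbors v of u. The function name and the dictionary-of-sets layout are illustrative.

```python
from itertools import combinations


def local_closure(adj, u):
    """Local closure coefficient H_2(u), or None if no wedge is headed at u."""
    # Each neighbor v of u contributes (d_v - 1) length-2 paths u - v - w.
    wedges = sum(len(adj[v]) - 1 for v in adj[u])
    if wedges == 0:
        return None
    # A triangle through u is a pair of adjacent neighbors of u.
    triangles = sum(1 for v, w in combinations(adj[u], 2) if w in adj[v])
    # Every triangle closes two of the paths headed at u.
    return 2 * triangles / wedges


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 0.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
for vertex in example:
    print(vertex, local_closure(example, vertex))
```

On this example the pendant vertex 3 has a defined closure coefficient (two paths start at it, none is closed), whereas its clustering coefficient is undefined, which illustrates the different behavior of the two measures mentioned in the introduction.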

C. CLUSTERING AND CLOSURE COEFFICIENTS BASED ON k − CT COMPONENTS
We denote the set of all k-CT subgraphs containing the vertex u as k-CT(u). The set of all the k-CT components which contain the vertex u is denoted as M_k(u). From the k-CT component and an edge, a CT wedge is created by analogy with the k-wedges; the CT wedge centered at the vertex u is denoted as W^c_{k-CT}, and the CT wedge headed at the vertex u (see Table 1, row H^CT_4) is denoted as W^h_{k-CT}. The shortest closed trail which contains two vertices has to have a length greater than or equal to 3. This is the reason why the coefficients are defined for k ≥ 3.
The local k-CT clustering coefficient of a vertex u in the network G = (V, E) is the fraction of the CT wedges centered at the vertex u that are closed. The local higher-order k-CT clustering coefficient C^CT_k(u) for the vertex u is this fraction taken over the wedge set W^c_{k-CT}(u). If |W^c_{k-CT}(u)| = 0, then C^CT_k(u) is undefined. The average kth-order clustering CT coefficient \bar{C}^CT_k is the mean of the local kth-order clustering CT coefficients, where V_k is the set of nodes in the network where the local kth-order clustering CT coefficient is defined. The local k-CT clustering coefficient can be interpreted as the expansion of k-CT components to (k + 1)-CT subgraphs (see Table 1, row C^CT_4).
The global k-CT clustering coefficient C^CT_k is defined as the fraction of the k-CT centered wedges in the network that are closed, meaning that they induce a (k + 1)-CT subgraph in the network.
The local k-CT closure coefficient of a vertex u in the network G = (V, E) is the fraction of the CT wedges headed at the vertex u that are closed. The local higher-order k-CT closure coefficient H^CT_k(u) for the vertex u is this fraction taken over the wedge set W^h_{k-CT}(u). The average kth-order closure CT coefficient \bar{H}^CT_k is the mean of the local kth-order closure CT coefficients, where V_k is the set of nodes in the network in which the local kth-order closure CT coefficient is defined.

D. METHODS FOR k − CT COMPONENTS COMPUTATION
In the graph G = (V, E), we need to detect all the maximal k-CT subgraphs (k-CT components) for the computation of the coefficients. The k-CT components are detected from the matrix of closed trail distances. We denote this full matrix as T, where T_ij = d_CT(i, j).
All the triangles and quadrangles in the graph are detected first to fill the entries of the matrix T with the CT-distances 3 and 4. The Chiba and Nishizeki algorithm [18] is used for these computations. The CT-distances d_CT(i, j) ≥ 5 are detected via the connection of the two shortest disjoint paths [19] between i and j, where the connection of these shortest paths creates a closed trail.
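The idea of joining two disjoint shortest paths into a closed trail can be sketched as a small minimum-cost flow computation. The code below is an illustration under stated assumptions, not the authors' implementation: it sends two units of flow from u to v with unit capacity and unit cost on every edge, using successive shortest paths with a queue-based Bellman-Ford search. Since a trail may repeat vertices but not edges, the two paths are required to be edge-disjoint. The function name `ct_distance` and the edge-list input are hypothetical choices.

```python
import math
from collections import deque


def ct_distance(edges, u, v):
    """Length of the shortest closed trail through u and v (math.inf if none)."""
    if u == v:
        return 0
    head, cap, cost = [], [], []  # arc i and arc i ^ 1 form a forward/residual pair
    out = {}                      # vertex -> indices of arcs leaving it

    def add_arc(a, b):
        out.setdefault(a, []).append(len(head))
        head.append(b); cap.append(1); cost.append(1)
        out.setdefault(b, []).append(len(head))
        head.append(a); cap.append(0); cost.append(-1)

    for a, b in edges:            # an undirected edge becomes two opposite arcs
        add_arc(a, b)
        add_arc(b, a)

    total = 0
    for _ in range(2):            # two augmentations give two edge-disjoint paths
        dist, via = {u: 0}, {}
        queue, queued = deque([u]), {u}
        while queue:
            x = queue.popleft()
            queued.discard(x)
            for i in out.get(x, []):
                y = head[i]
                if cap[i] > 0 and dist[x] + cost[i] < dist.get(y, math.inf):
                    dist[y] = dist[x] + cost[i]
                    via[y] = i
                    if y not in queued:
                        queue.append(y)
                        queued.add(y)
        if v not in dist:
            return math.inf       # no second edge-disjoint path, e.g. a bridge
        total += dist[v]
        x = v
        while x != u:             # push one unit of flow back along the path
            i = via[x]
            cap[i] -= 1
            cap[i ^ 1] += 1
            x = head[i ^ 1]
    return total


# Two triangles sharing vertex 2 ("bow-tie").
bow_tie = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
print(ct_distance(bow_tie, 0, 1), ct_distance(bow_tie, 0, 3))
```

Running this for every pair of vertices fills the matrix T, although for the distances 3 and 4 the dedicated triangle and quadrangle listing described above is the cheaper route.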
The k-CT component in the graph G = (V, E) is a maximal clique in the weighted graph G_k = (V, E_k), where {i, j} ∈ E_k ⇔ T_ij ≤ k. Maximal cliques in G_k are detected with the Bron-Kerbosch algorithm [20].
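The thresholding step and the clique search can be sketched as follows. This is a minimal illustration, assuming that T is given as a square list of lists indexed by vertex number and using the Bron-Kerbosch recursion with pivoting; the function name `k_ct_components` is hypothetical, and single-vertex cliques are dropped as a choice made only for this sketch.

```python
def k_ct_components(T, k):
    """Maximal cliques of G_k, where {i, j} is an edge iff T[i][j] <= k."""
    n = len(T)
    adj = {i: {j for j in range(n) if j != i and T[i][j] <= k} for i in range(n)}
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already processed
        if not p and not x:
            cliques.append(sorted(r))
            return
        # The pivot with most neighbors in p minimizes the branching.
        pivot = max(p | x, key=lambda w: len(adj[w] & p))
        for w in list(p - adj[pivot]):
            expand(r | {w}, p & adj[w], x & adj[w])
            p.remove(w)
            x.add(w)

    expand(set(), set(range(n)), set())
    return sorted(c for c in cliques if len(c) > 1)


# CT-distance matrix of two triangles {0, 1, 2} and {2, 3, 4} sharing vertex 2.
T = [
    [0, 3, 3, 6, 6],
    [3, 0, 3, 6, 6],
    [3, 3, 0, 3, 3],
    [6, 6, 3, 0, 3],
    [6, 6, 3, 3, 0],
]
print(k_ct_components(T, 3))
print(k_ct_components(T, 6))
```

Entries equal to infinity (pairs with no common closed trail) never pass the threshold test, so such pairs are never placed in the same component.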

IV. EXPERIMENTAL RESULTS
The experiments concentrate on comparing standard coefficients and k-CT coefficients in selected real networks and two types of generated networks. The real networks used for the experiments are the same as in the article [2]. Biological networks are represented by the datasets C.elegans (a complete neural system) and Dros.-medulla (neural connections). Zachary Karate Club is a small real social network, and fb-Stanford and fb-Cornell are online friendship social networks on Facebook among university students from 2005. Two co-authorship networks are constructed from arxiv submission categories (arxiv-AstroPh and arxiv-HepPh). Human communication networks are created from emails (email-Enron-core, email-Eu-core) and Facebook-like messages among colleges (CollegeMsg). Oregon2-01052 is a technological network of an autonomous system. A Barabási-Albert (BA) model [21] of a network was used for generating 14 networks with increasing numbers of edges (2, 3, 5, 7, . . . , 50) attached from a new node to existing nodes. The generating process was repeated 15 times, with 5 different random seeds for each maximal number of vertices n ∈ {100, 150, 200}. The result of the generating is 14 × 15 = 210 networks.
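The BA generation step can be sketched with a small preferential-attachment generator. This is an illustrative sketch rather than the generator used in the experiments: it assumes that the process starts from m isolated vertices, that each new vertex attaches to m distinct existing vertices chosen with probability proportional to their degree, and that the seed handling shown here is acceptable. The function name `barabasi_albert_edges` is hypothetical.

```python
import random


def barabasi_albert_edges(n, m, seed=None):
    """Edge list of a BA-style graph on n vertices, m edges per new vertex."""
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))   # the first new vertex attaches to all m starters
    repeated = []              # each vertex appears once per incident edge
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        # Sampling from `repeated` picks vertices proportionally to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


# One configuration of the sweep: n = 100 vertices, m = 5, one fixed seed.
edges = barabasi_albert_edges(100, 5, seed=1)
print(len(edges))
```

Repeating the call over the chosen values of m, the three sizes n, and the random seeds yields a family of networks of the kind described above; the number of edges of each network is m(n - m) under the assumptions of this sketch.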
A Watts-Strogatz (WS) model [17] of the network is the second model which was used for generating 20 networks.
A brief description of the network parameters for all types of networks is summarized in Table 2. Table 3 contains two specific networks (arxiv-HepPh and arxiv-AstroPh) that have the biggest average shortest path distances and whose CT-diameters (dim_CT) are not (2 * dim_SP) or (2 * dim_SP + 1). Figure 1 describes a situation in which dim_CT is bigger than (2 * dim_SP + 1). All the k-CT coefficients are calculated for smaller networks. Bigger networks have a huge number of k-CT components (see Table 4), which leads to a more expensive computation of the k-CT coefficients for k ≥ 4. Higher-order clustering and closure coefficients use cliques of various sizes, as do the C^CT_3 and H^CT_3 coefficients.
The global coefficient C^CT_k is the fraction of k-CT centered wedges that are closed, meaning that they induce a (k + 1)-CT subgraph. When k = dim_CT - 1, every k-CT centered wedge has to be closed to a (k + 1)-CT subgraph, because dim_CT is the maximal value of the CT-distance between the vertices, so the CT-distance between the vertices of a closed wedge cannot exceed dim_CT = k + 1. Then the coefficient C^CT_{dim_CT - 1} has a value equal to one (see Figure 2(a)).
The average clustering and closure k-CT coefficients \bar{C}^CT_k and \bar{H}^CT_k behave similarly to the global coefficient C^CT_k for the selected networks (see Figure 2). Figure 2(b) shows the tendencies of the average coefficients.
With an increase in the parameter m, networks created with the BA model (see Figure 3(a)) have a higher density and hence a smaller diameter, and an increasing value of C^CT_k, which goes to one either quickly or slowly. These networks with m ≥ 5 have dim_CT ≤ 7, and then the value of C^CT_6 is equal to one. Networks created with the WS model (see Figure 3(b)) start with a regular graph, and an increasing rewiring probability p leads to a random graph with the same density and a smaller CT-diameter. The coefficient C^CT_6 increases in the range {5, . . . , 11} of the CT-diameter with increasing p. The coefficient C^CT_3 decreases to zero for increasing p because the CT-diameter decreases to 8 while the density stays the same as in the regular network.
The 3-CT clustering coefficient is calculated with cliques of all sizes, and the resulting value corresponds to the cumulative value of the higher-order clustering coefficients. The coefficients C^CT_3 and C_2 very often have a similar tendency (see Figure 4) when they are viewed as a function of the node degree. The same tendency is more significant for the closure coefficients H^CT_3 and H_2 in the selected network (see Figure 4(b)).
The selected real networks have different values for their global and average local coefficients (see Figure 5(a)). The 3-CT coefficients mostly have a greater range and a higher value. The calculation of the coefficients is not restricted only to cliques. The 3-CT coefficients represent the fraction of 4-CT subgraphs to wedges based on cliques. The 4-CT subgraphs are still dense, but they are not as strict as the cliques. The extension to the k-CT components allows the calculation with parts of the graph in which the CT-distance between the vertices is at most k.

V. CONCLUSION
This article proposes new higher-order clustering and closure coefficients based on closed trails that were designed for the discovery of the features of networks that are behind their clique-based structure. In many networks, cliques represent the way the network is built. The co-authorship networks contain cliques of co-authors connected with other cliques through common authors. Actor-Actor networks are built using the interconnection of cliques of actors. Therefore, the structure behind the cliques is the real structure of the networks. The coefficients C^CT_3 and H^CT_3, as well as their averaged values \bar{C}^CT_3 and \bar{H}^CT_3, provide completely new knowledge about networks. The coefficients' values can identify the nature of the networks and characterize their chaotic or organized behavior. Moreover, we demonstrated the relationship between the selected networks and the Barabási-Albert and Watts-Strogatz models. We computed the coefficients for both models with different parameters, so any network may be compared to them and the most similar parameter for each model may be chosen. Both parameters may be used as features for network similarity measurements because of the differences in the behavior of each model. The experiments were performed on the largest connected components without bridges of 11 well-known networks, with sizes from hundreds of nodes up to seventeen thousand nodes and up to eight hundred thousand edges. The results show that the coefficients are able to distinguish between different types of networks and cluster the networks across the source area.
PETR PROKOP received the B.Sc. and M.Sc. degrees in computer science from the VSB-Technical University of Ostrava, Ostrava, Czech Republic, in 2016 and 2018, respectively. He is currently pursuing the Ph.D. degree in computer science focused on data science and machine learning. His current research interests include big data and social network analysis.
VÁCLAV SNÁŠEL (Senior Member, IEEE) received the master's degree in numerical mathematics from the Faculty of Science, Palacky University, Olomouc, Czech Republic, in 1981, and the Ph.D. degree in algebra and number theory from Masaryk University, Brno, Czech Republic, in 1991. His research and development experience includes more than 30 years in the industry and academia. He is currently a Full Professor with the VSB-Technical University of Ostrava, Ostrava, Czech Republic. He also works in a multidisciplinary environment involving artificial intelligence, social networks, conceptual lattice, information retrieval, semantic web, knowledge management, data compression, machine intelligence, and nature and bio-inspired computing applied to various real-world problems. He has authored or coauthored several refereed journals/conference papers, books, and book chapters. He is the Chair of the IEEE International Conference on Systems, Man, and Cybernetics and the Czechoslovak Chapter. He also served as an Editor/Guest Editor for several journals, such as Engineering Applications of Artificial Intelligence (Elsevier), Neurocomputing (Elsevier), and Journal of Applied Logic (Elsevier).
PAVLA DRÁŽDILOVÁ (Member, IEEE) received the Ph.D. degree in computer science from the VSB-Technical University of Ostrava, Ostrava, Czech Republic, in 2012. She has coauthored over 40 scientific articles published in proceedings and journals. Her citation report consists of 58 citations and H-index of five on Web of Science, 137 citations and H-index of seven on Scopus, and 300 citations and H-index of ten on Google Scholar. Her research interests include data mining, social and complex networks, and graph theory.
JAN PLATOŠ (Member, IEEE) received the Ph.D. degree in computer science from the VSB-Technical University of Ostrava, Ostrava, Czech Republic, in 2006.
He became an Associate Professor in computer science in 2014. Since 2017, he has been the Head of the Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava. He has coauthored more than 200 scientific articles published in proceedings and journals. His citation report consists of 338 citations and H-index of ten on Web of Science, 800 citations and H-index of 14 on Scopus, and 1213 citations and H-index of 17 on Google Scholar. His research interests include text processing, data compression, bio-inspired algorithms, information retrieval, data mining, data structures, and data prediction.