A Framework for Subgraph Detection in Interdependent Networks via Graph Block-Structured Optimization

As the backbone of many real-world complex systems, networks interact with others in nontrivial ways from time to time. It is a challenging problem to detect subgraphs that have dependencies on each other across multiple networks. Instead of devising a method for a specific scenario, we propose a generic framework to discover subgraphs in multiple interdependent networks, which generalizes the classical subgraph detection problem in a single network and can be applied to more practical applications. Specifically, we propose the Graph Block-structured Gradient Hard Thresholding Pursuit (GB-GHTP) framework to optimize interdependent networks with block-structured constraints, which enjoys 1) a theoretical guarantee and 2) a nearly linear time complexity on the network size. It is demonstrated how our framework can be applied to three practical applications: 1) evolving anomalous subgraph detection in dynamic networks, 2) anomalous subgraph detection in networks of networks, and 3) connected dense subgraph detection in dual networks. We evaluate our framework on large-scale datasets with comprehensive experiments, which validate our framework’s effectiveness and efficiency.


I. INTRODUCTION
A graph 1 G = (V, E) refers to a set of entities, denoted as nodes V, along with some connections between node pairs, represented by edges E. Due to its generic structure, graphs have the capability to model various applications, including the natural sciences, social sciences, and engineering [1]- [3]. A canonical challenging problem in graph analytics is the detection of subgraphs. Subgraph detection is useful in many fields, such as intrusion detection in computer networks [4], [5], disease outbreak detection [6], event detection in activity networks [7], [8], and traffic congestion detection [9], [10]. In this paper, we focus on subgraph detection in attributed networks, in which nodes in a graph are associated The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Asif . 1 In this article, the terms graph and network are used interchangeably.
with attributes. Assume a node i is associated with an attribute vector w i ∈ R P ; then the attribute matrix defined on the whole graph can be denoted as W ∈ R P×N , where |V| = N . Subgraph detection in attributed networks usually refers to a problem that finds a subset of nodes whose attributes are anomalous or significant compared to those nodes outside of the subset [11]. The problem simultaneously deals with the structure of a network (e.g. connectivity, density, compactness or isomorphism) and its attributes on nodes. Generally, the typical subgraph detection problem in isolated attributed networks can be formulated as a combinatorial optimization problem as follows: s.t. S satisfies some predefined topological constraints (1) where F(·) is a user-specified objective function depending on applications, and S is any subset of V of a network. The subgraph to be found is determined by the definition of F(·), which has different definitions in different applications. However, in reality, networks rarely appear in isolation. It is common that real-world systems interact with each other. For example, diverse critical infrastructures are coupled together, such as systems of food and water supplies, fuel, communications, financial transactions and power generation and transmission [12]. Specifically, thermal power stations forming the nodes of a power grid require fuel supplied via road or pipe networks and are also controlled by the nodes in a communication network. Although the transportation network does not depend on the power grid to function, the communication network does. Thus cascading failures of a system can originate from the deactivation of a critical number of nodes in either the power grid or the communication network. If the two networks are treated in isolation, this important feedback effect cannot be seen, which further affects the location of malfunctioning nodes. However, jointly detecting significant nodes in both power grid and communication networks can provide deep insight into the functionality dependencies between the two networks. In the following, we introduce interdependent networks to model multiple networks with dependencies across different networks. Interdependent networks are comprised of multiple networks {G 1 , . . . , G k , . . . } and dependency edges E 0 , where G k = (V k , E k ) and E 0 is the set of dependencies between networks. The elements in E 0 are determined by a specific application. For example, in the example of power grid, the edge set E 0 refers to connections between thermal power stations and road or pipe network. V k and E k refer to nodes and edges of the k th network. Multiple networks interacting with each other appear in almost every aspect of science and technology. For instance, a dynamic network can be viewed as multiple networks with implicit node-level temporal dependency, in which each network represents a snapshot of the dynamic network at a specific timestamp. In such networks, every node's attributes in the current timestamp implicitly depend on attributes in the previous timestamp (as shown in Fig. 1a). Another trivial example of interdependent networks is the web-scale social network consisting of many communities. In such networks, communities can be viewed as small networks or blocks that have explicit connections between each other (as shown in Fig. 1b), which technically form a network of networks. A nontrivial example of interdependent networks is the dual networks built from the citation dataset, where one network builds the coauthorships among researchers, and the other network models the research interests between researchers. Both networks have identical node sets (researchers), whereas the edge sets represent the coauthorships and research interest similarities. An example of dual network is shown in Fig. 1c.
Because of the ubiquity of attributed interdependent networks, it is useful to propose generic methods to solve the problem of subgraph detection in interdependent networks. Correspondingly, subgraph detection in multiple interdependent networks can be framed as a block-structured optimization problem with multiple topological constraints on blocks: min S k ⊆V k F(S 1 , . . . , S K ) = f (S 1 , . . . , s.t. S k satisfies some predefined topological constraints (2) where f (·) is a user-specified function to capture signals on nodes of interdependent networks, and g(V i , V j ) models dependencies between network G i and network G j . S k is a subset of nodes in the k th network G k , k = 1, . . . , K . K refers to the number of networks in the interdependent networks. For example, subgraph detection in a dynamic network finds a sequence of subsets of nodes in a sequence of blocks, where the detected subgraphs in each block must satisfy a predefined topological constraint and subgraphs at two consecutive timestamps share some consistency on attributes [4]. K denotes the number of timestamps in this scenario.
It is readily concluded that the vanilla subgraph detection problem (1) is a special case of problem (2) when the number of networks (blocks) is 1. Problem (2) is built on discrete space. Since the solutions for combinatorial optimization with topological constraints are undeveloped, we naturally turn to nonconvex optimization (optimization techniques for continuous space) for help. To make nonconvex optimization suitable for our scenario, the relaxation of problem (2) from discrete space to continuous space is needed. Then, we can apply a series of nonconvex optimization techniques, such as stochastic gradient descent and Adam [13]. However, due to exponentially many solutions of subgraph detection in interdependent networks, it is infeasible to search the exact solution for a large network (e.g. |V| ≥ 10, 000) with brute force in an acceptable time [4], [14]. Hence, most existing methods for addressing this problem find an approximation or suboptimal solution heuristically within an acceptable runtime, which attempts to balance effectiveness and efficiency. To the best of our knowledge, most related studies on subgraph detection in interdependent networks focus on a specific application and lack generality. Furthermore, they are heuristic-driven without any theoretical guarantee. In this paper, we explore possible solutions for graph block-structured optimization by leveraging sparse optimization theories and approximate projections for graph-structured sparsity, aiming to provide a generic framework for subgraph detection problems in interdependent networks with tractable computation as well as provable theoretical guarantees. The contributions of our research can be summarized as follows: • Design of a framework for graph block-structured optimization. We propose a novel generic framework, named Graph Block-structured Gradient Hard Thresholding Pursuit (GB-GHTP), for the graph block-structured optimization problem, which is efficient and useful for approximately solving a broad of class of subgraph detection problems in interdependent networks in nearly linear time on the network size.
• Theoretical guarantees. We analyze the theoretical properties of our proposed framework and prove that the framework enjoys a good convergence rate and a tight error bound on the quality of the results. The time complexity of our algorithm is also be analyzed, which is nearly linear with the network size and has provably good efficiency.
• Comprehensive experiments in multiple practical applications. We demonstrate that our framework can be applied to three practical applications: 1) evolving anomalous subgraph detection in dynamic networks, 2) anomalous subgraph detection in networks of networks, and 3) connected dense subgraph detection in dual networks. Comprehensive experiments are conducted on both synthetic and real-world datasets to validate the effectiveness and efficiency of our framework. The remaining parts of this paper are organized as follows. Section II discusses related work relevant to our research. Section III introduces the relaxation and formulation of our problem. Section IV presents an efficient framework for general graph block-structured optimization and its theoretical properties. Section V shows how to model three practical applications as graph block-structured optimization problems and solve them with our framework.
Comprehensive experiments on synthetic and real-world datasets are presented in Section VI. Section VII concludes the paper and describes future work.

II. RELATED WORK A. SUBGRAPH DETECTION IN ATTRIBUTED NETWORKS
Subgraph detection in attributed networks often refers to finding those nodes and edges whose behaviors are significantly different from the behaviors of those outside the subgraphs [11]. The detected subgraphs are usually supposed to satisfy some constraints, such as connected subgraphs, dense subgraphs, compact subgraphs, subgraphs with regular shapes (e.g., circles and rectangles), and subgraphs that are isomorphic to a query graph. There are a large number of applications or problems concerned with subgraph detection. They can be listed as follows: detection of subnetwork biomarkers [15], detection of road traffic congestion events, detection of abnormally high breakage in a distribution network [16], detection of disease outbreaks [6], [17], and event detection in social networks [6], [7]. According to the dynamics of attributes, attributed graphs fall into two categories: static graphs and dynamic graphs. A typical example of a static graph is the molecular structure of proteins and nanomaterials, whose molecular structure does not change over time [2], [3]. Social networks are a good example of dynamic graphs, in which friendship links are added or removed at any time; thus, the graph changes over time. For subgraph detection in static networks, those methods can be further divided into two parts, which handle spatial networks and complex networks. For spatial networks like a street network, most studies are statistical approaches. These methods typically assume that the attributes (e.g., traffic volumes in a street intersection) follow some distribution in Euclidean space. The goal is then to detect whether there exists a subarea where the attributes are in the same distribution but with a higher density parameter. For example, expectation-based statistics can be used to scan subsets and find anomalous space areas [18], while these methods do not consider any topological constraints. Others consider graph structures and propose the graph scan method [19] to detect connected subgraphs. Some studies also extend the graph scan method and introduce soft constraints on temporal consistency to find dynamic patterns [20]. The graph scan methods and their variants use the LTSS property, which rules out subgraphs that are suboptimal and dramatically reduces the search space [19]. For subgraph detection in static complex networks, the nonparametric heterogeneous graph scan was proposed to detect events in heterogeneous social networks [6] and can be used for civil unrest prediction, rare disease outbreak detection, and early detection of human rights events. All of the above methods are statistical approaches and cannot provide any theoretical guarantee. Alternatively, a class of subgraph detection problems can be framed as a general submodular (but not monotone) maximization problem and used to detect activity in networks [7]. Another work relaxes the nonconvex constraints to convex and introduces the constraints as a regularization term, which provides a performance bound [17]. However, the method is not scalable to large graphs (≥ 1, 000 nodes). For subgraph detection in dynamic networks research, topological constraints on graphs and attributes' dependencies between different timestamps are considered. Meden [21] mines the heaviest dynamic subgraph (region) with the maximum score defined on nodes and edges. NetSpot [14] defines smoothness between different timestamps. Both methods prune most instances to be searched and make themselves scalable. Dynamic GraphScan [20] uses expectation-based statistics and soft consistency constraints, which are efficent and can scale with the instance size. These methods for dynamic networks are statistical approaches without theoretical guarantees and are designed for some specific scenarios, which restricts their applications. The aforementioned algorithms [4]- [6], [9], [10] mainly leverage statistical theories, which cannot optimize on raw data and thus rarely give any theoretical guarantee for optimization.

B. STRUCTURED SPARSE OPTIMIZATION
In the past decade, sparsity has arisen as an important tool in many fields, such as statistics, signal processing, and machine learning. In many settings, sparsity is useful because it enables us to identify structures in high-dimensional data while still remaining a mathematically tractable concept [22]. Structured sparsity models refer to a class of sparsity models that discover patterns in high-dimensional data with prior knowledge about their structures. Recently, a number of structured sparsity models defined through trees [23], groups, clusters, and paths [24] have been proposed. Generally, an optimization problem based on a structured sparsity model can be defined as where f : R n → R is a differentiable objective function, the support set supp(x) denotes the set of indices of nonzero entries in x, and M is the sparsity model and represents a family of structured supports, i.e., M = {S 1 , S 2 , . . . , S L }, where S i ⊆ [n] = {1, 2, . . . , n} satisfies a certain structure constraint (e.g., groups, trees or clusters). For example, a k-sparsity model is defined as Structured sparse optimization can be implemented by two main methods: 1) encoding the structured sparsity model as an induced norm and embedding it into the objective function, where the induced norm is usually non-Euclidean and nonsmooth. Or we can 2) leverage a projection oracle on M, which is defined as and decompose the problem into two subproblems, including the optimization of f (x) independent of the structured sparsity constraints and the projection problem (4). Most methods via a projection oracle require exact solutions to the projection problem (4), which are usually unavailable. For instance, if we require the connectivity of the supp(x), the projection oracle is reduced from prize collection Steiner tree problem that is NP-hard. However, when we use an approximate projection, the theoretical guarantees of those methods no longer hold. A recent approach named Graph-CoSaMP [22], [25], attempts to introduce an approximation framework for sparsity structured models defined via graphs and provide a theoretical guarantee, in which an efficient approximate projection algorithm that runs in nearly linear time is proposed. There are two components in the approximate projection algorithm, including head and tail approximate projections, which provide a theoretical guarantee as long as they are utilized jointly. Although Graph-CoSaMP shows good performance for finding trees or clusters in data with graph structures, it is only applicable in linear regression or compressive sensing. Some works have generalized Graph-CoSaMP and proposed algorithms for graph-structured sparsity optimization problems [24], [26], [27]. They evaluate their methods in the problems of connected subgraph detection and interesting subspace detection. Reference [24] proposed Graph-IHT and Graph-GHTP to solve the structured optimization problem on a single graph. These methods are variants of iterative hard thresholding (IHT) and gradient hard thresholding pursuit (GHTP) [28], respectively, in which the projection oracle is approximately solved by the head and tail approximations.
References [26], [27] used the same idea as [24] on match pursuit (MP) and designed the Graph-MP and SG-Pursuit methods. Graph-MP aims for subgraph detection, while the task of SG-Pursuit detects subspace. Our work generalizes the aforementioned ideas in [24], [26], [27] in that we can solve combinatorial optimization problems with topological constraints via structured sparsity optimization. Other works such as [5], [10] have been designed for uncovering specific-shape subgraphs via nonparametric statistics, which do not possess the ability to run on raw data. More importantly, those works only handle structural constraints defined on an isolated network (i.e. M defined on a single graph in problem (3)). We propose a generic framework to solve general optimization problems on multiple networks with interdependency. Thus, previous methods [24], [26], [27] are special cases of our framework when constraints are defined on a single network. The aforementioned methods, IHT and MP, are two classical algorithms for general sparse optimization problems. GHTP is an improved method based on IHT and GHTP, which iterates between 1) a standard gradient descent step and 2) a hard thresholding step [28]. Our framework follows the iterative scheme of GHTP and integrates the approximate projection oracles head and tail in Graph-CoSaMP. Then we generalize GHTP from the trivial sparse optimization problem to a generic problem setup of graph block-structured optimization, and propose a framework to optimize structured data with multiple blocks, which can be deployed to detect subgraphs across interdependent networks. In this setting, structured optimization on single graph is a special case of our framework for multiple-block-structured optimization when the number of blocks is 1. VOLUME 8, 2020

C. NETWORK ANALYSIS
Complex systems can be modeled as complex networks, which usually encompass many subsystems that interact with or depend on each other. These networks composed of multiple interdependent networks are also known as multilayer networks, networks of networks or multiplex networks [29]. Studies in this field mostly study static and evolving statistical characteristics of networks, which ignore attributes on nodes and edges. Recent works [12], [30], [31] have studied the percolation properties of networks with interdependency on each other, which can be utilized to analyze the robustness of networks. In particular, [31] discussed the vulnerability of interdependent spatially embedded networks. Furthermore, epidemics in interdependent networks, which can depict disease transmission, were studied in [32], [33]. Apart from the percolation properties of networks, another network application is data transmission, which is critical in the era of the Internet. The references [34], [35] studied the fractional factor problem on fractional critical deleted graphs, which can help make better decisions for dividing large data packets into small packets and improve the digital communication efficiency. The molecular structures in microbiology or nanomaterials can be expressed as a network, where genes, proteins, cells or atoms are denoted as nodes, and the connected elements are regarded as edges. Then, researchers could calculate the topological indices of molecular structures, which are definitions from graph theory, to test the chemical, physical [3], biological [2], [15], and pharmaceutical [2] properties of various materials. These conclusions have promising application prospects in bioengineering and nanoscience.
There are many other research lines in this field. For example, a framework has been established to study the community structure of time-dependent networks, which handles various types of links (multiplexity) and multiple scales [36]. As a special network structure, ontology has attracted much attention. Researchers presented an efficient partial multidividing ontology algorithm to obtain a semantic matching set of concepts and rank them according to their similarities [37]. It must be noted that all of the studies in this part do not involve features on nodes or edges, which is the largest difference from our work.

III. PROBLEM FORMULATION
The problem (2) is nonlinear, combinatorial, and nonconvex. To make use of advanced numerical optimization techniques, such as the stochastic gradient descent and coordinate descent method, which have been proven to be impressively simple, efficient, and effective in nonconvex problems (e.g. deep learning) [13], we first reformulate the original combinatorial problem (2) as an equivalent 0-1 integer programming problem: min where x k denotes binary variables of nodes in the k th block, supp(x k ) refers to the set of indices of nonzero entries in x k , which represents a subset of nodes in block k, and M k represents all possible subsets of nodes that satisfy the topological constraint on graph G k . The set composed of the supports supp(x k ) refers to a subgraph whose corresponding variables x have values of 1 s and minimize the objective function. To make it easy to solve the problem (5) and take advantage of existing advanced numerical optimization methods, the domain can be further relaxed from x ∈ {0, 1} to x ∈ [0, 1] (i.e., from integer to continuous), and then the problem becomes a numerical optimization problem with graph-structured sparsity constraints that are nonconvex and combinatorial. We detail the formal problem setting in the following.
is the feature matrix defined on nodes, and w i ∈ R P is the feature vector of node i, the node set V has a multiple-block structure and can be decomposed to K disjoint subsets (blocks): | refers to the size of the node set of block V k . After relaxation of the domain from 0, 1 to [0, 1], the subgraph detection problem with multiple blocks can be formulated as the following general graph block-structured optimization problem: where the vector x ∈ R N is partitioned into multiple disjoint blocks is a continuous differentiable and convex function, supp(x k ) denotes the support set of vector x k , M k (G, s k ) denotes all possible subsets of vertices in G that satisfy a certain predefined topological constraint on block k. The functions f (·) and g(·) are defined based on the feature matrix W , and can be used to formulate the cost function and dependencies among blocks respectively.
One example of topological constraints for defining M k (G, s k ) is a connected subgraph, and we can formally define it as follows: where s k is an upper bound of the number of S and defined by users, S ⊆ V k , and G S refers to the induced subgraph by S. The topological constraints can be any graph-structured sparsity constraints on G S , such as connected subgraphs, dense subgraphs, and compact subgraphs [27]. Moreover, we do not restrict all supp(x 1 ), · · · , supp(x K ) satisfying an identical topological constraint. An illustration of the problem formulation for connected subgraph detection in interdependent networks can be found in Fig. 2.

IV. METHODOLOGY A. PRELIMINARIES
The relaxed problem (6) is hard to solve due to its nonconvex topological constraints. Intuitively, we could apply projected gradient descent to find an approximate solution, in which we first 1) optimize the objective function independent of the topological constraints, and then 2) project the intermediate solution to the feasible space that satisfies the topological constraints. The projection can be defined as (4). However, this trivial projection oracle is NP-hard for popular network-structural constraints. For example, for connected subgraphs, P(x) can be reduced from the prize collecting Steiner tree (PCST) problem, which is known to be NP-hard [25]. If we consider an approximation of the projection oracle P(x), the projected gradient descent algorithm becomes a heuristic algorithm with a slow convergence rate [25]. Fortunately, there exist some approximation methods for this NP-hard projection problem that provide the possibility to perform theoretical analysis. In the following, we introduce the approximate projection method that our method depends on. Note that any other approximate projection methods can also be applied to our framework as long as they provide a theoretical guarantee.

1) APPROXIMATE ALGORITHMS FOR THE PROJECTION ORACLE P(x)
There are two major components related to the support of the topological constraint ''supp(x) ∈ M(G, s)'', including head and tail projections [22]. The key idea is that, suppose we can find a good intermediate solution x that does not satisfy these topological constraints, these two types of projections can be used to find good approximations of x in the feasible space defined by M(G, s).
• Tail approximation (T(x)): Find an S ⊆ V such that where c T ≥ 1, s T = 5s, and x S is the restriction of x to indices in S: we have (x S ) i = x i for i ∈ S and (x S ) i = 0 otherwise. When c T = 1, T(x) returns an optimal solution to the problem: returns an approximate solution to this problem with the approximate factor c T .
• Head approximation (H(x)): Find an S ⊆ V such that x S 2 (9) where c H ≤ 1 and s H = 2s. When c H = 1, H(x) returns an optimal solution to the problem: max S ∈M(G,s) x S 2 .
When c H < 1, H(x) returns an approximate solution to this problem with the approximate factor c H . It can be readily proven that T(x) = H(x) = P(x) when c T = c H = 1. Although the head and tail projections are NP-hard when we restrict c T = 1 and c H = 1, these two projections can still be implemented in nearly linear time when approximate solutions with c T > 1 and c H < 1 are allowed. Moreover, the joint utilization of both head and tail projections is critical in the design of approximate algorithms for network topology-related optimization problems [22], [24]- [26]. It is claimed that these two approximations can be generalized to graph-structured sparsity models that are defined on different graph topological constraints, such as density, k-core, radius, cut, or various others, as long as their corresponding head and tail projections are available [26].

2) GRADIENT HARD THRESHOLDING PURSUIT
As mentioned in Section II, the gradient hard thresholding pursuit is an iterative method for the sparsity constrained convex optimization problem, which is defined as x 0 ≤ s In the iterative procedures, a sequence of intermediate s-sparse vectors x 0 , x 1 , . . . from an initial sparse approximation x 0 (typically x 0 = 0) are generated. At the i th iteration, the GHTP can be divided into three steps , this step applies the gradient descent at the point x i−1 with step size η; 2) i = P(supp(x i )), the s coordinates of the vector x i that have the largest magnitude are selected as the support; a vector that minimizes the objective function is returned. The first step of GHTP is a standard gradient descent; the second step gives a direction in which pursuing the minimization will be most effective; and the third step, often referred to as debiasing, has been shown to improve the performance in some algorithms [28]. These steps continue until the algorithm reaches a terminating condition, e.g., on the change in the objective function or the change in the estimated minimum from the previous iteration.
GHTP has proven its performance in optimization for the vanilla s-sparsity model. However this algorithm cannot handle optimization problems with graph block-structured constraints that are described in problem (6) due to the unavailability of the exact solution for the projection oracle in step 2) of GHTP. Meanwhile, the aforementioned head and tail approximations provide us with an effective and efficient method for achieving the approximation for the graph-structured sparsity model. Inspired by these Algorithm 1 Graph Block-Structured Gradient Hard Thresholding Pursuit (GB-GHTP) Input: Input graph G, maximum subgraph size s k on each block, and step size η. Output: The estimated vectorx and the corresponding connected subgraph S.
Head projection 4: end for 11: i ← i + 1 12: until halting condition holds 13: returnx = x i and S = G two algorithms, we generalize the GHTP algorithm to graph block-structured optimization by integrating head and tail approximate projections, and propose a novel framework named as Graph Block-structured Gradient Hard Thresholding Pursuit. We claim that GB-GHTP maintains the good optimization power as the GHTP method, efficiency and effectiveness of head and tail approximate projections, and provides a theoretical guarantee. The key idea of GB-GHTP is to alternatively search for a close-to-optimal solution by solving easier subproblems for graph G with K blocks in each iteration i until convergence. We will describe the GB-GHTP algorithm in detail and its theoretical properties in the rest of this section.

B. GRAPH BLOCK-STRUCTURED GRADIENT HARD THRESHOLDING PURSUIT
The GB-GHTP algorithm generalizes the graident hard thresholding pursuit algorithm to our problem, where multiple graph block-structured constraints are imposed on the variables. We outline the procedures of GB-GHTP in Algorithm 1. This algorithm also follows the scheme described in the gradient hard thresholding pursuit. We decompose it into three steps: 1) (Lines 2 ∼ 5) alternatively use head projection for partial derivative on each block and find a tentative gradient update to obtain a potentially good direction, in which pursuing optimization will be most effective; 2) (Line 6) optimizes the objective function in the union of sets k ; 3) (Lines 7 ∼ 10) alternatively apply tail projection to project the intermediate results to a feasible space, where the final results satisfy some topological constraints. Furthermore, we utilize the block-coordinate descent method with proximal linear update [38], [39] to solve the problem (10). This method has been analyzed and applied to both convex and nonconvex problems [40]- [42] and shows good performance empirically. Block-coordinate descent is a generalization of the alternating minimization method that has been applied to a variety of problems, such as the expectation-maximization (EM) algorithm [43]. In addition, we utilize the proximal linear update that ensures the convergence of the algorithm on convex problems with convex constraints ''supp(x k ) ⊆ k ''. The proximal linear update in our scenario is defined by (11) where α k,t serves as a step size and can be set as the reciprocal of the Lipschiz constant of ∇ x k F(x k,t ,x =k,t ), andx k,t is an extrapolated point that helps accelerate the convergence of the proximal point update scheme: where ω t ≥ 0 is an extrapolation weight. [44] suggests setting hyperparameters ω t+1 = (ρ t+1 −1)/ρ t+1 , with ρ 0 = 1, ω 0 = 0, and ρ t+1 = (1 + 1 + 4ρ 2 t )/2, to speed up the algorithm. The objective function in problem (11) is simply the second-order Taylor approximation of function F(·) with the Hessian matrix replaced by the identity matrix. We can easily derive and implement the closed-form solution of the objective function in problem (11) and then project it to the feasible space, which is convex. Mathematically, the solution of problem (11) is: where P(b) = argmin Note that the feasible set k is convex. The overall block-coordinate gradient projection method on a convex function with convex constraints (Algorithm 2) has a sublinear rate of convergence [39], [40].

C. THEORETICAL ANALYSIS
In this section, we analyze the theoretical properties of GB-GHTP. To guarantee the convergence of our framework and the accuracy of estimates, we require the objective function F(·) satisfying the weak restricted strong concvexity (WRSC) condition, which is a variant of the restricted strong convexity (RSC) [28] and defined as Definition 1 (WeakRestrictedStrongConvexity(WRSC)): For some ξ > 0 and 0 < δ < 1, a function F(·) has the weak restricted strong convexity if for any x, y ∈ R N and S ∈ M with supp(x) ∪ supp(y) ⊆ S, the following inequality holds: where x = (x 1 , . . . , x K ), y = (y 1 , . . . , y K ), x k , y k ∈ R N k , k = 1, . . . , K , topological constraint M can be expressed as M(G, s) =  Input: {G 1 , . . . , G K } Output: x 1,t , · · · , x K ,t Initialization: t = 0, = 10 −3 , ρ 0 = 1., ω 0 = 0. 1: repeat 2: Project x k,t+1 to feasible space by setting entries of x k,t+1 to zero if the index of entry is not in set k . 6: 9: Let We can set different s k values for the k th block. In our applications, s 1 = · · · = s k = s , i.e., we use the same upper bound of subgraph size for all blocks.
Theorem 1: Given a graph block-structured constraint with K blocks, M(G, s) = K k=1 M k (G, s k ) and an objective function F : R N → R satisfying the (ξ, δ, M(G, 5s))-WRSC condition. If η > 0, then for any x ∈ R N that satisfies some topological constraints, i.e., supp(x) ∈ M(G, s), the following inequality holds in the iterations of GB-GHTP (Algorithm 1) where Proof: The proof is provided in Appendix B

Remark 2: 1) The convergence of the GB-GHTP algorithm is controlled by the shrinkage rate α < 1, which is satisfied if and only if c 2
H > 1 − 1/(1 + c T ) 2 when δ is small. As proven in [25], the approximation factor c H of any given head approximation algorithm can be boosted to any arbitrary constant close to 1, such that the above condition is satisfied. 2) According to the WRSC condition, the function is Lipshcitz continuous, which further implies that (∇F(x)) I 2 must be bounded. In summary, Theorem 1 indicates the convergence of our algorithms.
Theorem 2: Suppose that x ∈ R N such that supp(x) ∈ M(G, s), and F : R N → R is an objective function satisfying the (ξ, δ, M(G, 8s))-WRSC condition. If α < 1, GB-GHTP returns an x such that supp(x) ∈ M(G, 5s) and x − x 2 ≤ c (∇F(x)) I 2 , where c = 1 + β 1−α is a fixed constant. In addition, GB-GHTP runs in time where T is the time of solving the subproblem in Line 6 of the GB-GHTP. Furthermore, if T scales linearly with N and |E|, then GB-GHTP scales nearly linearly with the network size N and |E|.
Proof: The proof is provided in Appendix C

Remark 3: We can run head and tail projections on blocks in parallel, which reduces the time cost of each iteration to (T + |E | log 3 N ), |E | log 3 N
= max k=1,...,K |E k | log 3 N k . As mentioned in the previous subsection, which was proved in [40], the sublinear convergence rate O(1/t) can be established for Algorithm 2. Hence, the iteration number of Algorithm 2 is O( 1/ ) when error bound is . Then, it concludes that the time complexity of Algorithm 2 (i.e., T ) is O ( 1/ N ), which further implies the GB-GHTP algorithm scales nearly linearly with N and |E|.

V. EXAMPLE APPLICATIONS
In this section, we show three applications of our framework: 1) evolving anomalous subgraph detection in dynamic networks, 2) anomalous subgraph detection in networks of networks, and 3) connected dense subgraph detection in dual networks. To apply our framework in these applications, we need to formulate these applications to the context of the graph block-structured optimization problem. Thus, we leverage the elevated mean scan (EMS) statistic to discover subgraphs whose features on vertices are anomalous/significant compared to those background vertices. The EMS statistic is used widely for detecting signals among node-level numerical features on graphs [17], [45] and is defined as where x ∈ {0, 1} N , w denotes the attribute vector of all nodes (here, we assume that each node only has one attribute). w i ∈ R refers to the univariable feature of node i. Empirically, the EMS statistic can be maximized to precisely detect significant subset of nodes in a network. To embed the EMS statistic into our optimization framework, relaxation on its input domain from {0, 1} to continuous space [0, 1] can be introduced. Then we can detect subgraphs by minimizing the negative relaxed EMS statistic, which is defined as Note that the negative relaxed EMS statistic satisfies the RSC condition, which implies the WRSC condition [26], [28]. Then Theorem 1 is established in our applications, i.e., the deployment of the negative relaxed EMS statistic guarantees the convergence of our framework.

A. EVOLVING ANOMALOUS SUBGRAPH DETECTION IN DYNAMIC NETWORKS
Dynamic networks arise in many applications and it is an important and challenging problem to detect subgraphs in dynamic networks, which is usually NP-hard [4]. Generally, attributes on nodes of a network evolve with time, and these evolving phenomena can be characterized by three phrases: emerging, spreading, and receding over a period of time.
The phrases usually represent an evolving event on networks (e.g., Fig. 1a). Assume a dynamic network spreads over K timestamps, i.e., we have K attributed networks (G 1 , . . . , G K ), k = 1, . . . , K . The evolving subgraphs refer to a consecutive sequence of subgraphs ( , which are connected (denoted as local connectivity constraint) at each timestamp and two subgraphs at consecutive timestamps share some overlap vertices (denoted as temporal consistency constraint) [4]. Then, the evolving anomalous subgraph detection in dynamic networks problem can be formulated as a nonconvex optimization problem with a convex objective function and block-structured constraints: where the first term is the summation of the negative relaxed EMS, the second term is a soft constraint to ensure the temporal consistency of detected subgraphs between timestamps k and k − 1, and λ > 0 is a tradeoff parameter. The final evolving subgraphs can be obtained from the support of x k , i.e., S k = supp(x k ).

B. ANOMALOUS SUBGRAPH DETECTION IN NETWORK OF NETWORKS
Our framework can also be applied to detect anomalous subgraphs in networks. A large-scale static network with many communities can be viewed as one instance of a network of networks (trivial interdependent networks), where each community is a small block of the network. When an event is widespread in such large networks, it becomes very challenging and time consuming to apply effective models to detect the significant subgraphs. For example, one application is rumor detection and tracking in social networks. It is interesting and useful to identify a connected subgraph that depicts how a rumor spreads across different communities in a social network. The most traditional approach is to partition a large network into several small blocks and then process them individually and independently. However, this approach affects the detection performance if those blocks are highly interdependent. By encoding the dependencies in our proposed framework, we can detect subgraphs in each individual partition of networks more efficiently without sacrificing performance.
Specifically, our proposed framework provides a feasible solution for this scenario where node dependencies among a large-scale network cannot be neglected. We can also leverage the relaxed EMS to detect signals on vertices. Then the subgraph detection problem in the network of networks can be formulated as follows: where the first term is the summation of the negative relaxed EMS, the second term is a soft constraint on bridge vertices of two networks (blocks) to ensure dependencies: if vertex i and vertex j are connected but in two different networks (i.e., edge (i, j) is a graph cut), e ij = 1; otherwise, e ij = 0.
x i and x j are i th and j th entries of x, and λ > 0 is a tradeoff parameter.

C. CONNECTED DENSE SUBGRAPH DETECTION IN DUAL NETWORKS
In real-life applications, there exist many dual networks, in which one network represents the physical world and the other network represents the conceptual world [46]. A typical example of dual networks is citation networks. In the research community, there are many collaborations between different researchers and some of them share similar research interests. Two different networks can be constructed in this context. One network models coauthor relationships, in which vertices are authors and edges represent that two authors coauthored one paper, i.e., physical interactions between authors. Another network models the research interest similarity between authors, in which an edge denotes that two authors have similar research interests (an edge can be constructed by measuring the similarity of publications of two researchers). The research interest network is conceptual. It is interesting to find a group of active researchers in which there exist collaborations between those researchers and their research interests are similar. Three goals are considered in this application. First, authors should be active, which can be detected by maximizing some metric defined on authors' features. Second, authors should have direct coauthorship or indirect coauthorship via other authors. This goal can be achieved by imposing a connected constraint on the coauthorship network. Last, we expect those authors to share similar research interests, which can be reflected by dense connections between authors on the research interest network. Here, we give a formal formulation of dual networks. Given two graphs G 1 (V, E 1 ) and G 2 (V, E 2 ) that have the same node set, i.e., V but have different edge sets, i.e., E 1 and E 2 .
For brevity, we also use G(V, E 1 , E 2 ) to represent the dual networks. The subgraphs induced by vertex set S ⊆ V in the physical network and conceptual network are denoted as G 1 S and G 2 S , respectively. Therefore, the connected dense subgraph detection problem in dual networks can be defined as follows Definition 2: Given dual networks G(V, E 1 , E 2 ), the connected dense subgraph detection in dual networks refers to finding a set of vertices such that their induced subgraphs satisfy: 1) G 1 S is connected; 2) the density of G 2 S is larger than a threshold α, denoted as ρ(G 2 S ) ≥ α;

3) the function defined on the network features is maximized.
Similar to the previous two applications, we use the EMS statistic to detect significant nodes. We can formulate this problem with a mathematical format and express it as an optimization problem with constraints on two networks min x,y − (w x) 2 x 1 +

end for
12: 16: where x and y are associated coefficients with the physical network and conceptual network respectively. The first and second terms are the negative relaxed EMS statistics of two networks, and the third term is a penalty to make the detected subgraphs in both networks as overlap as possible. ρ(G 2 (supp(y))) denotes the density of the subgraph induced by supp(y) in the conceptual network. To reduce the complexity of the problem, we assume that the corresponding nodes on both networks share the same features, i.e., the same w. In addition, we devise a parallel version of our framework (Algorithm 3) to speed up the computation by integrating the APPROX algorithm, a randomized coordinate descent method proposed in [47]. In Algorithm 3, the head projections (Line 2), tail projections (Line 16), and steps from Line 9 to Line 11 are parallelizable. The block from Line 4 to Line 15 can be run in parallel and boosts the convergence rate of solving problem (10) from O(1/t) to O(1/t 2 ), and the theoretical proof and more details are provided in [47]. The parallel part starts from x 0 ∈ R N (Line 4) and generates three vector sequences denoted as x t , y t , z t ≥ 0. In Line 6, y t is defined as a convex combination of x t and z t . In Line 7, a set of random blocks S t are sampled and then Lines 9-12 are performed in parallel.

VI. EXPERIMENTS
We evaluate our GB-GHTP framework in the aforementioned three practical applications on both synthetic and real datasets. The details of the experiments and the discussion of the results are reported in this section and are organized by applications. For each application, we perform experiments on both synthetic and real-world datasets. These real-world datasets are 1) water pollution data, 2) traffic data, 3) biological data, and 4) citation data, which cover many aspects of daily life and demonstrate that our framework has extensive applications.

A. EVOLVING ANOMALOUS SUBGRAPH DETECTION IN DYNAMIC NETWORKS 1) SYNTHETIC DATASETS
We construct network structures via the Barabási-Albert preferential attachment model [48], which is used to generate scale-free networks. In each synthetic instance, 7 networks with the same structure are generated and represent observations of a network at different timestamps. To simulate the dynamic process of attributes on a graph, we generate subgraphs on all networks one by one with random walks. The subgraphs of two consecutive timestamps must share 50% of node overlap, which characterizes the temporal dependency of node attributes at different timestamps. The univariate feature of each node in and not in the subgraph is generated in normal distributions N(µ, 1) and N(0, 1) respectively. Fifty instances are generated for each setting of µ = {3, 4, 5}. When µ is small, the signals on nodes have more noise, and it is more difficult to distinguish background nodes and anomalous nodes based on univariate features.
2) REAL-WORLD DATASETS 1) Water Pollution Dataset: a real-world network provided in the Battle of the Water Sensor Networks (BWSN) [49]. Among the nodes of the network, there were 25 nodes with chemical contaminant plumes, which are distributed in the network and produced a contaminated subarea. Each of nodes was associated with a sensor, which reports 1 when the node is polluted; otherwise, 0. The spread of pollution was monitored by these sensors for a period of 8 hours and reported for each hour. 2) Washington D.C. Road Traffic Dataset: a road traffic dataset collected from June 1, 2013 to March 31, 2014 in Washington D.C. [50]. We use the data from 6AM to 10PM with a time resolution of 60 minutes (17 timestamps).
3) Beijing Road Traffic Dataset: the dataset contains the real-time traffic conditions of four days. We use the data between 5PM and 7PM of the first day with a time resolution of 10 minutes (a totally 12 timestamps) [26]. For both traffic datasets, the difference between the reference speed VOLUME 8, 2020  and current speed is used as the node feature, and the true congested roads are provided. The statistics of all datasets are provided in Table 1. The column ''Nodes'' gives the number of nodes per graph in the dataset, and the column ''Edges'' describes the edge size. The columns ''Timestamps'' and ''Resolution'' describe the number of timestamps and the time resolution of each dataset.

3) COMPARISON METHODS
Our method is compared with two state-of-the-art baselines: Meden [21] and NetSpot [14]. Both algorithms prune the subinterval space to speed up and are proposed to detect significantly anomalous areas in dynamic networks.

4) PERFORMANCE METRICS
We use precision, recall, and F-measure to evaluate the performance of all approaches. A higher F-measure indicates better overall quality of the detected subgraph. For synthetic datasets, the reported results of each setting are averaged over 50 instances. Those metrics are defined as follows where S A refers to the node subset returned by an algorithm and S B denotes the ground set of nodes. Precision quantifies the accuracy of detected nodes that actually belong to the true subgraph. Recall reflects the coverage level of detected nodes for the true subgraph. The F-measure provides a single score that balances both the concerns of precision and recall in one number and is able to measure how well a model is.

5) RESULTS
The results of all methods on synthetic and traffic datasets are listed in Table 2 and Table 3. As shown in these two tables, our method outperforms both baselines on synthetic data and real-world data. Our approach has comparable precision to other baselines and outperforms them substantially on all datasets with the aspect of F-measure. Both baselines pruning the search space are heuristic, which cannot guarantee their performance and makes the results worse than ours. We further describe our settings on traffic datasets to make our results more intuitive. In traffic datasets, a road segment corresponds to an edge in the network, and intersections are nodes in the networks. The task is to find the most congested area in the road network, i.e., which road segments and intersections are congested. Here, we use the difference between the average speed and reference speed of a road segment as a feature on an edge. Then, the feature on a node is obtained by taking average of features of the edges connected to the node. Thus, a road segment with smooth traffic has a small speed difference. Cars in a blocked road segment have a low average speed, which makes the speed difference large. Thus, we obtain the traffic condition of a road network by monitoring the speed difference. Specifically, we define an objective function based on the speed difference and find the most congested area by maximizing the score. In such a scenario, a higher precision means that more nodes detected by an algorithm belong to the congested area, and a higher recall means that more nodes in the congested area are found in the results.

6) ROBUSTNESS ANALYSIS
We also analyze the robustness of these methods on the water pollution dataset. Noise with different levels is injected into the water pollution dataset. Specifically, K ∈ {2, 4, 6, 8, 10} percent of nodes are selected randomly and their sensor values are flipped (i.e., 0 to 1, or 1 to 0). The goal of the task is to detect connected subgraphs that correspond to the contaminated subarea at each timestamp. The results of robustness validation are reported in Fig. 3. The precision of Meden is lower than ours. NetSpot has a slightly higher precision performance. However, it decreases drastically on recall when noise increases. Our method has a more stable performance on all metrics, which indicates its better robustness.

7) PARAMETER TUNING
We use strategies recommended by authors of those approaches to tune parameters. For NetSpot, edges are weighted by comparing their p-values to a significance level threshold µ (0.01 recommended by the authors). To fit the input to NetSpot, weights of nodes can be easily averaged to obtain edge weights, which are used in the original paper [14]. It can be shown that problems with node weights and problems with edge weights are equivalent [21]. Our framework in this application has two parameters, 1) sparseness parameter s (an upper bound of the subgraph size on each block) and 2) tradeoff parameter λ. The partial data are  As in the previous application, we use the Barabási-Albert model to generate multiple networks with different network sizes and then apply the random walk algorithm to select 10% of nodes as the ground-truth subgraphs. The attributes of nodes in true subgraphs conform to a normal distribution N(5, 1), while the attributes of the background nodes follow normal distribution N(0, 1). Two synthetic datasets are generated to analyze the scalability in terms of nodes and edges, which are denoted as SynNode and SynEdge.
2) REAL-WORLD DATASETS 1) Beijing Road Traffic Dataset: we use static network data per timestamp from 5PM to 7PM in the previous experiment.
2) Wikivote Dataset 2 : a dataset contains all administrator elections and vote history data, which is extracted from the Wikipedia page edit history untill January 3, 2008.
3) CondMat Dataset 3 : the collaboration network is from the e-print website arXiv and covers collaborations between researchers who submitted papers to condense matter category. For the Wikivote and CondMat datasets, true subgraphs with 1,000 nodes are generated by a random walk, and the node attributes in true subgraphs follow the distribution N(5, 1); otherwise, N(0, 1). 4) DBLP Dataset 4 : the collaboration graph of authors from DBLP computer science bibliography. An edge between two authors represents at least one collaboration, and the attribute of a node represents the number of publications published by the author. The dataset covers records ranging in time from 1995 to 2005. We apply random walk to obtain subgraphs with 20,000 nodes and inject anomalies as our true subgraph, as suggested by [28]. Table 4 gives the statistics of all datasets in this application. For the SynNode dataset, we test graphs with different sizes, which have nodes from 1,000 to 10,000 and the edges are 3 times of the nodes. In the SynEdge dataset, we keep nodes unchanged, and test different sizes of edges from 300,000 to 1,000,000. The ''Blocks'' columns tell us how many subnetworks in a network of networks. The ''Processors'' column gives the number of processors we used in our parallel algorithm. We can see that our algorithm can be scaled up to large networks with hundreds of thousands of nodes and over one million edges.

3) COMPARISON METHODS
We use three baselines to validate the performance of our framework: 1) AdditiveGraphScan [20], 2) EventTree [7], and 3) LinearTimeSubsetScan (LTSS) [19]. AdditiveGraph-Scan uses the expectation-based binomial (EBB) statistic to detect anomalous subsets in graphs automatically. EventTree reformulates the connected subgraph detection problem in attributed networks as a prize collection Steiner tree (PCST) problem and applies Goemans-Williamson algorithm to solve it. The LTSS method uses the ''linear time subset scanning'' property of a function (Kulldorff's spatial scan statistic and extensions) to scan subsets and detect events.

4) PERFORMANCE METRICS
We use the same metrics (precision, recall and F-measure) to evaluate the performance of all methods. Additionally, runtimes of different methods are reported here to validate the efficiency of our framework.

5) RESULTS
Experimental results on all real-world datasets are listed in Table 5. From the comparison of different methods, we can   Fig. (b) shows that our framework can be easily scaled up to 1, 000, 000 edges, where node size |V| = 100, 000; by contrast, AdditiveGraphScan runs over 10, 000 seconds on all cases. see that our original (serial version) and parallelized methods both outperform all baselines in terms of F-measure except on the CondMat dataset. Although our original method is not efficient enough, it can scale to networks with hundreds of thousands of nodes and over one million edges as heuristic methods after parallelization. The reason why GB-GHTP cannot excel EventTree and LTSS in runtime is that our framework is an iterative method and requires more iterations. Despite more iterations to run, our framework provides a theoretic performance guarantee by compromising a small amount of runtime. The result of AdditiveGraphScan on the DBLP dataset is not reported since the method takes over one day to finish one run and thus is infeasible to tune its parameters.

6) SCALABILITY ANALYSIS
To analyze the scalability of different methods with respect to the number of nodes and edges, we evaluate these methods on synthetic datasets with different sizes. To run our method, we partition the static network into multiple blocks with METIS [51]. The runtimes of our framework compared with other baselines are reported in Fig. 4. In AdditiveGraphScan, a shortest path algorithm is used, which makes it not scalable to very large datasets. Since our method is an iterative algorithm, our serial method takes more time than some heuristic methods. However after it is parallelized, the runtime of our method can be reduced sharply (at least four times faster than the original serial version). Meanwhile, our framework can obtain comparable performance as those algorithms devised for specific applications. It is believed that our method could be more scalable if we utilize the computing resources rationally based on network properties.

7) PARAMETER TUNING
For all parameters used in AdditiveGraphScan, EventTree, and LTSS, we follow the same setting as [4], [19], [24]. To set parameters in our method, the same strategy as the previous application is used. The sparseness parameter in this experiment is set to be half of the size of a block. While we set the trade-off parameter to be 0.0005 in the SynNode, Wikivote and DBLP datasets, 0.001 in the SynEdge and CondMat datasets, and 0.0001 in the Beijing dataset. We run parallelized GB-GHTP on servers with multiple processors to speed up the algorithm, and more details are provided in Table 4, in which columns ''Blocks'' and ''Processors'' denote the number of subnetworks and processors used in our experiments.

C. CONNECTED DENSE SUBGRAPH DETECTION IN DUAL NETWORKS 1) SYNTHETIC DATASETS
We construct synthetic dual networks based on the Barabási-Albert model (shorthand for SynDual). Two networks with different densities are generated. In the synthetic dual networks, the edges to attach from a new node to existing nodes are 3 and 10 respectively. The subset of true vertices is selected with a biased random walk algorithm, which is applied on the first network. To simulate the dense area in the second network, the random walk algorithm considers the degree distribution of a node in the second network and accesses neighbor nodes that have a higher degree with higher probabilities. This biased strategy makes the generated subgraphs connected in the first network and dense in the second network. Then the univariate feature values of background nodes and true nodes are randomly generated in N(0, 1) and N(3, 1) distributions. The statistics about the generated dual networks can be seen in Table 6, which are averaged on 10 instances.

1) Homo Dataset:
We consider different types of genetic interactions for organisms in the Biological General Respository for Interaction Datasets (BioGRID, thebiogrid.org), a public database that archives and disseminates genetic and protein interaction data from humans and model organisms [52]. The Homo dataset concerns Homo sapiens, and this multiplex network makes use of 7 layers 5 . We extracted 2 of those layers as our dual networks, which represent colocalization and direct interactions among genes of Homo sapiens.
2) DBLP Datasets: We construct two dual networks from the DBLP dataset [53], one for the data mining research community and one for the database reseach community. For the data mining community, papers published in 5 data mining conferences are included (KDD, ICDM, SDM, PKDD, and CIKM) to construct the dual networks. The dataset contains 4,102 authors and 7,194 papers and is denoted as DBLP-DM. The first network is the collaboration network, in which authors are the nodes and an edge represents that two authors have coauthored a paper. The second network is the research interest similarity network among authors, which is generated based on the similarity of the terms in the titles of their papers. We use the shrunk Pearson correlation coefficient to compute the research similarity between authors [46]. The dual networks for the database community are similarly constructed based on papers published in 3 database conferences: SIGMOD, VLDB, and ICDE. This dataset is denoted as DBLP-DB, in which 4,402 authors and 6,087 papers are included. Note that for both datasets, only the top 30,000 positive correlations are introduced into the second network as edges. The aforementioned prepossessing for the DBLP dataset is the same as in reference [46]. The publications for an author are counted and used as univariate features of the author. Table 6 summarizes the statistics of all datasets.

3) COMPARISON METHODS
Two baselines are compared with our method: 1) DCS [46] and 2) EventAllPair+ [7]. DCS is designed for finding the densest connected subgraph in dual networks. However this method does not consider attributes on nodes at all. EventAllPairs+ algorithm finds a subset of vertices that have large total weights and are sufficiently compact. It considers attributes but only handles a single network and cannot guarantee connected and dense constraints.

4) RESULTS
We report performance of different methods on the synthetic and Homo datasets in Table 7. As can be seen, since our method considers the attributes on nodes and constraints imposed on dual networks, it has much better results on all metrics than other methods. Although the DCS can detect subgraphs that are denser than ours, it only reflects structural information while attributes on nodes are sometimes more important and help find more meaningful patterns.

5) CASE STUDIES
In addition to measuring the performance of various methods, we are also interested in the ability of our framework to infer meaningful patterns. Thus, we analyze some subgraphs detected by our method and DCS baseline on two real datasets DBLP-DM and DBLP-DB. The subgraphs detected by our method in the DBLP-DM dataset are shown in Fig. 5.
As you can see, in the collaboration network on the left, the detected subgraph is connected. Most importantly, our method can find some coauthor relationships between some influential researchers in data mining. When drawing the figures, we use a circle to represent an author and the radius of a circle is decided by the number of publications of the associated author. Then we can construct a collaboration network among those most influential researchers in the data mining field via our method. As shown in Fig. 5a, Jiawei Han and Phillip S. Yu are the two most influential researchers in the data mining community and they both published many papers and had a large number of collaborations with other researchers. It can also be seen in Fig. 5b, the more influential a researcher is, the more research interest similarity they share with other researchers, which is reflected by the phenomenon that a node with larger radius has denser connections with other nodes in the network. The subgraphs detected by our method on DBLP-DB are not drawn since there are many more nodes in the subgraphs and are difficult to visualize. Instead we list some statistics about our found subgraphs in Table 8 and describe some interesting results.
In the DBLP-DM dataset, although the density of the subgraph in the research interest network is higher than ours, the subgraph detected by the DCS method includes many authors whose publications are fewer than 5. From the publication distributions of authors in the detected subgraphs, we can see that all of the authors (Fig. 6a) discovered by our method published more than 10 papers in 5 data mining conferences, while more than 50% of the authors (Fig. 6c) discovered by the DCS method published fewer than 5 papers and nearly 30% of the authors published only 1 paper. The average publications of a researcher in the subgraph by DCS are 8.13, while ours are 25.03. Our method finds collaborations between more influential researchers. Obviously, it is more likely to track a hot research topic from more influential researchers rather than those who only published 1 paper. We observe a similar phenomenon from the results on the DBLP-DB dataset by comparing Fig. 6b with Fig. 6d. The average number of publications of researchers in the    subgraph detected by our method is 22.71. The average number of publications of researchers in the subgraph by DCS is 10.63. These two cases prove that it is not enough to detect subgraphs that are densest and connected and it matters more for us to consider attributes on nodes, rule out those nodes that are not that important, and further detect some more significant patterns.
The results of EventAllPairs+ are not listed because the algorithm cannot guarantee connectivity in the collaboration network and are unexplainable for their implications.

6) METRICS AND PARAMETER TUNING
We use the same strategies as the previous two applications to evaluate the performance of the methods and decide the parameters in our method. The parameters used in DCS are recommended by the authors of the original paper (γ = 1.5). The parameter λ used in EventAllPair+ is selected from the settings on the training dataset, which evaluates 1,501 and 901 on the SynDual and Homo datasets, respectively As mentioned before, the parameters on DBLP-DM and DBLP-DB are not listed because the results of the experiments are unexplainable. The sparseness parameter s and tradeoff parameter λ in GB-GHTP are set to be 50/1.0, 250/1.0, 300/10.0 and 300/10.0 on the SynDual, Homo, DBLP-DM and DBLP-DB datasets.
Implementation All experiments were conducted on 64-bit machines with Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz and 251 GB memory.

VII. CONCLUSIONS AND FUTURE WORK
This paper presents a graph block-structured optimization based framework, to detect subgraphs in attributed interdependent networks, which runs in nearly linear time with respect to the network size and provides a theoretical guarantee. A parallel version of the framework is proposed to improve its scalability. We evaluate our framework on three applications. The results on both synthetic and real-world datasets indicate that our framework enjoys better effectiveness and efficiency than other baselines. Additionally, our framework is not designed for a specific problem and can be applied to more scenarios. For future work, we will deploy our framework to more applications and networks with multiple attributes. It is also worth exploring more powerful objective functions to capture interesting patterns on attributed networks. the total number of iterations is log x 2 (∇F(x)) I 2 / log 1 α , the overall time follows Theorem 2.