Schatten Graph Neural Networks

Graph Neural Networks (GNNs) have been intensively studied in recent years because of their promising performance on graph-structured data and have provided assistance in many fields. Revisiting recent work on graph neural networks, we found that imposing graph smoothing via the Frobenius norm has proven effective in GNN architectures from the standpoint of graph signal processing. In this paper, we aim to model the graph smoothness of graph neural networks using a Schatten $p$-norm with $p$ in the interval $[1,2)$ to characterize smoothness, and we propose a novel architecture called Schatten graph neural networks. This architecture stems from a primal-dual solution scheme for a graph signal denoising problem. The subproblems involving the Schatten $p$-norm are difficult to solve. We propose a fixed-point iteration scheme and prove, with a rigorous mathematical analysis, that it converges at a linear rate. Extensive experiments demonstrate the effectiveness of the proposed architecture and its robustness to graph adversarial attacks.


I. INTRODUCTION
Although deep neural networks have seen great development in recent years, they are unable to handle irregular data, such as that found in social networks [1], protein-protein networks [2] and traffic networks [3]. Graph neural networks (GNNs) [4] have become one of the most popular tools for dealing with this kind of data and can learn powerful representations from graph-structured data. GNNs can be used in many tasks, including node classification [5], link prediction [6], graph classification [7], recommendation systems [8] and many others [43]–[46].
Based on the mode of local computation, GNNs can be roughly divided into two classes: graph convolution networks [5] and message passing networks [9]. Graph convolution stems from the convolution in deep neural networks, which operates on regular data such as images, text and videos. Graph convolution can deal with an irregular graph and capture local information to generate better representations. Mathematically, the $k$-th graph convolutional layer is $X^{(k)} = \phi_k(L X^{(k-1)} W^{(k)})$, where $X^{(k-1)}$ is the $(k-1)$-th layer representation, $W^{(k)}$ is the feature transformation matrix and $\phi_k$ is the activation function. Graph convolutional network (GCN) [5] and graph attention network (GAT) [10] are two classical graph convolution networks. GCN adopts spectral convolution with a simple approximation of a Chebyshev polynomial. GAT extends the attention mechanism from deep neural networks to graph neural networks. Message-passing-based GNNs follow the classical message passing algorithms and represent the shared functions by means of neural networks. Mathematically, popular message-passing networks [9] can be unified by three steps: the $k$-step message computation $M^{(k)} = g(C_{\mathrm{out}}^T X^{(k-1)})$, the $k$-step message aggregation $Y^{(k)} = C_{\mathrm{in}} M^{(k)}$, and the $k$-step node state update.
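For concreteness, the following is a minimal NumPy sketch of one graph convolutional layer in this form; the helper name `gcn_layer` and the ReLU activation are illustrative choices of ours, not taken from any particular implementation.

```python
import numpy as np

def gcn_layer(L, X, W, phi=lambda T: np.maximum(T, 0.0)):
    """One graph convolutional layer: X_out = phi(L @ X @ W).

    L : (n, n) propagation matrix, X : (n, d_in) node features,
    W : (d_in, d_out) feature transformation, phi : activation (ReLU here).
    """
    return phi(L @ X @ W)

# Toy usage: 3 nodes, 4 input features, 2 output features.
rng = np.random.default_rng(0)
L = np.eye(3)                      # placeholder propagation matrix
X0 = rng.standard_normal((3, 4))
W1 = rng.standard_normal((4, 2))
X1 = gcn_layer(L, X0, W1)
```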
We examine message-passing networks in this paper. Recently, it was shown in [11] that many such networks have an affinity with graph signal denoising problems with $\ell_2$ graph smoothness, which include a second-order approximation term and a Laplacian regularization term. As a matter of fact, these networks can be deduced from the graph signal denoising problem by using different optimization schemes, for example, a gradient descent algorithm with different step sizes. ElasticGNN [12] attempts to improve the smoothness by combining $\ell_1$- and $\ell_2$-based graph smoothing. Further, it considers $\ell_{21}$- and $\ell_2$-based graph smoothing schemes.
In this paper, we propose to establish a graph signal denoising problem with the Schatten $p$-norm for $p$ in the interval $(0, 2)$. It is well known that the rank function is the limit of the $p$-th power of the Schatten $p$-norm as $p \to 0$. When $p$ approaches zero, the low-rank property emerges. In practice, the nuclear norm (i.e., $p = 1$) is often used as a convex surrogate of the rank function for the convenience of optimization. When $p$ lies in the interval $[1,2)$, the $p$-th power of the Schatten $p$-norm is finite and convex. In this case, sparsity occurs when we take an appropriate regularization coefficient. Note that the Schatten $p$-norm with $p$ in the interval $[1,2)$ dominates the $\ell_2$ norm. As a result, the proposed graph signal denoising model includes smoothness between the $\ell_2$ norm and the Schatten $p$-norm with $p \in [1, 2)$. When $p$ is greater than 2, the graph signal denoising model just characterizes $\ell_2$ smoothness because the $\ell_2$ norm is an upper bound of the Schatten $p$-norm for $p \geq 2$. This is why we restrict the value of $p$ to the interval $[1,2)$.
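As a quick numerical illustration of these norm relations (a sketch we add here, not code from the paper), the Schatten $p$-norm can be computed as the $\ell_p$ norm of the singular values, and the ordering $\|Z\|_{S_2} \leq \|Z\|_{S_p} \leq \|Z\|_{S_1}$ for $p \in [1, 2)$ can be checked directly:

```python
import numpy as np

def schatten_norm(Z, p):
    """Schatten p-norm: the l_p norm of the singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    return float((s ** p).sum() ** (1.0 / p))

Z = np.random.default_rng(0).standard_normal((6, 4))
s2, s15, s1 = schatten_norm(Z, 2), schatten_norm(Z, 1.5), schatten_norm(Z, 1)
assert s2 <= s15 <= s1    # Frobenius <= Schatten-1.5 <= nuclear
print(s2, s15, s1)
```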
We also discuss the optimization of the graph signal denoising problem with a Schatten $p$-norm for $p$ in the interval $[1,2)$. Note that when $p \geq 1$, the $p$-th power of the Schatten $p$-norm is convex. With this property, the objective function in the graph signal denoising problem is the composition of two convex functions. We choose the modified Proximal Alternating Predictor-Corrector (PAPC) optimization scheme [13], which belongs to the family of primal-dual solution schemes, to find a solution. With a special choice of step sizes in PAPC, we obtain a novel architecture of graph neural networks.
In the PAPC scheme, the difficult subproblem is the proximal operator of the Fenchel conjugate of a convex function (i.e., the scaled $p$-th power of the Schatten $p$-norm). By Moreau decomposition, we only need to solve the proximal operator of the scaled $p$-th power of the Schatten $p$-norm itself. We propose an efficient fixed-point iteration scheme. Theoretical analysis shows that this scheme has a linear convergence rate $O(\rho^k)$ for some $\rho \in (0, 1)$.
The robustness under graph adversarial attacks is also examined in our experiments. By setting different attack ratios, the performance is recorded and reported. This reveals the effective robustness of the proposed graph neural networks.
In summary, the contributions of this paper are as follows.
(1) We propose a novel architecture of graph neural networks, called Schatten graph neural networks, from the standpoint of graph signal processing, in which the Schatten $p$-norm is employed to characterize smoothness. When $p \in (1, 2)$, it gives rise to a mixture of the low-rank and $\ell_2$-smooth properties.
(2) The convergence of the proposed message passing scheme is theoretically proved. In particular, this scheme contains a subproblem of solving a proximal operator with respect to the Schatten $p$-norm. We develop a fixed-point iteration algorithm and prove that it enjoys a linear convergence rate.
The remainder of this paper is organized as follows. In Section II, related works are briefly reviewed. In Section III, we give the problem formulation and the notations used in later sections. In Section IV, we set up the methodology. In Section V, we propose the graph neural network architecture. In Section VI, convergence analysis is performed. Complexity analysis is provided in Section VII. Extensive experiments are conducted in Section VIII. Finally, we conclude this paper in Section IX.

II. RELATED WORKS
A. GRAPH SIGNAL PROCESSING
The popular architectures of graph neural networks, including GCN [5] and GAT [10], can be implicitly obtained by using a gradient descent algorithm with a particular step size to solve the following graph signal denoising problem:
$$\min_X \; \|X - X_{\mathrm{input}}\|_F^2 + \lambda\, \mathrm{tr}(X^T L X),$$
where $X_{\mathrm{input}}$ is the input signal, $A \in \mathbb{R}^{n \times n}$ is the symmetric adjacency matrix with entries $A_{ij}$, $x[i,:]$ is the $i$-th row instance of $X$ and $\lambda$ is a positive trade-off parameter. The last term indicates global $\ell_2$ smoothness. Recently, Elastic GNNs [12] were proposed to improve the smoothness; they include $\ell_1$- and $\ell_{21}$-level smoothness, which induces better sparsity of the graph signals. The $\ell_1$ smoothness is characterized by $\|\Delta X\|_1$, where $\Delta \in \{-1, 0, 1\}^{m \times n}$ is the oriented incidence matrix, $m$ is equal to $|E|$ and each row is of the form $(0, 0, -1, 0, \cdots, 0, 1, 0, 0)$, in which the nonzero entries denote the two endpoints of a directed edge. The $\ell_{21}$ smoothness, characterized by $\|\Delta X\|_{2,1}$, can induce row sparsity of $X$, which provides more precise sparsity than the Frobenius norm.
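To make the connection between the incidence matrix and the Laplacian quadratic form concrete, the following self-contained check (the toy graph and names are ours) verifies that $\|\Delta X\|_F^2 = \mathrm{tr}(X^T L X)$ for an undirected graph with one row of $\Delta$ per edge and $L = D - A$:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # a 4-cycle
n, m = 4, len(edges)
A = np.zeros((n, n))
Delta = np.zeros((m, n))
for k, (i, j) in enumerate(edges):
    A[i, j] = A[j, i] = 1.0
    Delta[k, i], Delta[k, j] = -1.0, 1.0   # oriented incidence row
L = np.diag(A.sum(axis=1)) - A             # graph Laplacian D - A

X = np.random.default_rng(1).standard_normal((n, 3))
lhs = np.linalg.norm(Delta @ X) ** 2       # ||Delta X||_F^2
rhs = np.trace(X.T @ L @ X)                # tr(X^T L X)
print(np.isclose(lhs, rhs))                # True
```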

B. GRAPH ADVERSARIAL ATTACK
An adversarial attack on graph-structured data [14] is an active field of graph learning. For a given graph-structured dataset $D = \{(c_j, G_j, y_j)\}$, if we slightly change $G_j$ into $\hat{G}_j$ such that the adversarial sample $\hat{G}_j$ remains similar to $G_j$ under some specified metric, the performance on the graph task becomes worse than before. This is the general adversarial attack problem over graph data. There exist many works on graph adversarial attacks, including [14] and [15]. To effectively generate adversarial samples, nodes or edges can be slightly changed, and the similarity can be measured by perturbation evaluation metrics. As in [14], imperceptible perturbations can be roughly categorized into four classes: node-level perturbation, edge-level perturbation, structure-preserving perturbation and attribute-preserving perturbation.

C. LOW-RANK MATRIX MINIMIZATION
The low-rank approximation model is formulated as
$$\min_X \; \mathcal{L}(X, Y) + \lambda\, \mathrm{rank}(X).$$
However, this optimization problem is difficult to solve. The nuclear norm is then considered as an approximation scheme because the nuclear norm is the convex envelope of the rank function on the unit ball of the matrix operator norm [16]. Hence, one can obtain a relaxed version,
$$\min_X \; \mathcal{L}(X, Y) + \lambda \|X\|_*,$$
where the nuclear norm is $\|X\|_* = \sum_i \sigma_i(X)$. The low-rank matrix approximation of a given matrix $Y \in \mathbb{R}^{n \times d}$ with the Schatten $p$-norm is described as
$$\min_X \; \mathcal{L}(X, Y) + \lambda \|X\|_{S_p}^p,$$
where $\mathcal{L}$ is the loss function, $p > 0$ and $\|X\|_{S_p} = \left(\sum_i \sigma_i(X)^p\right)^{1/p}$. It includes the nuclear norm ($p = 1$) and the Frobenius norm ($p = 2$) as two popular examples. Apart from the power form of singular values, a more general nonconvex and nonsmooth low-rank minimization is summarized in [17]:
$$\min_X \; \mathcal{L}(X, Y) + \lambda \sum_i g(\sigma_i(X)),$$
where $\lambda \geq 0$ is a nonnegative controlling parameter and $g$ is a penalty function. The usual examples of penalties include $L_p$ [18], SCAD [19] and Laplace [20]. The $L_p$ penalty is $g(\sigma) = \sigma^p$. The SCAD penalty is described as
$$g(\sigma) = \begin{cases} \lambda \sigma, & \sigma \leq \lambda, \\ \dfrac{-\sigma^2 + 2a\lambda\sigma - \lambda^2}{2(a-1)}, & \lambda < \sigma \leq a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & \sigma > a\lambda, \end{cases}$$
with $a > 2$. The Laplace penalty is $g(\sigma) = \lambda\left(1 - e^{-\sigma/\gamma}\right)$ with $\gamma > 0$.
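The following small sketch (ours, using the standard SCAD default $a = 3.7$) evaluates the three penalties above on a grid of singular values, which makes the piecewise SCAD definition concrete:

```python
import numpy as np

def lp_penalty(sig, p):
    """L_p penalty on a singular value."""
    return sig ** p

def scad_penalty(sig, lam, a=3.7):
    """SCAD penalty; continuous and flat beyond a*lam."""
    small = lam * sig
    mid = (-sig**2 + 2*a*lam*sig - lam**2) / (2*(a - 1))
    big = (a + 1) * lam**2 / 2
    return np.where(sig <= lam, small, np.where(sig <= a*lam, mid, big))

def laplace_penalty(sig, lam, gamma=1.0):
    """Laplace penalty; saturates at lam as sig grows."""
    return lam * (1.0 - np.exp(-sig / gamma))

sig = np.linspace(0.0, 5.0, 6)
print(lp_penalty(sig, 0.5), scad_penalty(sig, 1.0), laplace_penalty(sig, 1.0))
```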

D. OPTIMIZATION
Smooth optimization [22] and nonsmooth optimization [21] have been widely studied in machine learning. Some optimization problems can be decomposed as the sum of convex smooth and nonsmooth components. Mathematically,
$$\min_x \; f(x) + g(Ax),$$
where $f$ and $g$ are convex functions and $f$ is smooth. For the convex function $g$, its Fenchel conjugate is
$$g^*(y) = \sup_x \; \langle x, y\rangle - g(x).$$
Then, we can obtain the equivalent saddle point problem
$$\min_x \max_y \; f(x) + \langle Ax, y\rangle - g^*(y).$$
Candidate algorithms such as the Alternating Direction Method of Multipliers (ADMM) [23] and Newton-type methods [24] may work, but they may contain the task of finding the solution to some nontrivial subproblem with a heavy computational burden, and solving intermediate optimization problems may be incompatible with the standard back-propagation (BP) algorithms of general deep learning. The Proximal Alternating Predictor-Corrector (PAPC) [13] is a primal-dual optimization algorithm that has proven effective in the recent work ElasticGNN [12].

III. PROBLEM FORMULATION AND NOTATIONS
Let $G = (V, E, F)$ be a graph, where $V$ is the vertex set, $F$ is the collection of node features and $E$ is the edge set. $E = \{e_1, \cdots, e_m\}$ can be represented by a matrix $A \in \mathbb{R}^{n \times n}$ called the adjacency matrix of the graph $G$, where $n$ is the number of vertices. If nodes $v_i$ and $v_j$ in the vertex set $V$ are connected, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$. $E$ can also be described in another way: the edge set is characterized by the incidence matrix $\Delta \in \{-1, 0, 1\}^{m \times n}$, whose $i$-th row $\Delta_i$ is of the form $(0, 0, -1, 0, \cdots, 0, 1, 0, 0) \in \mathbb{R}^n$, in which $-1$ indicates the initial point of edge $e_i$ and $1$ denotes its terminal point. Let $\tilde{A} = I + A$, where $I$ is the identity matrix of order $n$. $\tilde{A}$ is the self-looped version of $A$ because every node in $V$ becomes self-connected. Let $\tilde{D}$ be the degree matrix w.r.t. $\tilde{A}$, and let $\tilde{L} = \tilde{D} - \tilde{A}$ be the Laplacian matrix. Given any matrix $B$, $\sigma_i(B)$ denotes its $i$-th largest singular value. For semi-supervised node classification, a subset of nodes is labeled as $\{y_1, \cdots, y_l\}$, where $Y$ denotes the label space, and a model is learned from these labeled nodes. With this learned model, the predicted label of each unlabeled node is produced. We also consider the robustness of the proposed approach under graph adversarial attacks.
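As a small illustration of these notations (our own snippet), note that the self-loops added in $\tilde{A}$ cancel in $\tilde{D} - \tilde{A}$, so $\tilde{L}$ coincides with $D - A$, which for the oriented $\pm 1$ incidence matrix also equals $\Delta^T \Delta$:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
A_tilde = A + np.eye(3)                      # self-looped adjacency
D_tilde = np.diag(A_tilde.sum(axis=1))       # degree matrix of A_tilde
L_tilde = D_tilde - A_tilde                  # Laplacian; self-loops cancel
L = np.diag(A.sum(axis=1)) - A
print(np.allclose(L_tilde, L))               # True
```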

IV. METHODOLOGY
In this paper, we propose modelling the smoothness by the Schatten $p$-norm, which differs from ElasticGNN [12]. For any $p > 0$ and $Z \in \mathbb{R}^{n \times d}$,
$$\|Z\|_{S_p} = \left(\sum_{i=1}^{\min\{n, d\}} \sigma_i(Z)^p\right)^{1/p}.$$
Mathematically, we consider the following graph signal denoising model:
$$\min_X \; \frac{1}{2}\|X - X_{\mathrm{input}}\|_F^2 + \frac{\lambda_1}{2}\,\mathrm{tr}(X^T \tilde{L} X) + \lambda_2 \|\Delta X\|_{S_p}^p, \quad (19)$$
where $X_{\mathrm{input}}$ is the initial graph signal, $p$ is a positive number, and $\lambda_1 > 0$ and $\lambda_2 > 0$ are two trade-off parameters. In this model, $\lambda_1$ enforces $\ell_2$ smoothness, while $\lambda_2$ enforces the property induced by the Schatten $p$-norm, i.e., the low-rank property, the $\ell_2$-smooth property or their mixture, depending on the value of $p$. We interpret this model in detail and devise a novel graph neural network later. The first term in the objective keeps $X$ close to $X_{\mathrm{input}}$. The term $\mathrm{tr}(X^T \tilde{L} X)$ can be expanded as
$$\mathrm{tr}(X^T \tilde{L} X) = \frac{1}{2}\sum_{i,j} \tilde{A}_{ij}\,\|x[i,:] - x[j,:]\|_2^2,$$
where $x[i,:]$ and $x[j,:]$ denote the $i$-th and $j$-th row vectors of $X$, respectively. This reveals that $\ell_2$ smoothness is directly implied by the second term in (19). By choosing a proper parameter $\lambda_1$, two connected nodes draw closer to each other in the search space, which is similar to manifold regularization [25]. The last term $\|\Delta X\|_{S_p}^p$ may have diverse meanings, depending on the value of $p$.
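As a sketch of how this objective could be evaluated in code (the function names, the $1/2$ factors and the argument layout follow our reconstruction of (19) above, not a reference implementation):

```python
import numpy as np

def schatten_p_power(Z, p):
    """p-th power of the Schatten p-norm: sum of singular values to the p."""
    s = np.linalg.svd(Z, compute_uv=False)
    return float((s ** p).sum())

def denoising_objective(X, X_input, L, Delta, lam1, lam2, p):
    """Objective of the graph signal denoising model (19), as reconstructed."""
    fit = 0.5 * np.linalg.norm(X - X_input) ** 2       # fidelity term
    smooth = 0.5 * lam1 * np.trace(X.T @ L @ X)        # l2 graph smoothness
    low_rank = lam2 * schatten_p_power(Delta @ X, p)   # Schatten term
    return fit + smooth + low_rank
```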
There exist some particular cases, as follows. Recall that when $p$ tends to zero, for any $Z \in \mathbb{R}^{m \times d}$, we have
$$\lim_{p \to 0} \|Z\|_{S_p}^p = \sum_i \lim_{p \to 0} \sigma_i(Z)^p = \mathrm{rank}(Z).$$
If $p$ is equal to 1, then the Schatten 1-norm is the standard nuclear norm, i.e.,
$$\|Z\|_{S_1} = \sum_i \sigma_i(Z) = \|Z\|_*.$$
If $p = \infty$, then the Schatten $\infty$-norm is the operator norm, namely,
$$\|Z\|_{S_\infty} = \sigma_1(Z).$$
According to [26], we have the following inequalities: for $1 \leq p \leq q$,
$$\|Z\|_{S_q} \leq \|Z\|_{S_p}.$$
Set $Z = \Delta X$. It is readily seen from the meaning of $\Delta$ that every row of $Z$ is of the form $x[j,:] - x[i,:]$ if there exists a directed edge $e_{i \to j} \in E$ from node $i$ to node $j$. For an undirected graph, each edge can be decomposed into two directed edges with opposite orientations. In other words, if nodes $v_i$ and $v_j$ are connected, then two directed edges $E_{i \to j}$ and $E_{j \to i}$ emerge.
When $p$ is small, the last regularization term in (19) induces the low-rank property of $\Delta X$. When $p \geq 2$, $\|Z\|_{S_p} \leq \|Z\|_{S_2}$, which implies that the second term in (19) dominates the last term; in this case, the model (19) is just $\ell_2$-smooth. It is well known that the Schatten $p$-norm is convex when $p \geq 1$, which is friendly to the usual optimization strategies. Therefore, we restrict the value of $p$ to the interval $[1, 2)$. When $p \in [1, 2)$, we have $\|Z\|_{S_2} \leq \|Z\|_{S_p} \leq \|Z\|_{S_1}$, which means that the last term in (19) induces a mixed property between the low rank and sparsity of $\Delta X$. In the next section, we propose a novel graph neural network architecture from the optimization of (19).

V. THE PROPOSED ARCHITECTURE OF THE GRAPH NEURAL NETWORKS
A. REFORMULATION AS SADDLE POINT PROBLEM
The procedure is displayed in Figure 1. The method for solving the optimization problem (19) is the key to deducing the Schatten graph neural networks. To solve problem (19), we consider its saddle point formulation. Let
$$G(X) = \frac{1}{2}\|X - X_{\mathrm{input}}\|_F^2 + \frac{\lambda_1}{2}\,\mathrm{tr}(X^T \tilde{L} X) \quad (29)$$
and
$$g(Z) = \lambda_2 \|Z\|_{S_p}^p.$$
With these notations, the optimization problem (19) can be written as
$$\min_X \; G(X) + g(\Delta X).$$
Then (19) can be further reformulated as the saddle point problem
$$\min_X \max_Z \; G(X) + \langle \Delta X, Z\rangle - g^*(Z).$$

FIGURE 1. The illustration of the procedure. Graph neural networks can be constructed by taking inspiration from the optimization schemes for solving graph signal denoising problems. For example, GCN [5] can be regarded as a gradient descent scheme with a particular step size [12]. Analogously, our proposed Schatten graph neural networks can be derived from an optimization scheme (i.e., the PAPC scheme).

B. OPTIMIZATION SCHEME
Following [12], we use the modified Proximal Alternating Predictor-Corrector (PAPC) [13] scheme as follows:
$$\bar{X}^{(k+1)} = X^{(k)} - \gamma \nabla G(X^{(k)}) - \gamma \Delta^T Z^{(k)}, \quad (34)$$
$$Z^{(k+1)} = \mathrm{prox}_{\beta g^*}\left(Z^{(k)} + \beta \Delta \bar{X}^{(k+1)}\right), \quad (35)$$
$$X^{(k+1)} = X^{(k)} - \gamma \nabla G(X^{(k)}) - \gamma \Delta^T Z^{(k+1)}. \quad (36)$$
Equation (35) involves the proximal operator of the Fenchel conjugate $g^*$, which is not easy to compute directly. By employing Moreau's decomposition [27], we have
$$\mathrm{prox}_{\beta g^*}(Z) = Z - \beta\, \mathrm{prox}_{\beta^{-1} g}(Z/\beta). \quad (37)$$
Thus, if we can solve the proximal operator of $g$, the subproblem (35) is readily solved.
Recall that $g(Z) = \lambda_2 \|Z\|_{S_p}^p$. Then the proximal operator with respect to $g$ is
$$\mathrm{prox}_{\beta^{-1} g}(Z) = \arg\min_W \; \frac{1}{2}\|W - Z\|_F^2 + \frac{\lambda_2}{\beta}\|W\|_{S_p}^p. \quad (39)$$
By [39], it is closely related to the scalar problems
$$\min_{\delta \geq 0} \; \frac{1}{2}\left(\delta - \sigma_i(Z)\right)^2 + \frac{\lambda_2}{\beta}\,\delta^p$$
over the singular values of $Z$. In Section VI, we provide an efficient fixed-point iteration algorithm to solve this problem with a strong convergence analysis. Once $\mathrm{prox}_{\beta^{-1} g}(Z)$ is solved, $\mathrm{prox}_{\beta g^*}(Z)$ follows from (37).

C. NETWORK ARCHITECTURE
Based on the selected optimization scheme, we deduce a novel message-passing-based graph neural network architecture. Let $f(X) = G(X)$, where the function $G(\cdot)$ comes from Eq. (29). The first-order derivative of $G$ is
$$\nabla G(X) = X - X_{\mathrm{input}} + \lambda_1 \tilde{L} X. \quad (40)$$
Inserting (40) into (34) and (36), and combining (34)-(36) with (37), the optimization scheme can be written explicitly. It is sufficient to take $\gamma = \frac{1}{1 + \lambda_1}$ and $\beta = \frac{1}{2\gamma}$ for convergence (this will be proved in Theorem 1). With this in mind, we have the following scheme:
$$\bar{X}^{(k+1)} = \gamma\, X_{\mathrm{input}} + \gamma\lambda_1 (I - \tilde{L}) X^{(k)} - \gamma \Delta^T Z^{(k)},$$
$$V^{(k+1)} = Z^{(k)} + \beta \Delta \bar{X}^{(k+1)},$$
$$Z^{(k+1)} = V^{(k+1)} - \beta\, \mathrm{prox}_{\beta^{-1} g}\left(V^{(k+1)}/\beta\right),$$
$$X^{(k+1)} = \gamma\, X_{\mathrm{input}} + \gamma\lambda_1 (I - \tilde{L}) X^{(k)} - \gamma \Delta^T Z^{(k+1)}. \quad (45)$$
If $X^{(k)}$ and $Z^{(k)}$ are regarded as the node embedding of the $k$-th layer and the connection parameters, respectively, then we can construct a graph neural network by stacking such layers. For intuition, we rewrite (45) in the language of graph neural networks and provide the network architecture in Figure 2.
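To make one propagation layer concrete, here is a minimal NumPy sketch of a single step of the scheme (45) as reconstructed above; all names are ours, and the proximal operator uses the scalar fixed-point iteration of Section VI applied to each singular value:

```python
import numpy as np

def prox_schatten(Z, tau, p, n_fp=30, eps=0.1):
    """prox of tau * ||.||_{S_p}^p: SVD plus a fixed point on each singular value."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    d = np.maximum(s, eps)
    for _ in range(n_fp):                        # delta <- sigma - tau*p*delta^(p-1)
        d = np.maximum(s - tau * p * d ** (p - 1), eps)
    return (U * d) @ Vt                          # U @ diag(d) @ Vt

def schatten_layer(X, Z, X_input, L, Delta, lam1, lam2, p):
    """One layer of the PAPC-based scheme (45), as reconstructed here."""
    gamma = 1.0 / (1.0 + lam1)
    beta = 1.0 / (2.0 * gamma)
    grad = X - X_input + lam1 * (L @ X)          # gradient of G at X, Eq. (40)
    X_bar = X - gamma * grad - gamma * Delta.T @ Z
    V = Z + beta * (Delta @ X_bar)
    Z_new = V - beta * prox_schatten(V / beta, lam2 / beta, p)  # Moreau, Eq. (37)
    X_new = X - gamma * grad - gamma * Delta.T @ Z_new
    return X_new, Z_new
```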

VI. CONVERGENCE ANALYSIS
In this section, we provide the convergence guarantee of the iteration scheme (45) in Theorem 1. Since (45) contains the subproblem of solving the proximal operator related to (39), we also propose a fixed-point iteration scheme to solve (39) and analyze its linear convergence rate.
Lemma 1: Let $u(W) = \|W\|_{S_p}^p$ be the function from the graph signal denoising model (19). The function $u$ is convex for $p \geq 1$.
Proof: Recall from [40] that the Schatten $p$-norm $\|\cdot\|_{S_p}$ is convex for $p \geq 1$. Note that $l(t) = t^p$ is convex and nondecreasing over $\{t \in \mathbb{R} : t \geq 0\}$ for $p \geq 1$ because $l''(t) = p(p-1)t^{p-2} \geq 0$ and $l'(t) = pt^{p-1} \geq 0$. The function $u$ can be decomposed as the composition of $l$ and $\|\cdot\|_{S_p}$, i.e., $u(\cdot) = l \circ \|\cdot\|_{S_p}$. Since the composition of a nondecreasing convex function with a convex function is still convex, $u(\cdot)$ is convex.
Theorem 1: With $\gamma = \frac{1}{1+\lambda_1}$ and $\beta = \frac{1}{2\gamma}$, the iteration scheme (45) converges.
Proof: By Lemma 1, the function $u$ is convex in the graph signal denoising model (19). The objective function in (19) is the sum of $f(X)$ and $u(\Delta X)$, where $\Delta$ is a bounded linear operator. The gradient $\nabla f$ satisfies a Lipschitz condition. By [41] and [42], the iteration scheme (45) converges under $\gamma < \frac{2}{L}$ and $\beta < \frac{4}{3\gamma\,\lambda_{\max}(\Delta^T \Delta)}$, where $L$ is the Lipschitz constant. $L$ can be computed as $L = \lambda_{\max}(\nabla^2 f(X)) = 1 + \lambda_1 \|\tilde{L}\|_2$. For a matrix $R$, let $\|R\|_2$ denote its spectral norm. Note that $\|\tilde{L}\|_2 = \|\Delta^T \Delta\|_2 = \|\Delta\|_2^2 \leq 2$. Hence the given values of $\gamma$ and $\beta$ satisfy the two conditions, respectively. Note that there exists the subproblem of solving the proximal operator; the key is to solve (39). We find that (39) can be efficiently solved by fixed-point iteration. As a matter of fact, for a sufficiently small positive number $\varepsilon$, problem (39) has an identical solution to
$$\min_{\delta \geq \varepsilon} \; h(\delta) := \frac{1}{2}(\delta - \sigma)^2 + \lambda \delta^p \quad (46)$$
applied to each singular value $\sigma$ of $Z$, where $\lambda = \lambda_2/\beta$. In practice, we find that $\varepsilon = 0.1$ works well. We propose the fixed-point iteration as follows:
$$\delta_{k+1} = T(\delta_k), \quad \text{where } T(\delta) = \max\left\{\sigma - \lambda p\, \delta^{p-1},\ \varepsilon\right\}. \quad (47)$$
The following theorem provides the global convergence analysis with a convergence rate.
Theorem 2: Suppose $\rho := \lambda p(p-1)\,\varepsilon^{p-2} < 1$. Then the sequence $\{\delta_k\}$ generated by (47) converges to the minimizer $\delta^*$ of (46) at the linear rate $O(\rho^k)$.
Proof: We have
$$\delta_{k+1} - \delta^* = T(\delta_k) - \delta^* = T(\delta_k) - T(\delta^*),$$
where the first equality holds by (47) and the second equality is true by the definition of the fixed point. Now we need to show that $\delta^*$ is indeed the solution to (46). The second-order derivative of $h$ is
$$h''(\delta) = 1 + \lambda p(p-1)\,\delta^{p-2}.$$
Note that for $\delta \geq \varepsilon$ and $p \in [1, 2)$,
$$\lambda p(p-1)\,\delta^{p-2} \leq \lambda p(p-1)\,\varepsilon^{p-2} = \rho.$$
Then $h''(\delta) \in (1 - \rho, 1 + \rho)$.
The initial value is set to 10. The minimizer is $\delta = 9.8084$. We display the iteration process in Figure 3. The iteration converges within very few iterations. This example empirically shows the efficiency of the proposed iteration scheme.
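A minimal reproduction of this kind of scalar experiment is given below; the values of $\sigma$, $\lambda$ and $p$ are illustrative assumptions of ours, not the ones used for Figure 3:

```python
sigma, lam, p = 10.0, 0.05, 1.5   # assumed illustrative problem data
delta = 10.0                      # initial value, as in the text
for k in range(15):
    delta = sigma - lam * p * delta ** (p - 1)   # Eq. (47) without the eps floor
print(delta)  # settles to the minimizer within a handful of iterations
```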

VII. COMPUTATIONAL COMPLEXITY
In this section, we provide the analysis of the computational complexity of the iteration scheme Eq. (45). Per layer, the dominant costs are the propagation steps and the SVD-based proximal step, giving an overall complexity of $O\left(\left((n^2 + mn)d + md^2 + \min\{m, d\}\,T_{\max}\right)K_{\max}\right)$, where $T_{\max}$ is the maximum number of fixed-point iterations and $K_{\max}$ is the number of layers.

VIII. EXPERIMENTS
In this section, extensive experiments are performed to verify the effectiveness of the proposed Schatten graph neural networks. We describe the datasets and baselines used, and formulate the parameter setting strategy in detail. The robustness of the proposed Schatten graph neural networks is considered. We also provide an ablation study to measure the impact of the parameters.

A. DATASETS AND BASELINES
We conduct experiments on eight graph-structured datasets that contain three citation graphs (Cora, Citeseer, Pubmed [29]), two co-authorship graphs (Coauthor CS and Coauthor Physics [30]), two co-purchase graphs (Amazon Computers and Amazon Photo [30]), and one blog graph (Polblogs [31]). In the Polblogs graph, node features are not available, so we specify the feature map as an identity matrix. For all of these datasets, statistics such as classes, edges and features are displayed in Table 1.
The baselines are selected from recent approaches, namely GCN [5], GAT [10], SGC [32], GraphSAGE [33], APPNP [34], ElasticGNN [12] and EigenGCN [35]. For a fair comparison, a two-layer network architecture with 64-dimensional hidden representations is adopted in all models. This setting follows Elastic GNN [12]. We choose classification accuracy as the comparison criterion of performance.

B. PARAMETER SETTING AND SUMMARY
The average performance together with the standard deviation over 10 runs is reported in Table 2. The learning rate is selected from {0.05, 0.01, 0.005}. The weight decay is tuned over the set {5 × 10−4, 5 × 10−5, 5 × 10−6}. The Adam optimizer is used in our experiments; this choice is based on our experience, as we found that the Adam optimizer typically returns a better local minimizer than the SGD optimizer. The dropout rate lies in the set {0.5, 0.8}. The hidden dimension of the node embedding is fixed at 64. The number of layers K of the proposed Schatten GNN is varied in the ablation study (Section VIII-D). The proposed Schatten GNN is derived from the graph signal processing problem Eq. (19) with a specific optimization scheme Eq. (45). In fact, all the chosen comparison methods in Table 2 can be derived from graph signal processing problems with different regularizers and optimization schemes. The difference between the proposed Schatten GNN and the recent Elastic GNN [12] is the last regularization term in Eq. (19). The experimental results reveal that the mixed property of low rank and ℓ2 smoothness makes the performance of graph neural networks stronger than ℓ2 smoothness alone. All experiments are conducted on one Tesla V100 GPU. The average running time is within a few minutes per task.

C. PARAMETER SENSITIVITY
The parameter sensitivity curves are shown in Figures 4-7. The performance becomes better as λ1 takes larger values, because a larger λ1 imposes a stronger ℓ2 smoothness constraint. When λ2 takes a small value, the performance is good. As p approaches 2, the Schatten p-norm moves away from the low-rank property and closer to (though looser than) ℓ2 smoothness; in this case, the proposed approach achieves its best performance.

D. ABLATION STUDY
We perform an ablation study as follows. For the graph signal denoising problem with the Schatten p-norm, we set λ2 = 0 and observe the performance. In this case, graph signal denoising reduces to ℓ2 norm-based graph smoothing. We show how the performance of the proposed approach changes with the number of layers K in Figure 8.

E. ROBUSTNESS UNDER GRAPH ADVERSARIAL ATTACK
The robust performance of the proposed Schatten GNN under graph adversarial attacks is examined. An attack harms the GNN model's performance by slightly modifying the underlying graph structure. We adopt MetaAttack [36] from DeepRobust [37], a PyTorch library for adversarial attacks and defenses, to generate adversarially attacked graphs for Cora, CiteSeer, Polblogs and PubMed. We randomly split 10%/10%/80% of the nodes into a training set, validation set and test set, respectively. The statistics of the modified graphs are listed in Table 3. Following [36] and [38], the largest connected component (LCC) is used in the adversarial graphs. We are only concerned with the robustness conferred by the graph smoothing itself, without any dedicated adversarial defense. The experimental results are listed in Table 4. The proposed approach achieves the best robustness. This may be attributed to the balance between the Schatten p-norm for p ∈ [1, 2) and ℓ2 norm-based graph smoothing.

F. SUMMARIZATION AND ANALYSIS
GCN [5] can be regarded as the base architecture in the experiments. In fact, it can be obtained by taking λ2 = 0 in the model (19) and performing a gradient descent scheme with a particular step size [12]. The proposed network architecture of Schatten GNN is determined by the graph signal denoising problem (19). In this sense, GCN is just a special case of Schatten GNN with λ2 = 0.
Based on the experimental results in Table 2 and Table 4, the proposed Schatten GNN achieves the best performance compared with recent GNNs. By [26], the Schatten p-norm lies between the Frobenius norm and the nuclear norm when p lies in the interval [1,2). The Frobenius norm induces ℓ2 smoothness, while the nuclear norm has an affinity with the low-rank property. Hence the Schatten p-norm indicates a mixed property of ℓ2 smoothness and low rank. Elastic GNNs [12] impose ℓ1 sparsity, and hence consider a mixed property of ℓ1 and ℓ2 sparsity. We can see that the proposed method outperforms the Elastic GNNs by a small margin. This is probably because the low-rank prior on the signals is slightly better suited than the ℓ1 sparsity prior.

IX. CONCLUSION
Message passing networks are among the most important graph neural networks. In this paper, we proposed a novel message-passing network, called the Schatten graph neural network, which is derived from a newly proposed graph signal denoising problem with the Schatten p-norm. The proximal operator in the intermediate steps is difficult to solve, so we proposed a novel fixed-point iteration scheme for which a linear convergence rate O(ρ^k) was theoretically proved. Extensive experiments indicated that the proposed approach outperforms the state-of-the-art approaches and is robust under graph adversarial attacks.