Combinatorial Learning of Robust Deep Graph Matching: An Embedding Based Approach

Graph matching aims to establish node correspondences between two graphs, and is a fundamental problem owing to its NP-hard nature. One practical consideration is the effective modeling of the affinity function in the presence of noise, such that the mathematically optimal matching result is also physically meaningful. This paper resorts to deep neural networks to learn the node and edge features, as well as the affinity model for graph matching, in an end-to-end fashion. The learning is supervised by a combinatorial permutation loss over nodes. Specifically, the learnable parameters belong to the convolutional neural networks for image feature extraction, the graph neural networks for node embedding that convert the structural (beyond second-order) information into node-wise features, leading to a linear assignment problem, as well as the affinity kernel between the two graphs. Our approach enjoys flexibility in that the permutation loss is agnostic to the number of nodes, and the embedding model is shared among nodes, such that the network can deal with varying numbers of nodes for both training and inference. Moreover, our network is class-agnostic. Experimental results on extensive benchmarks show its state-of-the-art performance. It exhibits some generalization capability across categories and datasets, and is capable of robust matching against outliers.


INTRODUCTION AND PRELIMINARIES
Graph matching (GM) aims to solve the problem of finding node correspondences over two or multiple graphs. It incorporates both node-wise unary similarity and edge-wise [1], [2] (or even higher-order [3], [4], [5]) similarity to establish a matching, in order to maximize the similarity between the matched graphs. By encoding the structural information in the objective, graph matching can often achieve more robust performance against disturbance. In contrast, point based methods, e.g. RANSAC [6] and iterative closest point (ICP) [7], do not explicitly account for such edge-to-edge information. For its expressiveness, graph matching lies at the heart of many computer vision applications [8] such as action recognition, robotics, visual tracking, and weak-perspective 3-D reconstruction. Refer to [9] for a more comprehensive literature review.
As a classic combinatorial problem, graph matching is known to be NP-hard in general [10], and has mostly been addressed by approximate techniques leading to inexact solutions. Consider the classic setting of two-graph matching between $G_1$, $G_2$, which can be generally expressed in the quadratic assignment programming (QAP) form:

$\max_{X}\ \mathrm{vec}(X)^\top K\, \mathrm{vec}(X), \quad \text{s.t.}\ X \in \{0,1\}^{N_1 \times N_2},\ X\mathbf{1} = \mathbf{1},\ X^\top \mathbf{1} \le \mathbf{1},$  (1)

where $X$ is a binary permutation matrix encoding the node correspondence, and the so-called affinity matrix $K \in \mathbb{R}^{N_1 N_2 \times N_1 N_2}$ encodes the node-to-node and edge-to-edge affinities between the two graphs by its diagonal and off-diagonal elements, respectively. To calculate $K$, a practical form is the Gaussian kernel $K_{ia,jb} = \exp\big(-\frac{(f_{ij} - f_{ab})^2}{\sigma^2}\big)$, where $f_{ij}$ denotes the feature of edge $ij$. Note that node similarity can also be encoded, given that the index condition $ia = jb$ holds. In particular, Eq. 1 is called Lawler's QAP [11], which can incorporate other more specific forms, e.g., Koopmans-Beckmann's QAP [10]:

$\max_{X}\ \mathrm{tr}(X^\top F_1 X F_2) + \mathrm{tr}(K_p^\top X),$  (2)

where $F_1 \in \mathbb{R}^{N_1 \times N_1}$, $F_2 \in \mathbb{R}^{N_2 \times N_2}$ are the weighted adjacency matrices of graphs $G_1$, $G_2$ respectively, and the matrix $K_p$ denotes the node-to-node affinity. The connection to Lawler's QAP becomes clear by setting $K = F_2 \otimes F_1$. Another popular formulation is the so-called factorized graph matching model [12], which shows how to factorize the affinity matrix $K$ into Kronecker products of smaller matrices. For conciseness, here we write the undirected version [12]:

$K = (H_j \otimes H_i)\, \mathrm{diag}(\mathrm{vec}(L))\, (H_j \otimes H_i)^\top, \quad L = \begin{bmatrix} K_q & 0 \\ 0 & K_p \end{bmatrix},$  (3)

where $H_k = [G_k, I_{n_k}] \in \{0,1\}^{n_k \times (m_k + n_k)}$, $k = i, j$, and $n_i$ and $m_i$ are the numbers of nodes and edges in graph $G_i$ respectively, while $\otimes$ is the Kronecker product between matrices. $K_p \in \mathbb{R}^{n_i \times n_j}$ denotes the node affinity matrix, and $K_q \in \mathbb{R}^{m_i \times m_j}$ the edge affinity matrix. The graph structure is specified by the node-edge incidence matrix $G \in \mathbb{R}^{n \times m}$, such that the non-zero elements in each column of $G$ indicate the starting and ending nodes of the corresponding edge. The factorization provides a taxonomy for GM and reveals the connections among several methods.
Readers are referred to [12] for further details.
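To make the Lawler QAP objective and the Gaussian edge kernel above concrete, the following is a minimal NumPy sketch that builds an affinity matrix $K$ from scalar edge features and evaluates $\mathrm{vec}(X)^\top K\,\mathrm{vec}(X)$. The function names, the use of scalar edge features (with diagonal entries acting as node features), and the column-major vectorization convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_affinity(F1, F2, sigma=1.0):
    """Build a Lawler-style QAP affinity matrix K for two graphs.

    F1: (N1, N1) edge feature matrix of graph 1 (F1[i, j] = f_ij; diagonal
        entries act as node features); F2: (N2, N2) likewise for graph 2.
    K[a*N1 + i, b*N1 + j] holds the affinity between pair (i, j) in graph 1
    and pair (a, b) in graph 2, via a Gaussian kernel on edge features.
    """
    N1, N2 = F1.shape[0], F2.shape[0]
    K = np.zeros((N1 * N2, N1 * N2))
    for i in range(N1):
        for j in range(N1):
            for a in range(N2):
                for b in range(N2):
                    # diagonal (ia == jb) encodes node-to-node affinity,
                    # off-diagonal entries encode edge-to-edge affinity
                    K[a * N1 + i, b * N1 + j] = np.exp(
                        -(F1[i, j] - F2[a, b]) ** 2 / sigma ** 2)
    return K

def qap_score(K, X):
    """vec(X)^T K vec(X) for a (soft) assignment matrix X of shape (N1, N2)."""
    x = X.reshape(-1, order='F')  # column-major vectorization
    return float(x @ K @ x)
```

For two identical graphs with distinct node self-features, the identity assignment scores higher than any permuted one, which is the behavior the QAP objective is designed to reward.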
In parallel, recent efforts also explore higher-order affinity information beyond the second order. Tensor marginalization is a popular technique widely used in hypergraph matching works [5], [13], [14], [15]:

$x^* = \arg\max_x\ (H \otimes_1 x \otimes_2 x \cdots \otimes_m x), \quad \text{s.t.}\ x = \mathrm{vec}(X),\ X\mathbf{1} = \mathbf{1},\ X^\top \mathbf{1} \le \mathbf{1},$  (4)

where $m$ is the order of the affinity model, $H$ is the $m$-order affinity tensor which encodes hyperedge affinities, and $\otimes_k$ is the tensor product along mode $k$ [3]. Readers are referred to Section 3.1 in [14] for details on tensor multiplication. In particular, these works all assume the affinity tensor is invariant with respect to the indices within a hyperedge pair, to facilitate the optimization. Meanwhile, there are works on multi-graph matching, aiming to establish node correspondences among multiple graphs. Two typical methodologies have been developed: i) transforming the multi-graph problem into a two-graph matching QAP in each iteration [16], [17]; ii) obtaining putative matchings by first solving two-graph matching problems, followed by a smoothing post-processing step [18], [19], [20], [21]. Compared with two-graph matching, the advantage of jointly matching multiple graphs lies in that information across graphs can be fused to resist outliers and noise.
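As an illustration of tensor marginalization, the sketch below evaluates a third-order multilinear objective and runs a power-iteration style update. The tensor shape, the uniform initialization, and the plain L2 renormalization are simplifying assumptions for exposition, not any specific hypergraph matching solver from [5], [13], [14], [15].

```python
import numpy as np

def hypergraph_score(H, x):
    """Multilinear objective for a third-order affinity tensor:
    H ⊗_1 x ⊗_2 x ⊗_3 x, with H of shape (n, n, n) over candidate
    assignments and x an (n,) soft assignment vector."""
    return float(np.einsum('ijk,i,j,k->', H, x, x, x))

def tensor_power_iteration(H, n_iter=50):
    """A power-iteration style heuristic: repeatedly marginalize the
    tensor against the current assignment vector and renormalize."""
    n = H.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        x = np.einsum('ijk,j,k->i', H, x, x)  # mode-1 marginalization
        x = x / np.linalg.norm(x)
    return x
```

With non-negative affinities the iterate stays non-negative, and the mass concentrates on the assignment supported by the dominant hyperedge affinities.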
Moreover, in all the cases of two-graph, hyper-graph and multi-graph matching, it is a fundamental requirement to devise an appropriate node/edge feature extractor and affinity model that transform the real-world matching task into an equivalent mathematical model, in the sense that the optimal solution corresponds to the meaningful and correct matching. However, the challenge lies in that manually designed models (e.g., a Gaussian kernel with Euclidean distance) may not be expressive enough to fit the problem and data at hand, especially in the presence of outliers and noise.
Seeing these issues, a recent thread of works aims to learn the affinity model for graph matching, so as to improve the flexibility and capacity of affinity modeling. The hope is that the bias introduced by a manually predefined affinity model can be effectively removed, and that the mathematically optimal solution (according to the affinity model and the corresponding objective) can truly reflect the meaningful correspondence. Indeed, such learning based works, which are the focus of this paper, are somewhat orthogonal to those focusing on devising effective solvers given a predefined affinity model [1], [2], [3], [14].
In summary, we have devised an end-to-end learnable, supervised deep network for graph matching, a problem known to be NP-hard in general. The presented work is an extended version 1 of the preliminary conference version [22], and it involves the following features: i) We transform the graph matching problem into a linear assignment one by utilizing a graph embedding network, specifically a graph convolutional network, to extract the graph structure into node-wise feature vectors, i.e., graph embeddings. The embedded node features are expected to contain the structural information around each node, such that even higher-order (beyond second-order) information can be incorporated in the matching procedure. In this way, the model circumvents the notoriously challenging QAP problem. Such a design also allows for different numbers of nodes in different graph pairs for training and testing. To the best of our knowledge, this is the first time a deep graph embedding network has been adopted for learning graph matching. ii) Combined with our embedding model, a Sinkhorn net based permutation loss for combinatorial optimization is developed. It applies Sinkhorn iterations on the input non-negative matrix to obtain a doubly-stochastic matrix as the soft matching, such that the cross-entropy loss, often used for classification, can be readily used to measure the difference between the soft matching matrix and the ground truth. Such a combinatorial loss also allows for the flexible handling of varying-sized graphs, which is a persistent challenge in graph matching. For instance, in the structured support vector machine based learning model [23], the number of nodes has to be fixed for training a model, a condition that can be unrealistic in practice. To our knowledge, this is the first work to adopt the above permutation loss for graph matching.
iii) Experimental results, including ablation studies, show the effectiveness of our devised components: the permutation loss, the graph convolutional network based node embedding, the convolutional network based node-wise image feature extraction layer, the cross-graph affinity module, and the iterative cross-graph embedding component. In particular, our method outperforms the state-of-the-art deep network based peer method [24] in terms of matching accuracy. Our method also outperforms [23] in accuracy while being more flexible, as the method in [23] assumes a fixed number of nodes for matching in both the training and testing sets. A detailed study on transfer learning capability shows the robustness of our approach in the presence of outliers, when the training and testing sets are from different object categories, as well as from distinct datasets.

1. Compared with [22], the extensions include: i) a new iterative cross-graph embedding technique with the resulting method IPCA-GM; ii) an added treatment incorporating dummy nodes in the Sinkhorn network against outliers, with new experimental results; iii) a comprehensive study on the behavior of our model under transfer learning across datasets, categories and with/without outliers, and studies on the learned CNN module and embedding module, respectively; and iv) an extended literature review on graph matching that covers both learning-free and learning based approaches. These new results highlight the robustness of our approach in practice.

The paper is organized as follows. Section 2 discusses the related work on graph matching for both learning-free and learning based methods. The main approach is presented in Section 3, which is evaluated in Section 4 against peer methods. Section 5 concludes this paper.

RELATED WORK
Graph matching has been a long-standing problem in vision and pattern recognition. There is a recent trend toward learning based graph matching, which is the focus of this paper.
In this section, we first review learning-free methods over the decades, and then discuss the recent line of research on learning based graph matching. Readers are referred to the survey [9] for a detailed review. For learning, the basic idea is to model the feature extraction and representation of graph nodes and edges, as well as the affinity function between graphs. Moreover, the solver itself can also be learned in different ways. Some related techniques adopted in our approach are also discussed.

Learning-Free Graph Matching
Over the decades, the major line of research on graph matching has focused on solving the constrained combinatorial optimization problem, assuming the affinity model is given. Along this direction, we review the related works in three aspects: i) two-graph matching, the classic setting for graph matching; ii) hypergraph matching, namely higher-order graph matching, whereby higher-order edge information is used in matching, in contrast to the first case, in which only up-to-second-order affinity is considered; iii) joint matching of multiple graphs and its online incremental matching setting.

Two-Graph Matching
Most existing works deal with the two-graph matching problem, which is NP-hard in general. Different (deterministic) optimization based techniques have been devised. An annealing based continuation method is adopted in early work [2], which addresses the second-order matching problem by iteratively formulating and solving a linear assignment problem. This strategy is widely adopted in follow-up works [1], [25], and the quadratic assignment model is explicitly formulated in [26] by introducing the so-called affinity matrix. Based on this compact form, different relaxation techniques have been explored; a popular treatment is to relax the permutation matrix into a doubly-stochastic one, which is in fact its convex hull in the continuous space [27]. Meanwhile, factorization of the affinity matrix has been studied to reduce the space complexity of dealing directly with the affinity matrix, and a continuation method based on convex-concave relaxation [28] has been developed. Some (quasi-)discrete methods [29], [30] have also been developed to avoid the truncation from a continuous solution to a binary one, which can incur unexpected accuracy loss. More recently, state-of-the-art graph matching solvers [31], [32], [33] have been presented with high accuracy, albeit at a heavy computational cost.

Hypergraph Matching
Beyond second-order edge information, hyperedges are also considered to improve the model's robustness and expressiveness [3], [15], [34]. Though some accuracy improvements have been achieved, the additional cost involved is not negligible. To address this issue, the authors in [4] show a technique to decompose the higher-order (up to fourth-order) model into a second-order one, such that the problem can be solved using classic second-order graph matching solvers. In another work [5], a discrete method is devised for more efficient hyperedge based matching.

Multiple Graph Matching
Recently, joint matching of multiple graphs has received more attention. This setting is especially pronounced in real-world scenarios where a batch of graphs needs to be matched. Also, accessing multiple graphs simultaneously provides the opportunity to fuse the local information from each graph, which can be noisy or even fundamentally ambiguous and difficult to process independently. These works usually fall into two categories. The first divides the matching into two stages [18], [20], [35]: first performing two-graph matching independently and then smoothing the pairwise matchings by global cycle-consistency.
Here the concept of consistency refers to the correspondence loop closure across three or more graphs. Other works [16], [17] incorporate cycle-consistency into the matching process over iterations, in addition to maximizing the affinity objective.
Graph data can arrive in a streaming fashion, such that an incremental matching approach is desirable to efficiently handle newly arriving graphs together with existing ones and their matching results. The authors in [36] give one such solution by clustering the existing graphs into groups based on cluster-wise randomness and diversity, and incorporating a newly arrived graph by matching it with one of the clusters.

Modeling and Learning Affinity
To address the challenge of mimicking the real-world setting of the graph matching problem, learning the node-wise and edge-wise affinity has been an effective way of utilizing training data, either in a supervised [23], [24], [37], unsupervised [38], or semi-supervised [38] way. This is in contrast to the aforementioned methods that use simple and predefined parametric models, which often involve a Gaussian kernel in the Euclidean space for feature distance calculation.
One shallow and unified treatment is to design an affinity function $\Phi(G_1, G_2, \pi)$ in the following form [23], where $\pi$ is the node mapping function from one graph to the other, and $s_v$ denotes the similarity between node $v$ and its mapping $\pi(v)$ in the other graph, which carry attributes (or features) $a_v$ and $a_{\pi(v)}$ respectively. A similar notation holds for the edge similarity $s_e$ between edge attributes $a_{uv}$ and $a_{\pi(u)\pi(v)}$. One can compute the accumulated similarity by re-weighting and summing all the elements of $\Phi(G_i, G_j, \pi)$ with a weighting column vector $\beta$, so that $S(G_i, G_j, \pi; \beta) = \beta^\top \Phi(G_i, G_j, \pi)$. As observed in [23], this simple model can incorporate most previous shallow learning models [37], [38], [39]. Specifically, [37] uses a 60-dimensional node similarity $s_v$ for feature points and a binary similarity function $s_e$ for edges. In [38], a multi-dimensional $s_e$ is adopted to measure similarity while $s_v$ is ignored. In [39], 2-dimensional $s_v$ and $s_e$ are devised to model appearance similarity, occlusion, and geometric compatibility.
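The re-weighting model $S(G_i, G_j, \pi; \beta) = \beta^\top \Phi$ can be sketched in a few lines. The array shapes below are hypothetical, and the node and edge similarity vectors are assumed to be precomputed for a fixed mapping $\pi$.

```python
import numpy as np

def shallow_matching_score(node_sims, edge_sims, beta):
    """Shallow affinity model in the spirit of [23]: stack the node
    similarities s_v and edge similarities s_e induced by a mapping pi
    into a feature vector Phi, then re-weight and sum with a learned
    vector beta, i.e. S = beta^T Phi."""
    phi = np.concatenate([node_sims, edge_sims])
    return float(beta @ phi)
```

With a uniform weight vector the score reduces to plain summation of all similarities; learning $\beta$ re-weights which similarity channels matter.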
Differing from the above non-deep learning models, the seminal work [24] shows how to integrate the feature extraction, affinity computation and spectral matching components into an end-to-end learnable pipeline via deep neural networks. Graph embedding is not considered in their approach, and a regression based offset loss is used. In particular, we argue that a combinatorial loss can be a better choice, as devised in this paper.

Graph Neural Networks and Embedding
Node embedding has recently been an active research area, whose purpose is to embed the structural information around a node into a vectorized feature. As such, traditional classification and regression based techniques can be readily reused on graph-structured data. One common technique is the so-called graph neural network (GNN) [40]. In a GNN, node features are aggregated from neighbors, and the same transfer function is often shared among different nodes; as such, the output of a GNN is invariant to permutations of graph elements. There are many follow-up variants, such as the SNDE model [41], which is developed for deep node embedding by jointly exploiting the first-order and second-order proximity. In contrast, there are some shallow embedding models that mostly consider only the structure, rather than the attributes, of nodes and edges, including node2vec [42], inspired by the skip-gram language model [43], DeepWalk [44], based on random walk, as well as LINE [45], which explicitly defines first-order and second-order proximity and builds the corresponding heuristic models. These shallow models can be more scalable on large networks compared with SNDE. Nevertheless, none of these models can be directly used for end-to-end training in graph matching, nor are the embedding models designed for matching, which calls for particular discrimination among nodes.
In this paper, we adopt the graph convolutional network (GCN) [46] to model graph structure, whose parameters are learnable in an end-to-end fashion. Moreover, its output is also invariant to permutations of graph elements, and the model allows for different numbers of nodes per graph during training.

Learning of Combinatorial Optimization
There are emerging works on using deep neural networks to solve combinatorial problems. One advantage is that the learned model is expected to better capture the specific structure of the problem at hand, compared with a general solver that is not tailored to the dataset. Moreover, deep networks can be computationally friendly to GPUs, leading to efficiency in solving optimization problems compared with pure CPU based pipelines. For instance, the NP-hard Travelling Salesman Problem (TSP) is explored in [47] via a graph attention network based method to find a tour. In [48], another classic NP-hard problem, graph coloring, is addressed via deep reinforcement learning, which uncovers new and effective heuristics for graph coloring. In [49], a deep learning based approach is developed which involves permutation invariant objective functions over a set of nodes.
Recall that graph matching bears a combinatorial nature and in general can be formulated as a quadratic assignment problem (QAP). In [50], a QAP solver is learned given a predefined affinity matrix, while in our approach the affinity model is part of the learned components; hence the work [50] can be complementary to ours. On the other hand, as many traditional methods transform the QAP into a linear assignment problem (LAP) in each iteration, learning of linear assignment is also relevant. In particular, it is known that the Sinkhorn algorithm [51] is an approximate and differentiable version of the Hungarian algorithm [52]. Given a predefined assignment cost, an LAP solver is learned by the Sinkhorn Network in [53], whereby a doubly-stochastic constraint is imposed on a non-negative square matrix. Similarly, the Sinkhorn AutoEncoder [54] is devised to minimize the Wasserstein distance in autoencoders. Reinforcement learning is adopted in [55] for learning a linear assignment solver. In DeepPermNet [56], the Sinkhorn layer is employed with a deep convolutional network to solve a permutation prediction problem. However, DeepPermNet predicts a permutation matrix by fully-connected layers and is therefore not invariant to input permutations; hence it needs a predefined node permutation as reference, which is unnatural for graph matching. Moreover, DeepPermNet cannot handle varying numbers of input nodes.
Our approach differs from the above methods in several ways. First our model computes node-wise similarity directly between two graphs, and the output of our network is invariant to input node permutations. Meanwhile, our model involves an affinity learning module to encode the structure affinity into node-wise embeddings. By doing so, graph matching is transformed into a linear assignment task which can be readily solved by the Sinkhorn layer, which can also be called permutation learning.

Notations
From dataset $D$, we consider the matching between a pair of graphs $G_1$ and $G_2$ with $N_1$ and $N_2$ nodes, respectively. Without loss of generality, we assume $N_1 \le N_2$ for simplified notation. Graphs are indexed by subscript $s$ and nodes are indexed by subscript $i$ in the following. The connectivity of graph $s$ is represented by the adjacency matrix $A_s$.
In particular, for graphs built on images, as mainly assumed in this paper, each node corresponds to a 2D pixel coordinate in image $I_s$, denoted as $P_{si} = (x, y)$, while $h^{(k)}_{si}$ represents the node embedding vector of keypoint $i$ at layer $k$ in graph $s$. $M \in \mathbb{R}_+^{N_1 \times N_2}$ is a non-negative affinity matrix encoding the node-to-node similarity between the two graphs. $S \in [0, 1]^{N_1 \times N_2}$ is a so-called doubly (sub-)stochastic matrix, which satisfies $S\mathbf{1} = \mathbf{1}$ and $S^\top \mathbf{1} \le \mathbf{1}$; $S$ denotes the soft matching result of our model. Node-to-node matching is also represented by a binary permutation matrix $X \in \{0, 1\}^{N_1 \times N_2}$.

Approach Overview
We present three deep graph matching methods: i) permutation loss and intra-graph affinity based graph matching (PIA-GM); ii) the permutation loss and cross-graph affinity based model (PCA-GM); and iii) the iterative permutation loss and cross-graph affinity based one (IPCA-GM). All models are built upon a deep network where both image features and structural information are exploited, with a Sinkhorn network predicting the permutation in a differentiable fashion that allows for gradient back-propagation. Among them, PIA-GM and PCA-GM are vanilla feed-forward embedding networks, where PCA-GM adopts an extra cross-graph component that aggregates cross-graph features; IPCA-GM further exploits an iterative update scheme for cross-graph aggregation. Fig. 1a summarizes the vanilla feed-forward methods PIA-GM and PCA-GM, and Fig. 1b shows the iterative update scheme of IPCA-GM.
The proposed models are built on a CNN feature extractor, a graph embedding component, an affinity metric and a permutation prediction layer. From the given node (i.e. keypoint) positions, graph structures are built by Delaunay triangulation. CNN (VGG16 in our model) extracts image features as graph nodes, and they are aggregated through intra-graph and cross-graph embedding layers. The network predicts a permutation for node-to-node correspondence from raw pixel inputs.

Node Feature Extraction on Images
A CNN architecture is adopted to extract features from raw RGB images, where node features are constructed by bi-linear interpolation on the CNN's feature map. The extracted feature at keypoint $P_{si}$ of image $I_s$ is:

$h^{(0)}_{si} = \mathrm{Interp}(P_{si}, \mathrm{CNN}(I_s)),$  (5)

where $\mathrm{Interp}(P, X)$ bi-linearly interpolates at point $P$ from feature map $X$, and $\mathrm{CNN}(I)$ feeds image $I$ into a CNN and outputs a feature tensor. Inspired by the Siamese Network [57], the same CNN structure and weights are shared between the two input images. We extract feature vectors from CNN layers at different depths to fuse both local texture and global semantic features. For fair comparison with [24], VGG16 pretrained on the ImageNet classification task [58] is adopted as the CNN backbone.
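A minimal sketch of the bi-linear interpolation step, assuming a feature map laid out as (H, W, C) and keypoint coordinates already rescaled to feature-map pixels (the actual stride handling of the VGG16 backbone is an implementation detail not specified here):

```python
import numpy as np

def interp_feature(P, fmap):
    """Bi-linearly interpolate a node feature at keypoint P = (x, y)
    from a CNN feature map of shape (H, W, C). Coordinates are assumed
    to already be expressed in feature-map pixels."""
    x, y = P
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    # clamp the upper neighbors so border keypoints stay in bounds
    x1, y1 = min(x0 + 1, fmap.shape[1] - 1), min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * fmap[y0, x0] +
            dx * (1 - dy) * fmap[y0, x1] +
            (1 - dx) * dy * fmap[y1, x0] +
            dx * dy * fmap[y1, x1])
```

A keypoint halfway between two grid cells receives the average of their features, which is what makes the feature extraction differentiable with respect to the feature map.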

Intra-Graph Node Embedding
Researchers have shown that graph matching methods exploiting graph structure can achieve more robust matching [9] than point based methods [6], [7]. In PIA-GM, graph affinity is embedded by multiple embedding layers encoding higher-order information in the graphs. We develop an intra-graph message passing scheme based on the popular embedding approach GCN [46], where features are effectively aggregated from adjacent nodes, including a self-update path for node features:

$m^{(k)}_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} f_{msg}(h^{(k-1)}_j),$  (6)
$n^{(k)}_i = f_{node}(h^{(k-1)}_i),$  (7)
$h^{(k)}_i = f_{update}(m^{(k)}_i, n^{(k)}_i).$  (8)

Here Eq. 6 updates features along edges and $f_{msg}$ denotes the message passing function. As a common practice, we normalize the aggregated features from adjacent nodes by the number of adjacent nodes, to avoid the bias brought by the varying numbers of neighbors owned by different nodes. Eq. 7 updates the feature of each node via a self-passing function $f_{node}$. With $f_{update}$, Eq. 8 accumulates the information from adjacent nodes and the node itself to update the state of node $i$. $f_{msg}, f_{node}, f_{update}$ may take any differentiable form in implementation. In our models, we design single-layer neural networks with ReLU [59] activation for $f_{msg}, f_{node}$, and summation for $f_{update}$. Eqs. 6, 7, 8 are together denoted as one graph convolution (GConv) layer from layer $k-1$ to layer $k$:

$H^{(k)}_s = \mathrm{GConv}(A_s, H^{(k-1)}_s),$  (9)

which represents one node embedding layer. The connectivity of graph $s$ is encoded by the adjacency matrix $A_s \in \{0,1\}^{N \times N}$.
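The message passing, self-update, and summation steps above can be sketched as one NumPy graph-convolution step. The weight shapes and the omission of bias terms are simplifying assumptions for brevity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gconv(A, H, W_msg, W_node):
    """One intra-graph convolution step, a minimal sketch.

    A: (N, N) {0,1} adjacency matrix; H: (N, d) node features.
    W_msg, W_node: (d, d') weights of the single-layer message-passing
    and self-update networks (biases omitted for brevity).
    Messages are averaged over neighbors and summed with the self path.
    """
    deg = A.sum(axis=1, keepdims=True)     # number of neighbors per node
    deg = np.maximum(deg, 1.0)             # guard against isolated nodes
    M = (A @ relu(H @ W_msg)) / deg        # normalized neighbor aggregation
    N_self = relu(H @ W_node)              # self-update path
    return M + N_self                      # update by summation
```

Because the same weights are applied to every node, the layer is shared among nodes and handles graphs of any size, which is the property the approach relies on for varying numbers of nodes.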

Cross-Graph Node Embedding
We improve upon intra-graph embedding with a cross-graph aggregation step, where features are aggregated from those nodes of the other graph that carry similar features. The hope is to better fuse the information between two relevant graphs via joint embedding. First, we compute similarities based on graph features from the shallower embedding layers and utilize a Sinkhorn network to predict a doubly-stochastic similarity matrix (see details in Section 3.7 and Section 3.8).
The predicted cross-graph similarity matrix $\hat{S}$ encodes the node-to-node similarity between the two graphs:

$M_{ij} = f_{aff}(h^{(1)}_{1i}, h^{(1)}_{2j}),$  (10)
$\hat{S} = \mathrm{Sinkhorn}(M),$  (11)

where $f_{aff}$ represents the affinity measure (see Eq. (17)), and the superscript (1) means the features $h^{(1)}_{2j}$ are taken from the output of the first graph convolution layer.
We adopt a message passing scheme similar to the intra-graph convolution introduced in Eq. (9), while the adjacency matrix is replaced by the cross-graph similarity matrix $\hat{S}$, and features are aggregated across the two graphs. Note that $\hat{S}$ is doubly-stochastic, therefore no normalization is needed for cross-graph aggregation. The cross-graph embedding is:

$h^{(k)}_{1i} = f_{update\text{-}cross}\Big(\sum_{j} \hat{S}_{ij}\, f_{msg\text{-}cross}(h^{(k-1)}_{2j}),\ f_{node\text{-}cross}(h^{(k-1)}_{1i})\Big),$  (12)

where $f_{msg\text{-}cross}, f_{node\text{-}cross}$ are implemented as identity mappings, and $f_{update\text{-}cross}$ is a concatenation of the two input feature tensors followed by a linear layer. Introducing learnable parameters to $f_{msg\text{-}cross}, f_{node\text{-}cross}$ may result in training instability, as witnessed in our experiments. For a graph pair, the cross-graph aggregation scheme can be denoted as:

$H^{(k)}_1 = \mathrm{CrossConv}(\hat{S}, H^{(k-1)}_1, H^{(k-1)}_2), \quad H^{(k)}_2 = \mathrm{CrossConv}(\hat{S}^\top, H^{(k-1)}_2, H^{(k-1)}_1),$  (13)

where $\hat{S}$ denotes the predicted correspondence from $G_2$ to $G_1$ and $\hat{S}^\top$ denotes such a relation from $G_1$ to $G_2$. $\hat{S}$ can be regarded as an early-stage prediction of the matching. We summarize this (vanilla) cross-graph embedding approach as $\mathrm{CrossEmb}(\cdot)$ in Algorithm 3.5.
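The concatenation-plus-linear cross-graph update, with the cross message and self paths as identity mappings, can be sketched as follows; the shape of the weight matrix W_cross is an assumption for illustration.

```python
import numpy as np

def cross_graph_embed(H1, H2, S_hat, W_cross):
    """Vanilla cross-graph aggregation, a minimal sketch.

    H1: (N1, d), H2: (N2, d) node embeddings of the two graphs.
    S_hat: (N1, N2) doubly-stochastic similarity matrix from an early
    Sinkhorn pass (hence no extra normalization is needed).
    W_cross: (2d, d) weights of the linear layer applied to the
    concatenation of self features and cross-graph-aggregated features.
    Returns updated embeddings for both graphs.
    """
    new_H1 = np.concatenate([H1, S_hat @ H2], axis=1) @ W_cross
    new_H2 = np.concatenate([H2, S_hat.T @ H1], axis=1) @ W_cross
    return new_H1, new_H2
```

With a perfect soft matching (identity) and averaging weights, the update leaves matched embeddings unchanged, which illustrates why the cross path reinforces consistent features.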

Iterative Cross-Graph Node Embedding
The cross-graph embedding scheme introduced in Algorithm 3.5 considers the cross-graph matching relationship at the shallower embedding layers, which is shown by the experiments in [22] to benefit matching. However, it computes the matching in a single forward pass, which is relatively simple, and its prediction can be further improved. Motivated by the observation that a more precise cross-graph prediction will likely lead to better embedding features, and vice versa, we design and evaluate an iterative update approach for a more accurate prediction of the cross-graph similarity, where the cross-graph similarity matrix $\hat{S}$ is predicted iteratively. The embedding layers (including intra-graph embedding layers) in our iterative cross-graph embedding design are summarized by $\mathrm{IterCrossEmb}(\cdot)$ in Algorithm 3.5. In this alternative design, $\hat{S}^{(0)}$ is initialized as the zero matrix, and we iteratively predict $\hat{S}^{(k)}$ from $\hat{S}^{(k-1)}$ through both cross-graph and intra-graph embedding layers. We then predict a similarity matrix from the embedding outputs, and finally calculate a doubly-stochastic matching matrix $\hat{S}$ by the Sinkhorn algorithm. Note that the predicted matching matrix $\hat{S}$ is regarded as the weighting matrix of the cross-graph update paths. As shown in the experiment part, further improvement is brought by this iterative cross-graph design.
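A simplified sketch of the iterative scheme, alternating cross-graph aggregation with a re-estimated soft matching. For brevity it replaces the learned bi-linear affinity of Eq. (17) with a plain inner-product affinity and folds the embedding layers into a single concatenation-plus-linear update, so it illustrates the control flow rather than the trained model.

```python
import numpy as np

def sinkhorn(M, n_iter=10):
    """Alternate row/column normalization of a positive matrix."""
    S = M.copy()
    for _ in range(n_iter):
        S = S / S.sum(axis=1, keepdims=True)
        S = S / S.sum(axis=0, keepdims=True)
    return S

def iter_cross_embed(H1, H2, W_cross, n_iter=3):
    """Iterative cross-graph embedding, a simplified sketch: alternately
    (i) aggregate features across graphs weighted by the current soft
    matching S_hat, and (ii) re-estimate S_hat from the updated
    embeddings. S_hat starts as the zero matrix, so the first pass
    reduces to a self-feature update."""
    S_hat = np.zeros((H1.shape[0], H2.shape[0]))
    for _ in range(n_iter):
        U1 = np.concatenate([H1, S_hat @ H2], axis=1) @ W_cross
        U2 = np.concatenate([H2, S_hat.T @ H1], axis=1) @ W_cross
        H1, H2 = U1, U2
        S_hat = sinkhorn(np.exp(H1 @ H2.T))  # early-stage soft matching
    return S_hat, H1, H2
```

Each round re-weights the cross-graph update paths by the latest soft matching, which is the mechanism that lets better matchings and better embeddings reinforce each other.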

Affinity Metric Learning
With the proposed embedding model, we encode the structural information of the two graphs into node-to-node similarities in the embedding space. Such an embedding scheme allows us to simplify the traditional second-order affinity matrix $K$ in Eq. (1) into a linear one. The NP-hard problem in quadratic form is thus reduced to a linear one, which is solvable in polynomial time. Let $h_{1i}$ and $h_{2j}$ be the features of node $i$ from the first graph and node $j$ from the second graph, respectively:

$M_{ij} = f_{aff}(h_{1i}, h_{2j}).$  (16)

The node-wise affinity matrix $M \in \mathbb{R}_+^{N_1 \times N_2}$ encodes the node-to-node similarity between the two graphs. $M_{i,j}$ corresponds to the affinity score between node $i$ in the first graph and node $j$ in the second graph, taking into account node features and higher-order information inside and across graphs.
We assign $f_{aff}$ as a weighted bi-linear function followed by an exponential, forcing all elements of the affinity matrix to be positive^2:

$M_{ij} = \exp\Big(\frac{h_{1i}^\top A\, h_{2j}}{\tau}\Big).$  (17)

If the feature is an $m$-dimensional vector, i.e., $h_{1i}, h_{2j} \in \mathbb{R}^{m \times 1}$ for all $i \in V_1, j \in V_2$, then $A \in \mathbb{R}^{m \times m}$ contains the learnable weights of this affinity function. $\tau > 0$ is a regularization parameter controlling how discriminative the affinity function is: as $\tau \to 0^+$, Eq. 17 becomes more discriminative, while there is a higher chance of exploding gradients if $\tau$ is too small.
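The weighted bi-linear affinity with exponential is a one-liner in matrix form; $A$ would be learned in the full model, and the default value of $\tau$ below is only illustrative.

```python
import numpy as np

def affinity_metric(H1, H2, A, tau=0.05):
    """Weighted bi-linear affinity followed by exp:
    M_ij = exp((h_1i^T A h_2j) / tau).
    H1: (N1, m), H2: (N2, m) node embeddings; A: (m, m) weight matrix.
    Smaller tau sharpens M (more discriminative) at the cost of a higher
    risk of exploding gradients during training."""
    return np.exp((H1 @ A @ H2.T) / tau)
```

The exponential guarantees a strictly positive matrix, which is exactly what the subsequent Sinkhorn normalization requires as input.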

Sinkhorn Layer for Linear Assignment
With the node-wise affinity matrix in Eq. (17), the Sinkhorn method is used for linear assignment, where the discrete assignment constraint is relaxed to a doubly-stochastic matrix.
The Sinkhorn layer takes any non-negative square matrix as input and outputs a doubly-stochastic matrix as the predicted matching result. The doubly-stochastic matrix is considered a continuous relaxation of the permutation matrix. Sinkhorn networks have been shown effective for network based permutation prediction [53], [56]. We initialize $M^{(0)} = M$. For $M^{(k-1)} \in \mathbb{R}_+^{N \times N}$, the Sinkhorn operator is:

$M^{(k)\prime} = M^{(k-1)} \oslash (M^{(k-1)} \mathbf{1} \mathbf{1}^\top),$  (18)
$M^{(k)} = M^{(k)\prime} \oslash (\mathbf{1} \mathbf{1}^\top M^{(k)\prime}),$  (19)

where $\oslash$ denotes element-wise division and $\mathbf{1}$ is a column vector of all ones. The Sinkhorn algorithm alternately applies row-normalization (Eq. (18)) and column-normalization (Eq. (19)) until convergence; we find that about 10 iterations often lead to a satisfying result. After passing the linear affinity matrix through the Sinkhorn network, we obtain a doubly-stochastic matrix $S$, which is treated as our model's prediction during training. The Sinkhorn layer is summarized as:

$S = \mathrm{Sinkhorn}(M).$  (20)

If the two graphs are of equal size, i.e., $N_1 = N_2 = N$, then $M \in \mathbb{R}_+^{N \times N}$ and the Sinkhorn algorithm works straightforwardly. If there exist outliers in graph 2, i.e., $N_1 < N_2$, then $M \in \mathbb{R}_+^{N_1 \times N_2}$ is no longer square and cannot be directly handled by the Sinkhorn network. Under such circumstances, we add dummy nodes to $G_1$ by padding $M$ with zeros into an $N_2 \times N_2$ square matrix. After the Sinkhorn step, the padded elements are discarded in the resulting soft matching matrix $S \in [0, 1]^{N_1 \times N_2}$, on which we further apply the loss. Our treatment of outliers is summarized in line 3 of Algorithm 3. Adopting dummy nodes is a common practice when matching graphs against outliers [1], and is also a standard technique in linear programming.
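A sketch of the Sinkhorn layer with dummy-node padding. The small eps used to fill the dummy rows is an implementation assumption introduced here to avoid dividing by an all-zero row sum; it is not specified by the text.

```python
import numpy as np

def sinkhorn_with_dummies(M, n_iter=10, eps=1e-10):
    """Sinkhorn normalization handling outliers (N1 < N2), a sketch:
    pad M with (near-)zero dummy rows into an N2 x N2 square matrix,
    run alternating row/column normalization, then discard the dummy
    rows of the resulting soft matching."""
    n1, n2 = M.shape
    S = np.full((n2, n2), eps)   # dummy rows filled with eps, not 0
    S[:n1, :] = M
    for _ in range(n_iter):
        S = S / S.sum(axis=1, keepdims=True)   # row normalization
        S = S / S.sum(axis=0, keepdims=True)   # column normalization
    return S[:n1, :]                           # drop dummy rows
```

The dummy rows soak up the column mass that would otherwise be forced onto real nodes, so outlier nodes in the larger graph end up weakly matched.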
Note the Sinkhorn net is differentiable, as its operations only involve matrix-vector multiplication and element-wise division, making it appealing for end-to-end pipelines. The backward gradient of the Sinkhorn layer can be derived following [24], where $\text{diag}(\cdot)$ builds a diagonal matrix from the given vector; in practice, Sinkhorn can be efficiently implemented with the automatic differentiation available in PyTorch [60]. During testing, we perform the Hungarian algorithm [61] on $\mathbf{S}$ as a final discretization step to obtain a binary permutation matrix $\mathbf{X}$:

$$\mathbf{X} = \text{Hungarian}(\mathbf{S}).$$

Footnote 2: We have also experimented with other, more flexible fully-connected and attention-like layers, but empirically find the simple exponential function more stable for learning.
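For the discretization step $\mathbf{X} = \text{Hungarian}(\mathbf{S})$, SciPy's `linear_sum_assignment` can serve as a stand-in for the Hungarian solver (a sketch; the actual pipeline may use a different implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian(S):
    """Discretize a doubly-stochastic S into a permutation matrix X
    by maximizing the total matching score."""
    row, col = linear_sum_assignment(-S)  # negate: the solver minimizes cost
    X = np.zeros_like(S)
    X[row, col] = 1
    return X

S = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
X = hungarian(S)   # each row and column contains exactly one 1
```
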

Permutation Cross-Entropy Loss
Our method directly utilizes node-to-node matching labels, i.e., the permutation matrix, as the supervision for end-to-end training. As the matching result computed via the Sinkhorn layer in Eq. (20) is a doubly-stochastic matrix, we propose a linear-assignment-based permutation loss to evaluate the difference between the predicted doubly-stochastic matrix and the ground truth permutation matrix. Our models are trained end-to-end with a cross entropy loss: we take the ground truth permutation matrix $\mathbf{X}^{gt}$ provided by the dataset and compute the cross entropy between $\mathbf{S}$ and $\mathbf{X}^{gt}$. We denote it the permutation loss, which is the supervision adopted to train most of our deep graph matching models:

$$L_{perm} = -\sum_{i \in V_1, j \in V_2} \left( X^{gt}_{ij} \log S_{ij} + \left(1 - X^{gt}_{ij}\right) \log \left(1 - S_{ij}\right) \right).$$

Note GMN [24] applies a "displacement loss" based on pixel offsets. It computes an offset vector $\mathbf{d}_i$ from all matching candidates by a weighted summation, and the offset loss measures the difference between the predicted and ground truth offset vectors:

$$\mathbf{d}_i = \sum_{j \in V_2} S_{ij}\, \mathbf{P}_{2j} - \mathbf{P}_{1i}, \qquad L_{off} = \sum_{i \in V_1} \sqrt{\left\| \mathbf{d}_i - \mathbf{d}_i^{gt} \right\|^2 + \epsilon}, \qquad (26)$$

where $\{\mathbf{P}_{1i}\}, \{\mathbf{P}_{2j}\}$ are the keypoint coordinates in $G_1$ and $G_2$, respectively, and $\epsilon$ is a small value added for numerical stability. In comparison, our cross entropy loss directly learns a linear assignment cost in an end-to-end fashion.
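A minimal sketch of the permutation loss, assuming the standard binary cross entropy form summed over all node pairs (the exact reduction and clamping in the released code may differ):

```python
import numpy as np

def permutation_loss(S, X_gt, eps=1e-8):
    """Cross entropy between the doubly-stochastic prediction S and the
    ground truth permutation X_gt, summed over node pairs."""
    S = np.clip(S, eps, 1 - eps)  # numerical stability near 0 and 1
    return -np.sum(X_gt * np.log(S) + (1 - X_gt) * np.log(1 - S))
```

A perfect prediction yields a near-zero loss, while an uninformative uniform prediction is heavily penalized, reflecting the combinatorial supervision the text describes.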

Further Discussion
Our proposed deep graph matching methods PIA-GM, PCA-GM and IPCA-GM are summarized in Algorithm 3.
Here we further discuss our network design including embedding layers, Sinkhorn network and permutation loss compared to peer methods.

Predefined Affinity Versus Embedding
Existing algorithms for graph matching aim to model second-order [1], [26] and higher-order [3], [5] graph affinity with an explicitly predefined affinity matrix or tensor. An $N_1N_2 \times N_1N_2$ affinity matrix can be adopted to encode node-wise and edge-wise affinity information, and optimization techniques are applied to compute the matching result via graph affinity maximization. In contrast, we resort to the node embedding technique with two merits. First, the space occupation of the affinity matrix is reduced to $N_1 \times N_2$. Second, the embedding layers can implicitly model higher-order features in graphs, while only second-order edge information can be encoded by the affinity matrix $\mathbf{K}$ in Eq. (1).

Sinkhorn Net Versus Spectral Matching
Spectral matching (SM) [26] is adopted by GMN [24]. The SM solver computes the leading eigenvector of $\mathbf{K}$ and supports back-propagation. In contrast, we adopt the Sinkhorn net: the input of Sinkhorn is of size $O(N_1N_2)$, while the input size is $O(N_1^2N_2^2)$ for spectral matching. Meanwhile, SM is observed to take more iterations to converge. We empirically observe there exists an optimal number of iterations, as heavy iteration may negatively affect the accuracy of back propagation. In fact, spectral matching is designed for graph matching, while the Sinkhorn net is for linear assignment, which is relaxed from NP-hard graph matching owing to the embedding layers.

Pixel Offset Loss Versus Permutation Loss
The method GMN [24] adopts a pixel offset loss function named "displacement loss". The loss first computes a weighted sum over all matching candidates, and then computes the so-called offset vector from the source image ($G_1$) to the target image ($G_2$). Under this supervision, GMN learns to minimize the difference between the predicted and ground truth offset vectors on training samples. In comparison, based on the Sinkhorn net, our permutation loss evaluates the cross entropy between the predicted matching and the ground truth matching. The permutation loss directly takes the ground truth matching relationship as supervision and utilizes this information for end-to-end training. The performance gap between the offset loss and the permutation loss becomes more significant in the presence of outliers, as will be shown in the experiments. The permutation loss models the combinatorial nature of graph matching. Fig. 2 presents a real failure case produced by the offset loss: the model makes a poor prediction, yet the offset loss is unreasonably low, whereas the permutation loss provides the correct signal. Experiments also reveal that models trained with our permutation loss surpass models trained with the offset loss in matching accuracy.

EXPERIMENTS
We report performance on both synthetic data and real-image datasets for semantic point matching. State-of-the-art peer methods, including both deep learning [24] and shallow learning [23] approaches, are compared in our experiments. We further show that our methods enjoy robustness against outliers and generalize across datasets and categories. Detailed results are also provided for further insight into our proposed methods. Experiments are conducted on our workstation with an i9-7920X@2.90 GHz CPU and quad RTX2080Ti GPUs. We also provide a project homepage: http://thinklab.sjtu.edu.cn/IPCA_GM.html.

Evaluation Metrics and Peer Methods
We evaluate the matching accuracy between two given graphs $G_1, G_2$, whose sizes are $N_1$ and $N_2$, respectively. We evaluate the models' performance under two configurations: 1) $N_1 = N_2$, so that there exist no outliers between the two graphs and each node in one graph is matched to a node in the other, i.e., a bijection between $G_1$ and $G_2$; and 2) $N_1 \leq N_2$, where we allow outliers in $G_2$. The matching problem then becomes more challenging and more relevant to real-world applications. The model predicts a node-to-node correspondence between the two graphs, represented by a permutation matrix $\mathbf{X}$.
We evaluate the matching accuracy computed from the permutation matrix, as the number of correctly matched node pairs averaged over the total number of ground truth node pairs. For settings with and without outliers, given a predicted permutation matrix $\mathbf{X} \in \{0,1\}^{N_1\times N_2}$ and a ground truth permutation $\mathbf{X}^{gt} \in \{0,1\}^{N_1\times N_2}$, the matching accuracy in category $C$ is computed by

$$\text{acc} = \frac{\sum_{(G_1, G_2) \in C} \langle \mathbf{X}, \mathbf{X}^{gt} \rangle}{\sum_{(G_1, G_2) \in C} \langle \mathbf{X}^{gt}, \mathbf{X}^{gt} \rangle},$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product and graph pairs are sampled from the given category $C$.
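For a single graph pair, the metric reduces to the fraction of correctly matched node pairs; a sketch with our own helper names:

```python
import numpy as np

def matching_accuracy(X, X_gt):
    """Fraction of correctly matched node pairs for one graph pair:
    <X, X_gt> / <X_gt, X_gt>."""
    return float(np.sum(X * X_gt) / np.sum(X_gt))

X_gt = np.eye(3)
X_pred = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0]])   # two of the three nodes are swapped
```
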
When matching graphs from real-world images, each node $i$ of graph $s$ corresponds to a keypoint with position $\mathbf{P}_{si}$. In our experiments, $\mathbf{P}_{si}$ is taken from the dataset's ground truth annotation and fed into the network.
The evaluation involves the following peer methods: GMN. Graph Matching Network (GMN) [24] is the seminal model built on the VGG16 [62] network for image feature extraction. GMN extracts shallower (relu4_2) and deeper (relu5_1) CNN features as node and edge features, respectively. Graph matching is tackled by spectral matching (SM) [26], an unlearnable graph matching solver. This model is agnostic to object categories, meaning a universal model is learned for all instance classes. $G_1$ and $G_2$ are constructed by Delaunay triangulation and as fully-connected, respectively. GMN is the first end-to-end pipeline for deep graph matching incorporating deep CNNs. Note there exists a major difference in that its loss function is the offset based loss given by Eq. (26). We follow [24] and re-implement GMN with PyTorch, as their code was not publicly available at the time of writing. We also modify the raw GMN model by replacing the regression based node position output with the permutation matrix for node matching, in order to make GMN consistent with the graph matching evaluation protocol, i.e., matching accuracy.
HARG-SSVM. This is a graph matching learning method based on structured SVM [23]. We adopt it as the baseline for shallow graph matching learning. HARG-SSVM is class-specific: it learns a graph model for each object category. We use the source code released by the authors upon their approval. The experimental setting in [23] assumes the keypoint positions of the target graph $G_2$ are unknown, and adopts a Hessian detector [63] to produce candidate positions. In our setting, however, all candidate positions are known to the model, so we slightly modify the originally released code: from all candidate points found by the Hessian detector, we assign the nearest neighbor of the ground truth position as the matching candidate. We found that such practice is originally taken during the training process of HARG-SSVM. In experiments, graphs are constructed with handcrafted features, namely HARG.

(Caption of Fig. 2: The offset loss is computed by a weighted sum among all candidates, resulting in a misleadingly low loss of 0.070. In this example the offset loss fails to distinguish between left/right ears, and such information will not be learned under its supervision. Our permutation loss, on the contrary, issues a reasonably high loss of 5.139. The underlying rationale is that the problem at hand is fundamentally a combinatorial problem rather than a regression task.)
PIA/PCA/IPCA-GM. We build our models on VGG16 [62] as the CNN backbone, and features are extracted from relu4_2 and relu5_1 in line with [24]. We concatenate these features to fuse both local and global information. In PIA-GM, we build 3 intra-graph embedding layers on top of VGG16, while in PCA-GM the embedding module contains, sequentially, 1 intra layer, 1 cross layer and 1 intra layer. IPCA-GM is built in line with PCA-GM, while the cross-graph similarity matrix is updated iteratively. All methods contain an affinity mapping as in Eq. (17). For natural image tests, we randomly draw image pairs from the dataset and build two graphs from the given keypoints. Graph structures are agnostic to the model, and graphs are constructed according to different methods (see the discussions above). The CNN model is pretrained on the ImageNet [58] classification task with 21,841 subcategories and 14 million images, downloaded from the PyTorch model zoo.

Synthetic Keypoint Matching
We first evaluate the models on generated synthetic graphs, following the protocol of [1]. Graphs are generated with a given inlier keypoint number $K_{in}$, and each inlier is assigned a 1024-dimensional random vector simulating the CNN feature. Learning-free peer methods SM [26], BPF-G [32] and SK-JA [33] are evaluated with respect to $K_{in}$, $\sigma_{feat}$ and $K_{out}$. For each trial, we generate 300 random graphs (200 for training and 100 for testing) from the same distribution. We conduct 10 trials under each setting and report the averaged accuracy. All generated samples are cached and shared among all methods. Fig. 3 contains our experimental results showing the robustness of PCA-GM against feature deformation and complicated graph structure. Note that compared to GMN-PL supervised by the permutation loss, the performance of GMN supervised by the offset loss degenerates significantly as the number of outliers increases. The learning based methods also surpass learning-free baselines [26], [32], [33].
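A toy version of this generation protocol can be sketched as follows (our own simplification; the exact feature distributions and graph construction in [1] may differ): each inlier carries a random feature, graph 2 perturbs these features with Gaussian noise of scale $\sigma_{feat}$, appends $K_{out}$ outliers, and shuffles node order.

```python
import numpy as np

def synthetic_pair(k_in=10, k_out=0, sigma_feat=0.1, dim=1024, seed=0):
    """Generate one synthetic graph pair: graph-1 features, shuffled
    graph-2 features (inliers deformed, outliers appended), and the
    ground truth assignment X_gt of shape (k_in, k_in + k_out)."""
    rng = np.random.default_rng(seed)
    feat1 = rng.uniform(size=(k_in, dim))                       # graph 1 inliers
    feat2 = feat1 + rng.normal(scale=sigma_feat, size=(k_in, dim))
    feat2 = np.concatenate([feat2, rng.uniform(size=(k_out, dim))])
    perm = rng.permutation(k_in + k_out)                        # shuffle graph 2
    X_gt = np.zeros((k_in, k_in + k_out))
    X_gt[np.arange(k_in), np.argsort(perm)[:k_in]] = 1          # ground truth
    return feat1, feat2[perm], X_gt
```
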

Pascal VOC Keypoints Matching
We experiment on the large-scale real image matching dataset Pascal VOC [64] with additional keypoint annotations from [65], namely Pascal VOC Keypoint. It contains 20 classes of instances with ground truth keypoint positions. Following [24], we filter the original dataset into 7,020 training images and 1,682 testing images. Note that the number of training samples grows combinatorially w.r.t. the number of images, since we can iterate over all combinations of image pairs. During the experiment, we randomly draw image pairs from the same category. All instances are cropped around their bounding boxes and resized to $256 \times 256$ before being passed to the network. We perform experiments under two settings on Pascal VOC Keypoint: 1) a simplified setting $N_1 = N_2$ without outliers, where we keep only the keypoints co-existent in both images (i.e., inliers); 2) a more challenging setting $N_1 \leq N_2$, where $G_1$ contains only inliers but outlier keypoints are not filtered from $G_2$. Problems with $N_1 < 3$ are not considered in either setting since they are relatively easy to solve. We regard Pascal VOC Keypoint as challenging under both settings, as objects vary greatly in pose, texture and illumination, and the number of labeled keypoints ranges from 6 to 23. We test on Pascal VOC Keypoint [65] and report performance on the 20 Pascal categories, comparing GMN, GMN-PL, PIA-GM, PCA-GM and IPCA-GM, with detailed results in Table 1 without (white background) and with (gray background) outliers. Our proposed methods PIA-GM, PCA-GM and IPCA-GM surpass competing methods in most categories, especially in mean accuracy over the 20 categories, and they are robust in the presence of outliers (see the detailed discussion in Section 4.4.1). With dual RTX2080Ti GPUs, PCA-GM runs at around 18 pairs per second and IPCA-GM at about 16 during training.
The result shows the superiority of the combinatorial permutation loss over offset loss in training, embedding and Sinkhorn over fixed SM [26] in affinity modeling, and cross-graph embedding, especially iterative cross-graph embedding over intra-graph embedding in the embedding module.

Transfer Learning Experiments
As summarized in Fig. 4, the robustness and generalization capability of our methods are tested from the following aspects: transfer learning where outlier-free training data is used and the testing data comes from the same dataset (but different samples) with outliers added (Fig. 4a, green); transfer learning across different datasets (Fig. 4a, red); generalization when training and testing instances are from different categories (Fig. 4b); and finally the generalization behavior of the learned CNN and embedding modules (Fig. 4a, blue and purple).

Transfer Learning in the Presence of Outliers
We examine the transfer learning capability of our proposed methods in the presence of outliers. As illustrated by the green dashed arrow in Fig. 4a, we first pretrain models on Pascal VOC Keypoint without outliers and directly apply them to the dataset with outliers. These models are later finetuned on Pascal VOC Keypoint with outliers. The experimental results reported in Table 2 show the superiority of our proposed methods in transfer learning against outliers. Direct learning with outliers seems challenging for IPCA-GM; one possible reason is that the predicted cross-graph similarity matrix may be sub-stochastic in the presence of outliers (the column sums may be less than 1), misleading the iterative cross-graph update scheme at the early training stage. PIA-GM also fails to converge when directly trained with outliers, which may be caused by its limited model capacity. In contrast, transfer learning from outlier-free models results in better performance and more stable training. The outlier results in Table 1 are based on transfer learning.

Cross-Dataset Transfer With Willow ObjectClass
Knowledge transfer between different datasets is experimented on the Willow ObjectClass dataset, a real image matching dataset collected by [23]. We follow the source code released by the authors of [23], with which HARG-SSVM is trained and evaluated. Following the red dashed arrow in Fig. 4a, cross-dataset generalization capability is evaluated for the other competing methods. For deep graph matching models, the weights are first initialized on a slightly modified Pascal VOC Keypoint dataset, from which we remove all VOC 2007 car and motorbike images. These transferred models are denoted as GMN-VOC, PCA-GM-VOC and IPCA-GM-VOC. We further fine-tune them on the Willow dataset, namely GMN-Willow, PCA-GM-Willow and IPCA-GM-Willow, respectively. Note that HARG-SSVM is class-specific, learning a specific model for each class, while GMN, PCA-GM and IPCA-GM are class-agnostic, learning one unified model for all classes. We only report cross-dataset transfer learning results because Willow ObjectClass is too small to train a deep graph matching model from scratch. Table 3 demonstrates that our proposed PCA-GM and IPCA-GM show superior transfer learning capability across different datasets, outperforming all competing methods in all categories of the Willow ObjectClass dataset.

(Caption of Table 1: After replacing the offset loss with the permutation loss, GMN-PL outperforms GMN [24] in almost all categories. PIA-GM surpasses the GMN baselines, while both PCA-GM and IPCA-GM further boost the matching accuracy, by utilizing (vanilla) cross-graph embedding and iterative cross-graph embedding, respectively. Results with white background are obtained without outliers, and those with gray background are with outliers. Note that we report improved results for PCA-GM compared to [22] by changing the hyper-parameter $\tau$ from 0.005 to 0.05.)

Cross-Category Generalization Capability
To test the generalization behavior of our model among different categories, we train IPCA-GM, PCA-GM, PCA-GM-OL, GMN-PL and GMN on eight arbitrarily selected categories in Pascal VOC Keypoint and report testing results on each category, as shown in Fig. 5. The experimental setup is illustrated in Fig. 4b. We follow the train/test split provided by the benchmark for each category. Cross-category test results are plotted via confusion matrices (the y-axis stands for the training category and the x-axis for the testing category), with blue color denoting relative accuracy and orange color denoting the ranking among the 5 models on each cell. Our learned embedding models generalize soundly to unseen similar categories, such as between cat and dog. Embedding based models generalize better to unseen categories on off-diagonal cells, while the permutation loss offers better supervision on diagonal elements. We notice that IPCA-GM ranks comparably with PCA-GM on diagonal cells, while PCA-GM and PCA-GM-OL generalize better on off-diagonal cells. A possible explanation is that IPCA-GM has higher model capacity, therefore it fits its training category better and achieves higher accuracy on the diagonal elements of the confusion matrix.

(Caption of Fig. 5: Two rows of confusion matrices are plotted in parallel for better illustration. For the blue matrices, the color map stands for accuracy normalized by the highest accuracy in each column; it does not denote the absolute accuracy across categories and matrices. For the orange matrices, we plot the ranking of the accuracy of each cell among all 5 confusion matrices, with darker color corresponding to higher ranking. We also report the diagonal and overall accuracy for each confusion matrix, shown in brackets on top of the blue matrices (best viewed in color; zoom in for better view).)

Generalization of CNN and Embedding Modules
Our deep graph matching pipeline jointly learns image features (with the CNN) and graph features (with the embedding). We conduct a further study on how each module generalizes when combined with other modules and how the two couple with each other.

In this experiment, we decompose the learned PCA-GM into two parts, a CNN backbone (denoted as "DGM") tuned on the matching dataset at hand and the embedding component, and plug them, respectively, into the existing graph matching solver RRWM [1] and a vanilla VGG16 pretrained on ImageNet classification. In Fig. 4a, the blue arrows represent "ImgNet+Embed" and the purple arrow represents "DGM+RRWM".

The experimental results in Table 4 show that the learned CNN module generalizes soundly after replacing the embedding module with RRWM, especially on categories such as bike, chair and table. In comparison, the learned embedding module seems more tightly coupled with the learned CNN. However, the "ImgNet+Embed" configuration outperforms "ImgNet+RRWM" in all categories, and in some categories "ImgNet+Embed" performs comparably with "DGM+RRWM". Therefore, the learned embedding module remains meaningful when transferred to a different CNN backbone.

Model Design Details
There are a few hyper-parameters to tune, including the number of embedding layers, the regularization factor $\tau$ of the affinity metric, and the number of cross-graph iterations in IPCA-GM. The parameter configuration is determined by the averaged accuracy on the Pascal VOC Keypoint dataset without outliers, and applied to all datasets under all settings. For the number of embedding layers, we test PCA-GM and PIA-GM with varying configurations, and find PIA-GM insensitive to the number of layers while PCA-GM performs best with 3 layers, as shown in Table 5. Introducing more than one cross-graph layer also makes the model fail to converge. A possible explanation is that deep graph convolutional nets may suffer from over-smoothing [66], especially for our cross-graph convolution. We keep all networks at 3 embedding layers for fair comparison. For the hyper-parameters, we perform grid search over $\tau \in \{0.5, 0.05, 0.005, 0.0005\}$ and the number of cross-graph iterations $\in \{2, 3, 4, 5\}$, adopting the best-performing $\tau = 0.005$ with 3 iterations for IPCA-GM, and $\tau = 0.05$ for PCA-GM and PIA-GM under all settings. We regard this parameter setting a suitable choice for various scenes and graph sizes, as the Pascal VOC Keypoint dataset contains 20 categories with various graph sizes. As shown in Table 6, the more complicated IPCA-GM seems more sensitive to $\tau$ than PCA-GM and PIA-GM, since selecting a suitable discriminative level of the affinity metric is crucial at early iterations of IPCA-GM. There is also a higher chance of gradient explosion if $\tau$ is too small, causing PIA-GM and IPCA-GM to fail to converge with $\tau = 0.0005$.

IPCA-GM Components
We conduct an ablation study by enabling/disabling different IPCA-GM components and report results in Table 7. The study reveals that all components have a positive effect on matching accuracy. We initialize VGG16 with weights pretrained on ImageNet classification, the embedding layers with random weights drawn from a uniform distribution, and the weighting matrix of the affinity function with an identity matrix perturbed by uniform random noise.

Cross-Graph Component Design
The cross-graph similarity matrix $\hat{\mathbf{S}}^{(0)}$ in the iterative cross-graph update of Algorithm 3.5 is initialized as a zero matrix. We experiment with a family of models where $\hat{\mathbf{S}}^{(0)}$ is initialized by the output of the first intra-graph embedding layer $\mathbf{h}^{(1)}_{1i}, \mathbf{h}^{(1)}_{2j}$, namely "Intra$_1$-IPCA-GM", and we find they do not perform comparably with our zero-initialization design. We also test with $f_{msg\text{-}cross}$ in Eqs. (12), (13) implemented by a single fully-connected layer followed by ReLU activation; however, the model fails to converge, which may be due to too much complexity in the cross-graph layer. As shown in Table 8, too many cross-graph iterations (e.g., 4, 5) degenerate the model's performance, even causing it to fail to converge. A possible explanation is that the heavy iteration adopted (including the iterative cross-graph update and the Sinkhorn iterations) may cause instability in the backward gradient computation. We empirically find that 3 iterations for IPCA-GM, as introduced in Algorithm 3.5, performs best among all methods, so we stick to this design in this paper. The convergence of the iterative cross-graph embedding in IPCA-GM is also studied: in experiments, the predicted cross-graph similarity matrix $\hat{\mathbf{S}}^{(k)}$ gradually converges as the number of iterations $k$ grows. Adopting the IPCA-GM model weights reported in Table 1, we report its test results with 5, 10 and 20 cross-graph iterations in Table 9. Interestingly, the performance does not vary much compared to the original configuration, so we stick with 3 iterations for cost-efficiency.

(Caption of Table 4: For column "CNN", "DGM" indicates the CNN module learned with PCA-GM and "ImgNet" means the CNN is only pretrained on ImageNet. For column "graph", "Emb" corresponds to the learned cross-graph embedding in PCA-GM while "RRWM" means the existing RRWM solver [1]. The first row is identical to the PCA-GM baseline in Table 1.)
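The zero-initialized iterative scheme can be illustrated with a heavily simplified toy sketch (our own construction, not the paper's learned layers: the additive feature mixing below is a crude stand-in for the learned $f_{msg\text{-}cross}$ of Eqs. (12), (13)):

```python
import numpy as np

def sinkhorn(M, n_iters=10):
    """Alternating row/column normalization toward doubly-stochastic."""
    S = M.copy()
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)
        S = S / S.sum(axis=0, keepdims=True)
    return S

def iterative_cross_graph(h1, h2, A, tau=1.0, n_iters=3):
    """Toy iterative cross-graph update: S_hat starts as a zero matrix
    and is refined by alternating cross-graph feature mixing and a
    Sinkhorn projection (3 iterations, matching the paper's choice)."""
    S_hat = np.zeros((h1.shape[0], h2.shape[0]))
    for _ in range(n_iters):
        g1 = h1 + S_hat @ h2              # graph-2 features mixed into graph 1
        g2 = h2 + S_hat.T @ h1            # and vice versa
        M = np.exp(g1 @ A @ g2.T / tau)   # affinity as in Eq. (17)
        S_hat = sinkhorn(M)
    return S_hat
```
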
Compared to Algorithm 3.5, the iterative update scheme discussed in Section 4.5 of [22] can be improved. It becomes identical to Algorithm 3.5 if we assign $\{\mathbf{h}^{(3)}_{1i}\}, \{\mathbf{h}^{(3)}_{2j}\}$ to $\{\mathbf{h}^{(1)}_{1i}\}, \{\mathbf{h}^{(1)}_{2j}\}$ after L2 inside the loop. The original scheme assumes features in the first and the last embedding layers lie in the same feature space, which contradicts the different feature mapping functions of different embedding layers. Therefore, the iterative update scheme in Section 4.5 of [22] performs poorly in experiments.
Recently, another differentiable replacement of the Sinkhorn module for the linear assignment problem was proposed by [67]. However, it suffers from slower convergence than the Sinkhorn algorithm under our deep graph matching setting: we empirically find that [67] needs around 200 inner loops and 100 outer loops to converge, whereas the Sinkhorn algorithm requires only about 10 iterations. Thus we stick to the Sinkhorn algorithm in our model design.

CONCLUSION AND FUTURE WORK
In this paper, we have presented a novel deep graph matching pipeline, which parameterizes the graph matching affinity with a deep CNN and novel embedding layers. To model the arbitrary transformation between two graphs, a permutation loss is proposed as the learning objective. Extensive experimental results, including an ablation study on the proposed components and comparisons with peer methods, demonstrate that our methods achieve state-of-the-art matching accuracy, robustness against outliers, and generalization capability across different datasets and categories. Future work may explore semi-supervised and unsupervised settings by incorporating cycle consistency over multiple graphs.

(Caption of Table 9: The model weights are all from the learned IPCA-GM model reported in Table 1, trained with 3 iterations.)