Divide and Conquer: Subset Matching for Scene Graph Generation in Complex Scenes

The goal of scene graph generation (SGG) is to classify objects and their pair-wise relationships in a visual scene. Object occlusion is a critical challenge when generating scene graphs in complex scenes. However, this issue has rarely been explored in recent works. Accordingly, in this paper, we propose a subset matching network (SM-Net) that handles the above problem. First, we decompose SGG into two types of subset matching problems: node subset matching and edge subset matching. Each node/edge subset handles the occlusion between one node/edge pair, thereby reducing the difficulty of SGG in a "divide and conquer" manner. Second, we introduce a node subset prediction module that utilizes a subset-based message passing module to refine the node subset representation and a matching loss to supervise node subset prediction. Third, we propose an edge subset prediction module that applies a feature selection-based fusion function to obtain edge subset features and a matching loss to supervise edge subset predictions. Experiments on three popular datasets show that our model achieves state-of-the-art performance. The code of SM-Net will be released.


I. INTRODUCTION
SCENE graph generation (SGG) aims to provide a graph-based representation for a given image. A scene graph consists of objects as nodes and relationships as edges. As illustrated in the dotted box of Figure 1, two objects and a relationship form a subject-predicate-object triplet. Specifically, an object is denoted as a node with a category label, and the relationship between two nodes is characterized by a directed edge between them. The edge has a specific predicate category and points from the subject towards the object. Recent works have shown the benefits of SGG for object detection [1], semantic segmentation [2], and high-level vision tasks, e.g., 3D scene understanding [3], visual question answering [4], and image captioning [5].
Despite the effectiveness of existing SGG works, the application of SGG in complex scenes involving many objects remains a challenging task. Specifically, occlusion between objects is more likely to occur when their number increases. Object occlusion usually results in a large intersection over union (IoU) between the bounding boxes of nearby nodes. As shown in Figure 1(a), a large proportion of triplets in existing datasets [6]- [8] suffer from the occlusion problem.
Unfortunately, existing SGG works [9]-[14] typically make predictions for each element in the scene graph independently, without considering the challenges that arise due to occlusion. Consequently, as shown in Figure 1(b), these SGG methods [9]-[14] suffer from performance degradation as the occlusion level increases. More specifically, this is because occlusion reduces the difference between the appearances of nearby nodes, making the prediction of both object and relationship categories more challenging. As illustrated in Figure 1(c), the bounding boxes of fork and knife are highly overlapped, so it is hard to distinguish between these two nodes. Moreover, as shown in Figure 1(d), because the bounding boxes of man and boy have a large IoU, it is difficult to predict their related relationships (e.g., man-ride-bike and boy-sit-on-bike).
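For concreteness, the occlusion level between two bounding boxes can be quantified with the standard IoU computation. The helper below is an illustrative sketch; the function name and the `(x1, y1, x2, y2)` box convention are our own assumptions, not specified by the paper:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero when the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A high IoU between two proposals is what this paper treats as heavy occlusion.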
We discover that grouping occluded nodes/edges into a subset could aid the network in learning which types of nodes/edges frequently appear together, thus improving prediction performance for occluded nodes/edges. Accordingly, we propose a subset matching network (SM-Net) that jointly predicts the categories of nodes and edges in a scene graph rather than predicting each of them individually. In more detail, we divide the SGG task into two types of subset matching problems: node subset matching and edge subset matching. For the former problem, we propose a node subset prediction (NSP) module to mitigate the influence of occlusion on node prediction. It consists of two components: First, a subset-based message-passing module is applied to refine the node subset representation using subset-specific contextual information. Second, a loss is utilized to minimize the matching cost between the predicted and ground-truth node subsets. For the edge subset matching problem, we propose an edge subset prediction (ESP) module to alleviate the impact of occlusion on edge prediction. It consists of two parts: First, a subset-based edge fusion function is applied to obtain the edge subset representation. Second, a matching loss is utilized to penalize the difference between the predicted and ground-truth edge subsets.
In summary, the innovation of the proposed SM-Net is three-fold: (1) the SGG problem is decomposed into node subset matching and edge subset matching problems; (2) an NSP module utilizes subset-based message passing to refine the node subset representation and a matching loss to minimize the gap between the node subset predictions and their ground-truths; and (3) an ESP module applies a subset-based edge fusion function to obtain the edge subset representation and a matching loss to minimize the gap between the edge subset predictions and their ground-truths. The effectiveness of SM-Net is systematically evaluated on three popular SGG datasets: Visual Genome (VG) [6], OpenImages (OI) [8], and Visual Relationship Detection (VRD) [7]. Experimental results demonstrate that the proposed SM-Net consistently outperforms state-of-the-art methods.
The remainder of this paper is organized as follows: First, related works on SGG and graph similarity learning are briefly reviewed in Section II. Second, the SM-Net problem formulation and model structure are presented in Sections III and IV. Third, detailed experiments are conducted and analyzed in Section V. Finally, we conclude this paper in Section VI.

II. RELATED WORKS

A. SCENE GRAPH GENERATION
Existing SGG models comprise two key stages: object detection and relationship classification. The first of these stages detects objects using models such as Faster R-CNN [24]. Subsequently, using the obtained object categories and features, the second stage recognizes the relationship category for each object pair. Depending on the method utilized to obtain the context information, existing works can be grouped into the recurrent neural network (RNN)-based methods and the graph convolutional network (GCN)-based methods.
Methods in the first category adopt RNNs to encode contextual information for each object. For example, Xu et al. [9] adopted a gated recurrent unit approach to refine messages passed between nodes in a fully connected graph composed of all objects in the image. Zellers et al. [10] obtained the contextual information via a bi-directional long short-term memory network that arranges the objects in a chain structure. Moreover, to capture the dynamic nature of visual contexts, Tang et al. [11] proposed a tree structure that is optimized by a reinforcement learning-based method.
The second category of methods utilizes GCNs for message passing. For example, Li et al. [12] employed a bottom-up clustering method to factorize the entire graph into subgraphs; the features of objects and subgraphs are then refined through a spatially weighted message-passing module based on GCN. In [13], a relation proposal network is proposed to measure relatedness scores between object pairs and prune unlikely relations. A graph attention network is also introduced to propagate higher-order contextual information throughout the graph, so the representation of each object or relationship is updated based on its neighbors. Considering that edge direction may influence contextual information, Lin et al. [14] proposed a direction-aware message-passing (DMP) module, which adopts a tri-linear model to encode edge information into an attention map that guides message passing. However, the above message-passing methods operate on individual objects, which may not work well for objects suffering from heavy occlusion. As discussed in Section I, occluded objects usually have similar visual appearances; accordingly, message-passing modules will encode similar contextual information for these objects, achieving only limited benefit for occluded object prediction. In a departure from existing work, we group occluded objects into subsets and then refine their features via a subset-based message-passing method. Subsequently, we jointly predict the object categories in each subset.

B. GRAPH SIMILARITY LEARNING
Over the past few decades, many techniques have been developed for studying the similarity of graphs. In the early stages, multiple graph similarity metrics were defined, including Graph Edit Distance [15], Maximum Common Subgraph [16], and Graph Isomorphism [17], to address the graph similarity search and graph matching problems. However, the computation of these metrics is generally an NP-complete problem [19]; therefore, these approaches are only feasible for graphs of relatively small sizes. More recently, graph similarity learning has also been explored for applications in computer vision. In [20], context-dependent graph kernels are proposed to measure the similarity between graphs for human action recognition in video sequences. In [21], a deep model called Neural Graph Matching Network is introduced for the 3D action recognition problem in the few-shot learning setting. Under this approach, interaction graphs are constructed from 3D scenes; here, the nodes represent physical entities in the scene, while edges represent interactions between the entities.
In this paper, we reduce the complexity of graph similarity measurement by decomposing the entire node and edge sets of a scene graph into subsets. Besides, current works focus primarily on measuring the similarity of unlabeled graphs whose nodes and edges do not have category labels [22]. In comparison, the nodes and edges in a scene graph have specific categories and locations. Therefore, we jointly consider bounding box regression and node/edge classification in the matching loss.

III. MOTIVATION AND PROBLEM FORMULATION
Existing works [9], [11], [14] typically classify each element of a scene graph independently. However, this strategy may not be suitable for generating scene graphs in complex scenes. As illustrated in Figure 1, the occlusion issue, which frequently occurs in complex scenes, may cause errors during node and edge prediction. Therefore, it is more appropriate to jointly predict their categories as a set. Accordingly, we formulate the SGG problem as two set matching problems, i.e., node set matching and edge set matching, as follows:

min D(Ô, O) + D(R̂, R),   (1)

where D represents a metric measuring the distance between the predicted and ground-truth sets. O and R stand for the ground-truth node set and edge set, respectively, with Ô and R̂ being their predictions. Eq. (1) aims to make predictions for the entire node set and the entire edge set in the scene graph, respectively. To solve Eq. (1), traditional set prediction methods mainly utilize the Hungarian algorithm [44] to match the predicted set with its ground-truth. However, the complexity of the Hungarian algorithm grows polynomially with the number of elements in the set. Moreover, as mentioned in [45], a large number of training epochs is required to optimize Eq. (1). Therefore, it is necessary to explore approximations. One intuitive way of approximating Eq. (1) is to decompose it into several subset matching problems. As illustrated in Figure 2, we regard two nodes whose bounding box IoU exceeds 0.5 as a crowded node subset O_c. Specifically, node pairs are considered in descending order of IoU, and each node has only one chance to be selected when building node subsets. Each of the remaining nodes, which have small IoU with all others, is duplicated to build a sparse node subset O_s. Similarly, we categorize all edges with the same subject into four types of edge subsets.
(1) R_c: the edges connecting the two nodes within one crowded node subset; (2) R_cs: the edges connecting one crowded node subset and one sparse node subset; (3) R_cc: the edges connecting two crowded node subsets; (4) R_ss: the edges connecting two sparse node subsets. Note that for R_cs, the edges with the same object are also unified into one edge subset. More details can be found in Figure 2.
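The node subset construction described above (pair nodes in descending IoU order, each node selected at most once, leftover nodes duplicated into sparse subsets) can be sketched as follows. The function name, the 0.5 threshold default, and the return format are illustrative assumptions, not the paper's implementation:

```python
def build_node_subsets(boxes, thresh=0.5):
    """Greedily pair nodes whose IoU exceeds `thresh` into crowded subsets O_c.

    Pairs are considered in descending IoU order; each node may join at most
    one crowded subset. Every remaining node is duplicated into a sparse
    subset O_s, so each subset holds exactly two (possibly identical) indices.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    n = len(boxes)
    # All candidate pairs, sorted by descending IoU.
    pairs = sorted(((iou(boxes[i], boxes[j]), i, j)
                    for i in range(n) for j in range(i + 1, n)), reverse=True)
    used, crowded = set(), []
    for v, i, j in pairs:
        if v > thresh and i not in used and j not in used:
            crowded.append((i, j))
            used.update((i, j))
    sparse = [(i, i) for i in range(n) if i not in used]
    return crowded, sparse
```

The four edge-subset types then follow directly from which kind of subset (crowded or sparse) each endpoint belongs to.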
Finally, using the above decomposed node and edge subsets, we can efficiently approximate Eq. (1) as follows:

min Σ_{n=1}^{N} D(Ô_n, O_n) + Σ_{q=1}^{Q} D(R̂_q, R_q),   (2)

where Ô_n and R̂_q denote the predicted node and edge subsets, respectively; O_n and R_q represent the ground-truth node and edge subsets, respectively; and N and Q stand for the number of subsets in the entire node set O and edge set R, respectively.

A. OVERVIEW OF SM-NET

Figure 2 illustrates the framework of SM-Net. SM-Net employs Faster R-CNN [24] to obtain object proposals. It then improves the performance of SGG through the application of two novel modules: (1) an NSP module that refines the node subset feature with subset-specific contextual information and applies the node subset matching loss to penalize the gap between the predicted and ground-truth node subsets; (2) an ESP module that obtains the edge subset feature using the subset-based edge fusion function and employs the edge subset matching loss to minimize the matching cost between the predicted and ground-truth edge subsets.

We adopt exactly the same method used in [10] to obtain features for each proposal: these include the appearance feature, the vector of object class prediction scores, and the relative coordinates of the proposal's bounding box in the image, which are represented as v_i ∈ R^512, s_i ∈ R^{C_o}, and b_i ∈ R^4 for the i-th node, respectively. C_o denotes the number of object categories (including the background, i.e., no object). Moreover, we extract the appearance feature from the union area of the two nodes i and j in the n-th node subset, denoted as v^u_ij ∈ R^512. Following [10], we further extract the spatial feature, denoted as B_ij ∈ R^512, for the two nodes i and j to describe their geometric relation; similarly, the relative location features for the union area of the n-th and m-th node subsets are represented as B_nm ∈ R^512. In more detail, the appearance features (i.e., v_i and v^u_ij) are obtained using the RoIAlign layer, followed by two successive FC layers in Faster R-CNN [24]. We model the spatial feature B_ij using binary maps of size 14 × 14 × 2, with each channel representing the area of one node. Next, we apply two convolutional layers and two FC layers to the binary maps, finally obtaining a 512-dimensional representation. B_nm is obtained in the same way as B_ij, with each binary channel representing the area of a single node subset.
In this section, we will sequentially introduce the two modules of SM-Net. First, we introduce the node subset prediction (NSP) module in Section IV-B, then present the edge subset prediction (ESP) module in Section IV-C. Finally, the overall loss function and the SGG inference method are described in Section IV-D.

B. NODE SUBSET PREDICTION
NSP jointly predicts the two nodes in each node subset. Its predictions for the n-th node subset are denoted as follows:

Ô_n = {(ô^(1)_n, b̂^(1)_n), (ô^(2)_n, b̂^(2)_n)},   (3)

where ô^(1)_n and ô^(2)_n ∈ R^{C_o} represent the object class predictions for the two nodes, respectively. Moreover, b̂^(1)_n and b̂^(2)_n ∈ R^4 denote the bounding box coordinates that associate each predicted object label with one of the nodes. NSP contains two main components: Subset-based Message Passing and Node Matching Loss.

Subset-based Message Passing. Existing message-passing methods aim to refine the node representation with contextual information. Taking the GCN-based SGG models [13], [14], [30], [58] as an example, the feature for the i-th node is refined by the surrounding nodes as follows:

x̂_i = W_t1 x_i + Σ_{j∈N_i} α(x_i, x_j) W_t2 x_j,   (4)

where N_i denotes the set of nodes that are neighbors of the i-th one, and W_t1 and W_t2 are projection matrices. α(x_i, x_j) computes the attention weight between nodes i and j. One popular way of computing this weight is as follows [13], [27]:

α(x_i, x_j) = softmax_j(w^T [x_i; x_j]),   (5)

where w is a projection vector and [;] represents the concatenation operation. Due to their highly similar appearance features, two nearby nodes tend to obtain relatively high attention weights regardless of whether their object categories are related. Consequently, it may still be hard to distinguish between spatially proximate nodes through the above message-passing model. To deal with this issue, we propose a subset-based message-passing (SMP) module to refine the features of node subsets. Firstly, we merge the multi-modal features of two nodes to represent their subset. Specifically, the feature for the n-th node subset, which includes nodes i and j, can be obtained as follows:

u_n = (W_u1 [v_i; v_j]) ⊙ (W_u2 [s_i; s_j]) ⊙ (W_u3 v^u_ij),   (6)

where ⊙ represents the Hadamard product, and W_u1 ∈ R^{512×1024}, W_u2 ∈ R^{512×2C_o}, and W_u3 ∈ R^{512×512} are projection matrices. Secondly, for the n-th node subset, we aggregate all its related pairwise messages as c_n.
More specifically, c_n is obtained via the following three steps: (1) We calculate the message sent from the m-th node subset as follows:

z_nm = (W_z1 u_n) ⊙ (W_z2 u_m) + W_z3 B_nm,   (7)

where W_z1, W_z2, and W_z3 ∈ R^{512×512} are learnable projection matrices. The first term on the right-hand side aims to encode the correlations between the visual features of the two node subsets, while the second term learns their relative spatial relation. (2) We stack all N−1 messages to create a matrix Z_n ∈ R^{512×(N−1)}. (3) We apply row-wise max-pooling to Z_n and obtain the overall contextual information c_n. Finally, we update the feature for the n-th node subset as follows:

û_n = u_n + c_n.   (8)

Compared with the existing message-passing mechanisms [13], [14], [30] defined in Eq. (4), our SMP has three main advantages. First, two nearby nodes with highly similar appearance features are unified into one node subset and will not receive messages from each other. Second, z_nm is able to store more correlation information than a scalar weight. Third, as revealed in [62], the row-wise max-pooling on Z_n could reduce the noise that is induced by assuming that all node subsets interact with each other.

Node Matching Loss. Having obtained the refined representation for each node subset using SMP, we utilize two FC layers to make predictions for the two nodes in each subset as follows:

[ô^(1)_n; b̂^(1)_n] = W_a û_n,   [ô^(2)_n; b̂^(2)_n] = W_b û_n,   (9)

where W_a and W_b ∈ R^{(C_o+4)×512} denote the weights of the FC layers. As two possible node permutations exist in each subset, we further design a loss to penalize the gap between the predicted subset Ô_n and the ground-truth subset O_n. Inspired by the Wasserstein distance measurement [46], we introduce the following matching cost function to minimize the Earth Mover's Distance between Ô_n and O_n:

D(Ô_n, O_n) = min_π Σ_{k=1}^{2} [L_c(ô^(k)_n, o^{π_k}_n) + L_r(b̂^(k)_n, b^{π_k}_n)],   (10)

where π represents a certain node permutation, while π_k denotes the k-th item of π. Accordingly, (o^{π_k}_n, b^{π_k}_n) ∈ O_n refers to the ground-truth object label and bounding box coordinates of the π_k-th node. L_c(·) and L_r(·) represent the cross-entropy and ℓ1 regression loss functions, respectively. Intuitively, Eq. (10) first looks for the "best match" between elements in the predicted and ground-truth node subsets and then computes the final loss.
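For a two-element subset, this matching cost reduces to taking the cheaper of the two possible permutations, where each element's cost is a cross-entropy term plus an ℓ1 box-regression term. The library-free sketch below uses our own simplified data layout (class probabilities rather than logits) and is not the paper's implementation:

```python
from itertools import permutations
import math

def match_loss(pred, gt):
    """Permutation-minimizing subset matching cost in the spirit of Eq. (10).

    pred: list of (class_probs, box) predictions.
    gt:   list of (class_index, box) ground-truth elements.
    """
    def elem_cost(p, g):
        probs, pbox = p
        cls, gbox = g
        ce = -math.log(max(probs[cls], 1e-12))            # cross-entropy L_c
        l1 = sum(abs(a - b) for a, b in zip(pbox, gbox))  # box regression L_r
        return ce + l1
    # Minimum total cost over all assignments of predictions to ground-truths.
    return min(
        sum(elem_cost(pred[k], gt[pi]) for k, pi in enumerate(perm))
        for perm in permutations(range(len(gt)))
    )
```

With subsets of size two, only 2! = 2 permutations are enumerated, which is why the Hungarian algorithm of full-set matching is unnecessary here.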

C. EDGE SUBSET PREDICTION
ESP jointly predicts the two edges in each edge subset. The prediction for the q-th edge subset can be denoted as follows:

R̂_q = {(r̂^(1)_q, b̂^(1)_q), (r̂^(2)_q, b̂^(2)_q)},   (11)

where r̂^(1)_q and r̂^(2)_q ∈ R^{C_r} represent the two relationship class predictions for the q-th edge subset, while C_r refers to the number of relationship categories (including the non-relationship). Moreover, b̂^(1)_q and b̂^(2)_q ∈ R^4 denote the coordinates of the union bounding box for the two nodes connected by each edge. The ESP module comprises two components: Subset-based Reasoning and Edge Matching Loss.

Subset-based Reasoning. Existing works [10], [11], [14], [25], [28] tend to merge multimodal features, e.g., the appearance feature, the vector of object class prediction scores, and the spatial features of two nodes, to represent their edges. More formally, the edge feature from the j-th to the k-th node can be represented as follows:

e_jk = F(x_j, x_k),   (12)

where x_j and x_k denote the multimodal features of the two nodes, and F represents a fusion function, e.g., a linear function [9], [25] or a high-order fusion function [14], [43]. Using the obtained edge feature, existing works typically make predictions for each edge independently. However, Eq. (12) may result in indistinguishable representations for the edges related to a crowded node subset. This is because these edges may connect nodes with high IoU, and therefore very similar features. To address the above problem, we propose a new fusion function to obtain the edge subset representation. In the following, we take the q-th edge subset, which is composed of R_cs edges that connect the n-th to the m-th node subsets, as an example to describe the fusion function:

e_q = σ(W_s v^u_ij * W_o v_k) ⊙ (v^u_ijk + B_ijk),   (13)

where nodes i and j belong to the n-th node subset, while node k is from the m-th node subset. * denotes a fusion function defined in [42]:

x * y = W_x x + W_y y,   (14)

where W_x and W_y project x and y to a 512-dimensional space, respectively. σ(·) denotes the sigmoid function. W_s and W_o ∈ R^{512×512} are two projection matrices. v^u_ijk ∈ R^512 represents the appearance feature of the union area of the three nodes i, j, and k. B_ijk ∈ R^512 represents the relative spatial feature that encodes the union area of the n-th node subset and the node k in different channels. σ(W_s v^u_ij * W_o v_k) plays the role of feature selection from v^u_ijk and B_ijk. For R_cc edge subsets, we can consider nodes i and j as both belonging to the same node subset. For edge subsets composed of R_c or R_ss edges, we can consider nodes i and j as the same node.

Edge Matching Loss. After obtaining the edge subset representation by subset-based edge fusion, we adopt two FC layers to predict the two edges in each edge subset as follows:

[r̂^(1)_q; b̂^(1)_q] = W_c e_q,   [r̂^(2)_q; b̂^(2)_q] = W_d e_q,   (15)

where W_c and W_d ∈ R^{(C_r+4)×512} denote two projection matrices. We utilize the following matching cost function to penalize the difference between the q-th predicted and ground-truth edge subsets:

D(R̂_q, R_q) = min_ϕ Σ_{k=1}^{2} [L_c(r̂^(k)_q, r^{ϕ_k}_q) + L_r(b̂^(k)_q, b^{ϕ_k}_q)],   (16)

where ϕ denotes a specific edge permutation in the subset, while ϕ_k denotes the k-th item of ϕ, and (r^{ϕ_k}_q, b^{ϕ_k}_q) ∈ R_q refers to the ground-truth.
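To make the feature-selection idea concrete, the sketch below implements a sigmoid gate over the fused union/spatial features. The learned projection matrices are collapsed into elementwise weights purely for illustration, so this is a simplification of the paper's parameterization, not its exact form:

```python
import math

def gated_fusion(subject_feat, object_feat, union_feat, spatial_feat, w_s, w_o):
    """Feature-selection fusion sketch: a sigmoid gate computed from the
    subject/object features selects channels of the combined union-appearance
    and spatial features. All vectors share one dimensionality; the learned
    projections are reduced to elementwise weights (w_s, w_o) for brevity."""
    # Per-channel gate in (0, 1), derived from the subject and object features.
    gate = [1.0 / (1.0 + math.exp(-(ws * s + wo * o)))
            for ws, s, wo, o in zip(w_s, subject_feat, w_o, object_feat)]
    # The gate suppresses or passes each channel of the fused union features.
    return [g * (u + b) for g, u, b in zip(gate, union_feat, spatial_feat)]
```

The point of the gate is that channels of the union-area feature irrelevant to a particular subject-object pair are attenuated rather than carried into the edge subset representation.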

D. SGG BY SM-NET
During training, the overall loss function L of SM-Net can be written as follows:

L = Σ_{n=1}^{N} D(Ô_n, O_n) + Σ_{q=1}^{Q} D(R̂_q, R_q).   (17)

During testing, the object categories {ō^(1)_n, ō^(2)_n} for the n-th node subset are predicted by:

ō^(k)_n = argmax_{o∈C_o} ô^(k)_n(o), k = 1, 2,   (18)

where C_o represents the set of object categories. The predicted bounding boxes b̂^(1)_n and b̂^(2)_n are utilized to assign each predicted object category to a single node, according to the IoU value between the bounding box of one node and the predicted bounding box. The relationship categories {r̄^(1)_q, r̄^(2)_q} for the q-th edge subset are obtained as follows:

r̄^(k)_q = argmax_{r∈C_r} r̂^(k)_q(r), k = 1, 2,   (19)

where C_r represents the set of relationship categories. The predicted bounding boxes b̂^(1)_q and b̂^(2)_q are utilized to assign the corresponding edge category to the node pair that is relevant to the edge subset, according to the IoU value between the union box of the node pair and the predicted bounding box. Moreover, we enforce two predictions for all types of node and edge subsets to maintain the unified forms in Eq. (3) and Eq. (11). However, for O_s node subsets, as well as R_ss and R_c edge subsets, only one prediction is essentially required. In this case, we retain only the most confident prediction for these subsets.
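The IoU-based assignment step at inference time can be sketched as follows. The helper name and data layout are our own, and ties or conflicts between predictions are ignored for brevity:

```python
def assign_by_iou(pred_boxes, node_boxes, labels):
    """Assign each subset-level prediction to the node whose box overlaps its
    predicted box the most (by IoU), mirroring the inference-time assignment.

    pred_boxes: predicted boxes, one per prediction; node_boxes: proposal
    boxes; labels: predicted category per prediction.
    Returns a {node_index: label} mapping.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    assignment = {}
    for pbox, lab in zip(pred_boxes, labels):
        # Pick the node proposal that best overlaps this predicted box.
        best = max(range(len(node_boxes)), key=lambda i: iou(pbox, node_boxes[i]))
        assignment[best] = lab
    return assignment
```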

V. EXPERIMENTS
To demonstrate the effectiveness of SM-Net, we conduct exhaustive experiments on three datasets: Visual Genome (VG) [6], Visual Relationship Detection (VRD) [7], and Open Images (OI) [8]. In this section, we report the evaluation settings and implementation details, after which we conduct ablation studies and comparisons with state-of-the-art methods.

A. EVALUATION SETTINGS
Visual Genome: For this dataset, we follow the same data cleaning strategy [9] that has been widely adopted in recent works: specifically, the most frequently occurring 150 object categories and 50 relationship categories are utilized for evaluation. The scene graph for each image consists of 11.6 objects and 6.2 relationships on average. The data is then split into one training set and one testing set: the training set contains 75,651 images, with 5,000 images used as a validation subset, while the testing set comprises 32,422 images. Following existing works [13], [18], [23], [27], [28], [31], performance evaluation is conducted under the following three standard settings: 1) Scene Graph Detection (SGDET). Given an image, the model detects object bounding boxes and predicts both the object category for each bounding box and the relationship category for each bounding box pair. 2) Scene Graph Classification (SGCLS). Beginning from the ground-truth location of object bounding boxes, the model predicts both object and relationship categories. 3) Predicate Classification (PREDCLS). Using the ground-truth object bounding boxes and their object categories, the model predicts only the relationship categories. All three settings are evaluated according to Recall@K (R@K) metrics, where K is set to 20, 50, and 100, respectively. Furthermore, as there are severe class imbalance problems for the relationship categories in VG, we also adopt the mean Recall@K (mR@K) metric [30] for performance evaluation. Visual Relationship Detection: This dataset contains 5,000 images with 100 object categories and 70 relationship categories. It further has around 30,000 relationship annotations, with an average of eight relationships per image. We use the same train/test data split protocol as in [7], i.e., 4,000 images for training and 1,000 images for testing. Moreover, following [7], we evaluate the Relationship Detection and Phrase Detection tasks. 
The Relationship Detection task requires both object detection and relationship prediction between each pair of object bounding boxes. The Phrase Detection task detects the subject-object union boxes and predicts one triplet for each bounding box. We subsequently rank all retained predictions and report the R@50 and R@100 performance, respectively. Open Images: The complete training and validation sets of OI contain 53,953 and 3,234 images, respectively. The dataset includes 57 object categories and 10 relationship categories. To conduct evaluation on the OI dataset, we follow the same evaluation metrics as RelDN [25]: namely, Recall@50, the weighted mean AP of relationships (wmAP rel ), and the weighted mean AP of phrases (wmAP phr ). More specifically, to address the extreme predicate class imbalance issue in OI, RelDN [25] scales the impact of each predicate category on the performance with reference to their ratios in the validation set, which is referred to as the weighted mAP (wmAP). wmAP rel evaluates the AP of the predicted triplets where both the subject and object boxes have an IoU above 0.5 with their ground-truth, respectively. wmAP phr is similar except that the IoU threshold is applied to the union box of one subject-object pair rather than their respective boxes. The final score is given by score wtd = 0.2 × R@50 + 0.4 × wmAP rel + 0.4 × wmAP phr .
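The final Open Images score is simple weighted arithmetic over the three metrics; a one-line helper (the function and argument names are ours) makes the weighting explicit:

```python
def oi_weighted_score(recall_50, wmap_rel, wmap_phr):
    """Open Images final score: 0.2 * R@50 + 0.4 * wmAP_rel + 0.4 * wmAP_phr."""
    return 0.2 * recall_50 + 0.4 * wmap_rel + 0.4 * wmap_phr
```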

B. IMPLEMENTATION DETAILS
Existing works tend to adopt different object detection backbones for their SGG models. To facilitate fair comparison with the majority of existing works, we utilize ResNeXt-101-FPN [47], [48] as the backbone for the OI database; moreover, we adopt both ResNeXt-101-FPN [47], [48] and VGG-16 [49] as backbones on the VG database; for the VRD dataset, we utilize VGG-16 [49] as the backbone. During training, we freeze the layers before the RoIAlign layer in the backbone and optimize the remaining layers of SM-Net with both object detection and relationship classification objectives. We optimize SM-Net via Stochastic Gradient Descent (SGD) with momentum, with an initial learning rate of 10^-3 and a batch size of 6. The learning rate is multiplied by 0.1 if the performance on the validation set does not increase for two successive epochs. The total number of training epochs is set to 20. For the SGDET task on the VG database, we adopt the same two settings as in [10]. First, we only predict relationships for proposal pairs whose two boxes overlap. Second, the top 64 object proposals in each image are selected following per-class non-maximum suppression (NMS) with an IoU threshold of 0.3. For all protocols on the VRD and OI databases, we select the top 100 object proposals in each image following per-class NMS with an IoU threshold of 0.5 and then predict relationships for all obtained proposal pairs. Moreover, the ratio between pairs without any relationship and those with relationships during training is set to 3:1 on all three databases.

C. COMPARISONS WITH STATE-OF-THE-ART METHODS
In the following, we compare the performance of SM-Net with state-of-the-art methods on the VG, VRD, and OI datasets.
Visual Genome: As shown in Figure 3, SM-Net consistently outperforms state-of-the-art methods on both the Recall and Mean Recall metrics, indicating that SM-Net has advantages when handling the SGG class imbalance problem. Moreover, as illustrated in Figure 3, the largest performance gains achieved by SM-Net lie in the relationships that suffer from occlusion issues, e.g., carrying, riding, and covered in.

Visual Relationship Detection: We compare the performance of SM-Net with state-of-the-art methods on the VRD dataset. As shown in Table 3, SM-Net consistently achieves superior performance under both relation detection and phrase detection metrics. In more detail, SM-Net outperforms one of the most recent methods, i.e., Seq2Seq-RL [53], by 1.4% and 0.6% at R@100 on the relation detection and phrase detection tasks, respectively.
Open Images: We compare the performance of SM-Net with state-of-the-art methods on the OI dataset.
(Figure 4 caption: Qualitative comparisons between SM-Net and SGGNLS [52]. The first row shows comparisons at R@100 in the SGCLS setting; the second row shows comparisons at R@100 in the PREDCLS setting. Green indicates correctly classified objects or predicates; red indicates those that have been misclassified. Best viewed in color.)

D. ABLATION STUDY
In the following, we systematically investigate the effectiveness of each key component in SM-Net. Experiments are conducted on the most popular VG dataset, and the results are summarized in Table 5. Besides, qualitative comparisons between SM-Net and one very recent model, named SGGNLS [52], are shown in Figure 4.
To justify the advantages of set prediction, we adopt two baselines in Table 5: a "BI" baseline that predicts each element in the scene graph independently and a "BS" baseline that jointly predicts the elements in each subset. Specifically, the "BI" and "BS" baselines utilize the concatenation fusion function to obtain the node/edge features and node/edge subset features, respectively. Furthermore, the "BI" baseline utilizes the cross-entropy loss to obtain the node/edge predictions, while the "BS" baseline applies the loss defined in Eq. (16) to obtain the node/edge subset predictions. From the results, the "BS" baseline outperforms the "BI" baseline on both the SGCLS and PREDCLS tasks, demonstrating the effectiveness of set prediction. Note that Exps. 1-3 are all based on the "BS" baseline.
Effectiveness of the Proposed Modules. We first perform an ablation study to justify the effectiveness of the NSP and ESP modules. The results are summarized in Table 5. Exps. 1-3 show that each module helps to promote the performance of SGG; the best performance is achieved when both modules are involved. Note that NSP and ESP are designed to refine object and relationship representations, respectively. Therefore, NSP helps the model achieve outstanding SGCLS performance, which heavily depends on object classification ability. Meanwhile, ESP enables the model to achieve a significant performance gain on the PREDCLS task, which mainly relies on relationship prediction power.
Qualitative Comparisons. Figure 4 presents a qualitative comparison between SM-Net and SGGNLS [52]. As can be seen from the first row of Figure 4, SM-Net makes better node predictions than SGGNLS [52] for book and plate, which are hard to recognize from their proposals. We owe this performance gain to the NSP module, which refines the node subsets via subset-based message passing and jointly predicts the object categories in each node subset.
As shown in the second row of Figure 4, SM-Net can correctly identify the relationship between woman and motorcycle as sitting on rather than riding. We give this credit to the ESP module that jointly considers the edge categories in each edge subset.

E. FURTHER ANALYSIS AND DISCUSSIONS
We conduct additional experiments to further verify the design choices for the NSP and ESP modules. Results are summarized in Table 6 and Table 7, respectively.
Comparisons between Three MP Modules. In Table 6(a), we compare the performance of SMP with another two MP modules. The first is implemented according to Eq. (4) and Eq. (5); we refer to it as Global Context MP (GCMP) [13], [30]. The second is the Direction-aware MP (DMP) module [14]. All three modules take the node subset features as input and produce refined node subset features based on the contextual information. As shown in Table 6(a), SMP outperforms GCMP and DMP by 0.5% and 0.2% at R@100 on the SGCLS task, respectively. We attribute this mainly to the fact that SMP formulates the contextual information as a vector, which stores more complex correlation information than the simple scalar weights used in GCMP and DMP.
Comparisons between Three Pooling Strategies. In Table 6(b), we test the performance of SM-Net with different pooling strategies for SMP. "SUM", "Aver.", and "Max" represent sum pooling, average pooling, and max pooling, respectively. As shown in Table 6, SMP achieves the best performance when applying max pooling to obtain the contextual information. This may be because the max pooling operation reduces the noise that can be induced by average or sum pooling, which oblige all the subsets to interact with each other.
Evaluation on the Number of SMP Layers. In Table 6(c), we show the performance of SM-Net with different numbers of SMP layers, ranging from two to five. The model performance improves consistently as the number of SMP layers increases. However, due to limitations on GPU memory size, we only conduct experiments with up to five SMP layers.
Design Choices for ESP Module. In Table 7, we compare the performance of different fusion functions utilized in the ESP module. "Concat." denotes the concatenation fusion function utilized in the "BS" baseline. "MFB" represents the multi-modal factorized bilinear fusion function applied in [62]. "SUM" and "GATE" denote the two fusion functions proposed in [42]. For fair comparison, these fusion functions are all utilized to obtain the subset-level edge representation. As illustrated in Table 7, our subset-based reasoning (SR) fusion function achieves the best performance. This may be because the feature selection-based fusion function suppresses redundant features in the union area.

VI. CONCLUSION
In this work, we study the occlusion problem in SGG and propose a novel framework named SM-Net. Specifically, we decompose SGG into node subset matching and edge subset matching problems. Unlike existing works, which classify each element in a scene graph independently, SM-Net jointly predicts the categories of the nodes and edges within the same node and edge subsets, respectively. With the subset matching strategy, both the correlation between nearby nodes and the correlation between their related edges are considered; therefore, SM-Net can alleviate the occlusion problem and is more robust for generating scene graphs in complex scenes. Extensive experimental results on three popular SGG datasets justify the effectiveness of SM-Net.