Introduction
With the increase of digital content, especially images, in the real world, image understanding tasks such as classification and retrieval have become more important than ever for making content easy to access. However, most existing research focuses on single-image understanding, and understanding an image collection remains challenging. In recent years, the task of understanding an image collection has attracted attention in various applications [1], such as semantic image retrieval [2], [3], Web-image concept understanding [4], [5], and multiple-image summarization [6], [7], [8], [9]. Generating a summarized scene graph has also shown advantages in visual storytelling [10], [11] and video summarization [12] applications. The typical first stage in understanding an image collection is to understand the overall context and find a representation of it, e.g., in the form of words, sentences, or scene graphs. Compared to other representations, a scene graph has the advantage of being able to represent the contexts of images by describing objects and their relationships. Scene-graph generation is used in various tasks such as single-image captioning [13], [14], image retrieval [3], [15], [16], and multiple-image context summarization [7]. However, scene-graph generation is commonly introduced to generate a scene graph of a single image, whereas summarizing an image collection into a single summarized scene graph is advantageous for understanding the overall context and for image querying applications [6]. The common challenge in scene-graph summarization is estimating the relationships between object categories detected in different images. In order to better summarize the information of an image collection, we aim to understand the relationships between objects detected in different images by employing external knowledge graphs. Figure 1 shows an example of summarizing an image collection into a combined scene-graph representation, which can describe the overall context by estimating the similar concepts of the visual objects. For example, we humans can find the commonly occurring objects of an image collection, which are cow, sheep, hill, and street, and their relationships, such as cow-on-hill and sheep-on-street. Based on external knowledge, both street and hill are places, whereas sheep and cow are animals, and a hill is a common place for animals. Therefore, we can infer contexts such as cow-on-hill, sheep-on-hill, and sheep-near-street. This example shows the advantage of using external knowledge in finding relationships across multiple images.
Example of generating a scene-graph representation of an image collection. The dotted lines represent the semantic relationships, and the solid lines show the inferred relations of the summarized scene graph.
To generate a summarized scene graph of an image collection, a naïve approach [18], [19] summarizes images by incorporating external knowledge. However, the challenges of incorporating external knowledge are defining reasonable knowledge and estimating appropriate relationships between object categories. Based on these motivations, we previously proposed a scene-graph summarization method using graph theory for generating a caption of an image collection [8], [9]. That method required a concept generalization process that finds the common concept words of an image collection to refine the final caption, which was performed by generalizing words. However, this reduces details in the final caption, for example by replacing cow and sheep with animal instead of describing both in the summarized information. Therefore, the remaining challenge is to find relationships between different objects without losing details. For example, in the case of sheep on street and cow on hill, if we can utilize external knowledge stating that both street and hill are places for animals, we can conclude that both are in similar contexts, such as places for living, and thus infer indirect relationships such as sheep-on-hill. Based on this idea, and inspired by the use of external knowledge to generate scene graphs for unseen images [20], [21], the proposed method enhances the relation predictor of the scene-graph generation process so that it can generate generalized relationships that grasp the relationships of different objects in the same category across images.
In order to realize a scene-graph summarization method for an image collection using external knowledge, there are three hurdles. First, we need to model external knowledge for the training process, for which we incorporate ConceptNet [22], a knowledge graph of commonsense semantic information. Second, we need to integrate the knowledge graph into the relation predictor of the scene-graph generation model. Lastly, we need to construct a summarized scene graph by combining the information of all images and then generate a final scene graph. Figure 2 compares the generation of a summarized scene graph by conventional methods, which perform the summarization in the inference phase, and by the proposed method. It demonstrates the case of finding a relationship between two sub-graphs by joining their common location, hill.
Furthermore, it is challenging to estimate the confidence score of each relationship when obtaining a final scene graph, whereas a typical scene-graph generation method obtains the final scene graph directly from confidence scores. To improve this estimation, we employ PageRank [23] to re-calculate the node scores used for selecting relationships. Another remaining challenge is the lack of a dataset specific to the scene-graph summarization task. We hence construct a dataset for evaluating the proposed method based on the MS-COCO dataset [24], a popular image captioning dataset that is widely used across various tasks including image retrieval and image summarization. In order to evaluate a summarized scene graph, we introduce an evaluation process that computes a similarity score based on the F-score, whereas previous works focus only on precision. By this, the proposed evaluation process can also account for false negatives.
Our contributions can be summarized as follows:
We propose a scene-graph summarization method that generates a summarized scene graph of an image collection containing indirect relationships, inferred by integrating external knowledge graphs into the relation prediction process.
We introduce a sub-graph confidence score for estimating a summarized scene graph of an image collection.
We introduce an evaluation process for a summarized scene graph that calculates the F-score, accounting for both false positives and false negatives of a generated scene graph.
Related Work
In this section, we review related work on three topics: Image Collection Summarization, which discusses work that aims to generate summarized information of an image collection; Scene-Graph Generation, which discusses methods to generate image information in graph form; and Knowledge Graph, which introduces the external knowledge used in the proposed method.
A. Image Collection Summarization
Image collection summarization is the task of generating a representative summary of an image collection. Traditionally, it aims to find representative information in the form of an image, textual, or scene-graph representation.
a: Image Representation
Summarizing an image collection is typically introduced in the photo album summarization task, which aims to find an image that represents an image album. Yu et al. [10] proposed a model composed of three hierarchically attentive Recurrent Neural Networks (RNNs) to encode album photos, select representative photos, and generate a story. Wang et al. [25] proposed a model with a hierarchical photo-scene encoder and a reconstructor for generating an album story. Moreover, many works find a representative image of an image collection using a clustering algorithm, such as the Self-Organizing Map (SOM) [26], [27].
b: Textual Representation
Textual information is a popular summarization form for the image collection summarization task, represented as keywords, tags, phrases, or sentences. For summarizing an image collection into keywords or tags, Samani and Moghaddam [28] proposed a semantic summarization method that takes a domain ontology as an input of the system to provide knowledge about the concept domain, e.g., Colosseum and Trevi Fountain. Zhang et al. [29] proposed a model to analyze an image collection and generate appropriate visual summaries and textual topics, e.g., sunset, sky, and sun. For summarizing an image collection into phrases, Trieu et al. [7] proposed a new task named multi-image summarization, which aims to generate a descriptive summary of an image collection such as styles of bags. They also introduced a new dataset for this task by collecting 2.1 million images from Web pages and building collections of images, each consisting of at least five images. Li et al. [6] introduced a new task called context-aware captioning, which aims to describe an image collection in a context different from other image collections. We [8], [9] introduced a method to generate a caption of an image collection from a summarized scene graph built with graph theory [30].
c: Scene-Graph Representation
As scene graphs are widely used for describing visual objects and their relationships in a single image [31], they are also used for describing multiple images. Pasini et al. [2] proposed an image-collection summarization method based on frequent subgraph mining that represents an image collection in a sub-graph form on the MS-COCO dataset [24]. Yang et al. [32] introduced a challenging task named Panoptic Video Scene Graph Generation (PVSG), which aims to generate a summarized scene graph of real-world data, and contributed a new panoptic video dataset for this task.
In the proposed method, we aim to describe an image collection by a scene graph, focusing on integrating external knowledge into the learning process.
B. Scene-Graph Generation
Scene-graph generation [31] is a popular technique for describing relationships between objects in an image. The relationships are generally represented as triplets consisting of a subject, a predicate, and an object. A common scene-graph generation architecture is divided into two main processes: object detection to detect the objects inside the image, and relationship prediction to find the edges between the objects. In recent years, it has been widely introduced and implemented on the Visual Genome dataset [33] and the MS-COCO dataset [24]. In addition, scene-graph generation has been adapted to various applications, such as image captioning [34] and image retrieval [3], and has been shown to improve their results. Various techniques have been introduced for scene-graph generation. Neural Motif [35] is built on Faster R-CNN [36] with various backbones, such as ResNet-50 [37] and ResNeXt-101 [38], and propagates context through a Bidirectional Long Short-Term Memory (BiLSTM) [39] for predicting relations. VCTree [40] is composed of dynamic tree structures, showing the advantage of a binary tree in finding co-occurrences and usual relationships between objects by allowing a dynamic structure. Iterative Message Passing (IMP) [41] is an end-to-end scene-graph model using standard RNNs that improves the prediction via message passing. To address the long-tail problem of scene-graph datasets, more recent work [42] introduces techniques to reduce dataset bias. Relation Transformer for Scene Graph Generation (RelTR) [43] is a one-stage, end-to-end scene-graph generation technique that uses an attention mechanism and predicts a fixed-size set of subjects, objects, and relationships to generate a scene graph.
In the proposed method, we use scene graphs as a means to model the relationships between images in an image collection.
C. Knowledge Graph
A knowledge base is widely used to enrich models, especially text-generation models [44]. ConceptNet [22] and the Wikipedia dataset are popular knowledge bases used in the generation process. ConceptNet is a knowledge graph that represents general knowledge and commonsense information, while the Wikipedia dataset is structured knowledge data with detailed information on each topic. In recent years, knowledge graphs have become popular knowledge bases in various generation processes, mainly for capturing commonsense reasoning during generation. To tackle the long-tail issues of scene-graph generation mentioned above, integrating knowledge graphs into the generation process is a widely introduced strategy whose results show its advantage. Moreover, knowledge graphs have also been applied to image retrieval to reason about the semantic context and generalize the concepts inside an image [29].
In the proposed method, we use ConceptNet, a knowledge graph, to enhance the relation predictor for finding unseen relationships across images.
Proposed Method: Scene-Graph Summarization Model
Based on the idea of enhancing the relation predictor with external knowledge to predict unseen relationships, we build the proposed method by adapting an existing scene-graph generation method, Neural Motif [35]. The proposed method starts by extracting visual features from each image and then finds contextualized representations of each image following the Neural Motif approach. Next, we incorporate external knowledge into all contextualized representations. Lastly, we predict the relationship of each object pair in the contextualized representations and reconstruct them as a summarized scene graph, as illustrated in Fig. 3.
Overview of the proposed method consisting of five components: (A) Object Detection detects features from each image in an image collection, (B) Object Context Construction constructs the contextualized representation of the estimated context of each image, (C) External Knowledge Integration finds the knowledge graphs based on the object contexts and integrates the knowledge graphs and contextualized representations, (D) Relation Prediction predicts relationships between each combination of object contexts and contextualized representations, and (E) Sub-Graph Confidence Score Calculation calculates scores of all objects from the relation prediction and then generates a summarized scene graph as an Output.
The proposed method has five main components. The Object Detection component detects the visual features of images and is modified for detecting objects in an image collection. The Object Context Construction component finds the contextualized representations of the images. To generate a summarized scene graph from the contextualized representations of an image collection, we introduce the External Knowledge Integration component, which finds the indirect relationships between detected objects and encodes them for the Relation Prediction component, which generates the relationships between objects. Lastly, we introduce the Sub-Graph Confidence Score Calculation component, which calculates the confidence scores of objects.
A. Object Detection
The first component is the Object Detection component, which detects a set of region proposals; Faster R-CNN [45] with ResNet-101 [37] is used as the detector backbone, which shows good performance in scene-graph generation [42] compared with other backbones [31]. Following the scene-graph generation approach, a set of region proposals is detected from each image.
In the inference phase, we modify the object detector to parse multiple images into the Relation Prediction component, which generates a summarized scene graph of an image collection. Based on a single-image scene-graph generation model, we build an object detector backbone to detect the image features and bounding boxes \begin{align*} F &= \left\{[\mathbf{f}_{n,1}, \ldots, \mathbf{f}_{n,M}]\right\}_{n=1,\ldots,N}, \tag{1}\\ B &= \left\{[b_{n,1},\ldots,b_{n,M}]\right\}_{n=1,\ldots,N}, \tag{2}\end{align*} where $\mathbf{f}_{n,m}$ and $b_{n,m}$ denote the visual feature and bounding box of the $m$-th region proposal in the $n$-th image, $N$ is the number of images in the collection, and $M$ is the number of region proposals per image.
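To make this detection stage concrete, the following is a minimal sketch, not the paper's implementation, of collecting bounding boxes and labels for every image in a collection in the spirit of Eqs. (1) and (2). It assumes torchvision's pre-trained ResNet-50 FPN detector as a stand-in for the Faster R-CNN with ResNet-101 backbone used here; the region features $\mathbf{f}_{n,m}$ would come from the detector's ROI head, which is omitted.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Hypothetical sketch: gather region proposals (boxes and labels) for each image in
# a collection, mirroring Eqs. (1)-(2). The ResNet-50 FPN detector is a stand-in for
# the ResNet-101 backbone used in the paper; region features from the ROI head are
# omitted here.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_collection(images, score_thresh=0.5):
    """images: list of CHW float tensors in [0, 1], one per image in the collection."""
    with torch.no_grad():
        outputs = detector(images)           # one dict of boxes/labels/scores per image
    B, L = [], []
    for out in outputs:
        keep = out["scores"] > score_thresh
        B.append(out["boxes"][keep])         # b_{n,1..M}: bounding boxes of image n
        L.append(out["labels"][keep])        # predicted class labels of image n
    return B, L
```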
B. Object Context Construction
The second component is the Object Context Construction component, which constructs a contextualized representation of the set of region proposals by concatenating them into a linear sequence sorted by detected location:\begin{equation*} C = \textrm{biLSTM}([\mathbf{f}_{n};W_{1}\mathbf{l}_{n}]_{n=1,\ldots,N}), \tag{3}\end{equation*} where $\mathbf{f}_{n}$ is the visual feature of the $n$-th region and $\mathbf{l}_{n}$ is its label distribution predicted by the detector, projected by $W_{1}$.
The object labels are then decoded sequentially as \begin{align*} \mathbf{h}_{n} &= \textrm{LSTM}_{n}([\mathbf{c}_{n};\widehat{\mathrm{o}}_{n-1}]), \tag{4}\\ \widehat{\mathrm{o}}_{n} &= \textrm{onehot}(\textrm{argmax}(W_{o}\mathbf{h}_{n})) \in \mathbb{R}^{|C|}, \tag{5}\end{align*} where $\mathbf{c}_{n}$ is the $n$-th contextualized representation in $C$ and $\widehat{\mathrm{o}}_{n}$ is the decoded object label.
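As a rough illustration of Eqs. (3)-(5), the sketch below encodes the region features and label embeddings with a bidirectional LSTM and decodes object labels step by step. All dimensions, the label-embedding layer, and the module interface are assumptions made for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the object-context encoder and label decoder of Eqs. (3)-(5).
class ObjectContext(nn.Module):
    def __init__(self, feat_dim=4096, label_dim=200, hidden=512, num_classes=151):
        super().__init__()
        self.W1 = nn.Embedding(num_classes, label_dim)        # label embedding, W_1 l_n
        self.bilstm = nn.LSTM(feat_dim + label_dim, hidden,
                              bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * hidden + num_classes, hidden, batch_first=True)
        self.W_o = nn.Linear(hidden, num_classes)

    def forward(self, feats, labels):
        # feats: (N, feat_dim) region features, labels: (N,) detector class indices
        x = torch.cat([feats, self.W1(labels)], dim=-1).unsqueeze(0)
        C, _ = self.bilstm(x)                                 # Eq. (3): contextualized reps
        C = C.squeeze(0)                                      # (N, 2 * hidden)
        logits_seq, state = [], None
        prev = torch.zeros(self.W_o.out_features)             # no previous label at step 0
        for c_n in C:                                         # Eqs. (4)-(5), step by step
            h_n, state = self.decoder(torch.cat([c_n, prev]).view(1, 1, -1), state)
            logits = self.W_o(h_n.view(-1))
            prev = nn.functional.one_hot(logits.argmax(),
                                         self.W_o.out_features).float()
            logits_seq.append(logits)
        return C, torch.stack(logits_seq)                     # contexts and object logits
```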
C. External Knowledge Integration
The integration of external knowledge to enhance the relation predictor involves two stages. First, we build a knowledge graph based on the external knowledge from ConceptNet [22]. Then, we build an encoding layer that encodes the external knowledge for incorporation into the relation predictor. The knowledge graphs are built based on the class labels of the set of object contexts.
1) Knowledge-Graph Construction
The objective here is to build a word-embedding knowledge graph from ConceptNet. Since ConceptNet provides various kinds of relation information, we build a knowledge graph focusing on the semantic relations "relatedTo", "similarTo", and "synonym" to improve the relation prediction of similar objects. In the building process, we first initialize the word collection from the 150 class labels of VG200 [46] and then retrieve the semantic relations by giving a class pair.
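As an illustration of the retrieval step, the sketch below queries the public ConceptNet API for edges between a pair of class labels and keeps only the three semantic relations named above (ConceptNet exposes them as RelatedTo, SimilarTo, and Synonym); the function name and the English-only lookup are assumptions, and caching and error handling are omitted.

```python
import requests

# Hedged sketch: retrieve ConceptNet edges between two class labels and keep only the
# semantic relations used for the knowledge-graph construction.
SEMANTIC_RELATIONS = {"/r/RelatedTo", "/r/SimilarTo", "/r/Synonym"}

def semantic_edges(word_a, word_b):
    url = "https://api.conceptnet.io/query"
    params = {"node": f"/c/en/{word_a}", "other": f"/c/en/{word_b}"}
    edges = requests.get(url, params=params, timeout=10).json().get("edges", [])
    return [(e["start"]["label"], e["rel"]["@id"], e["end"]["label"], e["weight"])
            for e in edges if e["rel"]["@id"] in SEMANTIC_RELATIONS]

# Example: semantic_edges("cow", "sheep") may return RelatedTo edges linking the two labels.
```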
2) Knowledge-Graph Integration
For the External Knowledge Integration component, we first build a Graph Convolutional Network (GCN) with the GlobalSortPool operator [48], which learns from the nodes of the graph topology by sorting them and keeping the most informative ones instead of summing them up, as an encoder for a knowledge graph. The knowledge graph of each class pair $(x,y)$ is then encoded as \begin{equation*} \mathbf{e}_{\textrm{kb}}^{(x,y)} = \mathrm{GlobalSortPool}(\mathbf{N}^{(x,y)}), \tag{6}\end{equation*} where $\mathbf{N}^{(x,y)}$ denotes the GCN node representations of the knowledge graph built for the class pair $(x,y)$.
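A minimal sketch of such an encoder with PyTorch Geometric is shown below, assuming two GCN layers over the word-embedding nodes of a class-pair knowledge graph followed by GlobalSortPool (exposed as global_sort_pool in PyTorch Geometric; newer releases provide the same operator as SortAggregation). The layer count, dimensions, and the value of k are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_sort_pool

# Sketch of the knowledge-graph encoder of Eq. (6): GCN layers over word-embedding
# nodes, then sort-pooling to a fixed-size vector e_kb^(x, y).
class KGEncoder(nn.Module):
    def __init__(self, in_dim=300, hidden=128, k=10):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.k = k                                   # number of nodes kept by SortPool

    def forward(self, x, edge_index, batch):
        # x: (num_nodes, in_dim) node word embeddings of the (x, y) knowledge graph
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        # Sort nodes by their last feature channel and keep the top-k instead of
        # summing them, yielding a vector of fixed size k * hidden per graph.
        return global_sort_pool(h, batch, self.k)
```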
In the training and evaluation processes, we first retrieve all predicted class pairs from the object contexts and encode their corresponding knowledge graphs with the encoder described above.
D. Relation Prediction
The object contexts obtained by the previous process are used in the Relation Prediction component, in which the set of contextualized representations is encoded into edge contexts as \begin{equation*} E = \textrm{biLSTM}([\mathbf{c}_{n};W_{2}\widehat{\mathrm{o}}_{n}]_{n=1,\ldots,N}), \tag{7}\end{equation*} where $W_{2}$ projects the decoded label $\widehat{\mathrm{o}}_{n}$.
The predicate of each object pair $(i,j)$ is then predicted as \begin{align*} \mathbf{g}_{i,j} &= (W_{h}\mathbf{e}_{i})(W_{t}\mathbf{e}_{j})\mathbf{f}_{i,j}, \tag{8}\\ \mathbf{r}_{i,j} &= \textrm{argmax}([\mathbf{g}_{i,j};\mathbf{e}_{\textrm{kb}}^{(i,j)}]W_{r}), \tag{9}\end{align*} where $\mathbf{e}_{i}$ and $\mathbf{e}_{j}$ are the edge contexts of the subject and object, $\mathbf{f}_{i,j}$ is the feature of their union region, $\mathbf{e}_{\textrm{kb}}^{(i,j)}$ is the knowledge-graph embedding of the class pair, and $W_{h}$, $W_{t}$, and $W_{r}$ are learned projections.
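The sketch below shows one plausible reading of Eqs. (8) and (9): the head and tail edge contexts are projected, combined element-wise with the union-region feature (following Neural Motif), concatenated with the knowledge-graph embedding, and classified into a predicate. The dimensions and the element-wise product are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of the relation predictor of Eqs. (8)-(9); dimensions are illustrative.
class RelationHead(nn.Module):
    def __init__(self, edge_dim=512, union_dim=4096, kb_dim=1280, num_preds=51):
        super().__init__()
        self.W_h = nn.Linear(edge_dim, union_dim)
        self.W_t = nn.Linear(edge_dim, union_dim)
        self.W_r = nn.Linear(union_dim + kb_dim, num_preds)

    def forward(self, e_i, e_j, f_ij, e_kb_ij):
        # Eq. (8): combine head/tail edge contexts with the union-region feature
        g_ij = self.W_h(e_i) * self.W_t(e_j) * f_ij
        # Eq. (9): concatenate the knowledge-graph embedding and classify the predicate
        logits = self.W_r(torch.cat([g_ij, e_kb_ij], dim=-1))
        return logits.argmax(dim=-1), logits
```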
E. Sub-Graph Confidence Score Calculation
Since the proposed method processes multiple images to generate a summarized scene graph and aims to generate all possible relationships across images, we also need to re-estimate the relationship scores in the generated scene graph instead of relying only on confidence scores. The estimation computes triplet scores from the subject, predicate, and object confidences by analogy with PageRank [23].
To calculate a score, we first compute a summarized score for each object by summing its confidence scores as:\begin{equation*} \mathrm{obj\_score}_{i} = \sum_{j=0}^{N}\mathrm{obj\_confidence}_{i,j}, \tag{10}\end{equation*}
Then, the mean of the object scores over the $M$ detected objects is calculated as:\begin{equation*} \mathrm{mean}_{\mathrm{obj}} = \frac{1}{M} \sum_{i=0}^{M}\mathrm{obj\_score}_{i}, \tag{11}\end{equation*}
Lastly, we collect the object pairs whose relation scores are greater than the mean score and employ PageRank to calculate the confidence score of each object; this process is detailed in Algorithm 1.
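A compact sketch of this scoring step is given below, assuming triplets pooled over all images with per-element confidences. The exact selection rule lives in Algorithm 1 and is not reproduced here; thresholding each pair's object scores against the mean is one plausible reading of the text.

```python
import networkx as nx

# Hedged sketch of the sub-graph confidence calculation (Eqs. (10)-(11)) followed by
# PageRank re-scoring, in the spirit of Algorithm 1.
def summarize(triplets):
    """triplets: (subj, pred, obj, subj_conf, obj_conf, rel_conf) pooled over all images."""
    obj_score = {}                                   # Eq. (10): per-object summed confidence
    for s, _, o, s_conf, o_conf, _ in triplets:
        obj_score[s] = obj_score.get(s, 0.0) + s_conf
        obj_score[o] = obj_score.get(o, 0.0) + o_conf
    mean_obj = sum(obj_score.values()) / len(obj_score)   # Eq. (11): mean object score

    g = nx.DiGraph()
    for s, p, o, _, _, r_conf in triplets:
        # keep pairs whose objects both score above the mean (one plausible reading)
        if obj_score[s] > mean_obj and obj_score[o] > mean_obj:
            g.add_edge(s, o, predicate=p, weight=r_conf)
    rank = nx.pagerank(g, weight="weight")           # re-calculated node confidence scores
    return sorted(((s, d["predicate"], o, rank[s] + rank[o])
                   for s, o, d in g.edges(data=True)),
                  key=lambda t: t[3], reverse=True)
```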
Evaluation Process
Due to the lack of ground truth for this task, we use metrics commonly used in image-collection scene-graph summarization tasks [2], [19]: the similarity [16], [49], [50], coverage [28], [51], and diversity [52], [53] of a generated scene graph with respect to the ground-truth scene graph of each image. However, most evaluation techniques focus on estimating precision, so the evaluation score tends to increase with the quantity of generated results. We therefore introduce an evaluation process that focuses on evaluating the quality of a summarized scene graph using the F-score based on the similarity between scene graphs. Among the various approaches for estimating the similarity between scene graphs, word-embedding-based techniques have shown better qualitative estimation in scene-graph generation [50].
Given a ground-truth scene graph and a generated scene graph, each represented as a set of triplets, the evaluation process computes a similarity score between them as illustrated in Fig. 4.
Overview of the evaluation process consisting of three components: From Candidate Triples and Reference Triplets, Word Embedding encodes both of them into a vector form, (A) Triplet Score Calculation calculates all triplet similarities between candidates and references, (B) Maximum Value Selection finds the maximum value of the similarity scores of each triplet pair, and (C) Graph Similarity Score Calculation calculates the final score.
The calculation process is adapted from BERTScore [54]. In the BERTScore calculation, Bidirectional Encoder Representations from Transformers (BERT) [55] embeddings first encode all words of the candidate and reference sentences into vector form. Next, the similarity scores between all words are calculated, and the maximum score for each word is selected based on greedy matching. Lastly, the precision, recall, and F-score are calculated as evaluation metrics of a candidate sentence. Following this process, we can also evaluate the false negatives of a candidate scene graph, whereas other evaluation techniques mainly focus on precision. Thus, in the proposed evaluation process, we first encode all candidate triplets and reference triplets into vector form. Next, we calculate the similarity score between each reference triplet and every candidate triplet. Then, we select the maximum similarity score for each triplet. Lastly, we calculate the precision score, the recall score, and consequently the F-score as the scene-graph similarity score. Details of each step are described below.
A. Triplet Score Calculation
Given a generated triplet and a reference triplet in vector representations, the similarity between two corresponding elements $\mathbf{a}$ and $\mathbf{b}$ is calculated by cosine similarity:\begin{equation*} S(\mathbf{a},\mathbf{b})=\frac{\mathbf{a}\cdot \mathbf{b}}{\left\| \mathbf{a} \right\| \cdot \left\| \mathbf{b} \right\|}, \tag{12}\end{equation*}
While the similarity between subjects or objects is calculated from word similarity, the similarity between predicates is estimated by exact matching, since the calculation of scene-graph similarity focuses on the relationships between objects, which reduces redundant information [49]. The similarity between a predicate $\mathbf{p}$ and a reference predicate $\widehat{\mathbf{p}}$ is therefore defined as \begin{align*} S_{\textrm{pred}}(\mathbf{p},\widehat{\mathbf{p}}) = \begin{cases} 1 & \mathbf{p} = \widehat{\mathbf{p}};\\ 0 & \mathbf{p} \neq \widehat{\mathbf{p}}. \end{cases} \tag{13}\end{align*}
Given these element-wise similarity scores, the similarity between a ground-truth triplet $\mathrm{t}_{i}$ and a generated triplet $\widehat{\mathrm{t}}_{j}$ is calculated as their mean:\begin{equation*} M_{\mathrm{sim}}(\mathrm{t}_{i},\widehat{\mathrm{t}}_{j}) = \textrm{mean}(\{S_{\textrm{sub}}(\mathbf{s}_{i},\widehat{\mathbf{s}}_{j}), S_{\textrm{pred}}(\mathbf{p}_{i},\widehat{\mathbf{p}}_{j}), S_{\textrm{obj}}(\mathbf{o}_{i},\widehat{\mathbf{o}}_{j})\}). \tag{14}\end{equation*}
B. Maximum Value Selection
From the similarity scores between triplets, we take the maximum matching score for each ground-truth triplet and sum all maximum matching scores as \begin{equation*} \mathrm{Max}_{\mathrm{SGSim}} = \sum_{t_{i} \in G}\underset{\widehat{t}_{j} \in \widehat{T}}{\textrm{max}}(M_{\mathrm{sim}}(t_{i},\widehat{t}_{j})), \tag{15}\end{equation*} where $G$ is the set of ground-truth triplets and $\widehat{T}$ is the set of generated triplets.
C. Graph Similarity Score Calculation
To estimate the final similarity score between scene graphs, we first calculate the recall score as the ratio of the sum of the maximum similarity scores to the size of the ground-truth graph:\begin{equation*} R_{\mathrm{SGSim}}=\frac{1}{\left| G \right|}\mathrm{Max}_{\mathrm{SGSim}}. \tag{16}\end{equation*}
Then, we calculate the precision score as the ratio of the sum of the maximum similarity scores to the size of the generated graph:\begin{equation*} P_{\mathrm{SGSim}}=\frac{1}{|\widehat{G}|}\mathrm{Max}_{\mathrm{SGSim}}. \tag{17}\end{equation*}
Lastly, the harmonic mean of $P_{\mathrm{SGSim}}$ and $R_{\mathrm{SGSim}}$ gives the F-score:\begin{equation*} F_{\mathrm{SGSim}}=2\frac{P_{\mathrm{SGSim}} R_{\mathrm{SGSim}}}{P_{\mathrm{SGSim}} + R_{\mathrm{SGSim}}}. \tag{18}\end{equation*}
We demonstrate in Algorithm 2 the calculation process for all triplets of a summarized scene graph and a ground-truth scene graph.
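For reference, a minimal sketch of the SGSim computation of Eqs. (12)-(18) is given below. It assumes that subject and object words are already mapped to word-embedding vectors (e.g., GloVe or BERT vectors) and that predicates are compared as strings; the function names and the embedding choice are assumptions, not part of the paper.

```python
import numpy as np

def cosine(a, b):                                             # Eq. (12)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_sim(t, t_hat):                                    # Eqs. (13)-(14)
    s, p, o = t
    s_hat, p_hat, o_hat = t_hat
    pred_sim = 1.0 if p == p_hat else 0.0                     # exact match for predicates
    return (cosine(s, s_hat) + pred_sim + cosine(o, o_hat)) / 3.0

def sgsim(reference, candidate):
    """reference, candidate: lists of (subj_vec, pred_str, obj_vec) triplets."""
    if not reference or not candidate:
        return 0.0
    # Eq. (15): greedy matching, best candidate for each ground-truth triplet
    max_sum = sum(max(triplet_sim(t, t_hat) for t_hat in candidate) for t in reference)
    recall = max_sum / len(reference)                         # Eq. (16)
    precision = max_sum / len(candidate)                      # Eq. (17)
    return 2 * precision * recall / (precision + recall + 1e-12)   # Eq. (18)
```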
Experiments
A. Dataset
Due to the lack of image summarization datasets, we adapt two datasets for the experiments: an image captioning dataset, MS-COCO [24], and a visual scene-graph dataset, Visual Genome [33]. For the training process and the preliminary evaluation, we use the VG200 dataset [46], which is based on the Visual Genome dataset, consists of 50 relationships, is balanced in category frequency, and contains 101,174 images from the MS-COCO dataset. To experiment on the scene-graph image-collection summarization task, we build an annotated testing set of image collections by grouping images in the MS-COCO testing set using VSE++ [56], an image-caption retrieval model that estimates the similarity between images and captions. Following the Karpathy split [57] of the MS-COCO dataset, the initial testing set consists of the 5,000 images of the MS-COCO testing set. Then, for each image, which is annotated with 5 captions, we retrieve 5 related images to build a collection, so that our testing set contains 5,000 collections with 6 images each. Lastly, we build the ground truth of each image collection in scene-graph form for the evaluation process.
Since image summarization aims to generate summarized information that describes the overall context of an image collection, and since ground truth for the proposed task is limited, we use Neural Motif [35], pre-trained on the VG200 dataset and selected by Scene-Graph Detection Recall (SGDet R@100), to generate a scene graph of each image in a collection. We then treat these scene graphs as the ground truth of each collection when evaluating the proposed method, so each collection has 6 ground-truth scene graphs.
B. Training Strategy
Given the lack of ground truth in scene-graph summarization datasets, we first train and evaluate scene-graph generation on single images from the VG200 dataset. In the training phase, we train the model following the VG200 dataset, where the numbers of object labels and predicates are 150 and 50, respectively. The learning rate is initialized to 0.12. We use Adam [58] for optimization and cross-entropy as the loss function. To pre-evaluate the model for multiple-image scene-graph summarization, we observe SGDet Recall to select the best checkpoint for the proposed method.
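The training configuration above can be summarized by the following minimal sketch; the model and data loader interfaces are assumed for illustration and are not specified by the paper.

```python
import torch

# Minimal sketch of the training setup described above (Adam optimizer, learning
# rate 0.12, cross-entropy loss). The model interface and loader are assumptions.
def train_one_epoch(model, train_loader, lr=0.12):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for images, gt_objects, gt_relations in train_loader:
        obj_logits, rel_logits = model(images)            # assumed model interface
        loss = criterion(obj_logits, gt_objects) + criterion(rel_logits, gt_relations)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```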
C. Evaluation
As the proposed method is modified from a single-image scene-graph generation approach, we evaluate it from two aspects. First, Multiple-Images Scene-Graph Summarization evaluates the proposed method for image-collection scene-graph summarization. Second, Single-Image Scene-Graph Generation confirms that the proposed method remains viable for single-image scene-graph generation. Lastly, in Benchmark for the Evaluation Process, we benchmark the proposed evaluation process to show its validity for scene-graph generation.
1) Multiple-Images Scene-Graph Summarization
For multiple-images scene-graph summarization, we evaluate the proposed method for image-collection scene-graph summarization on the MS-COCO dataset. Due to the lack of ground truth, we follow the common practice in evaluating scene-graph generation from three perspectives: "Coverage" [28], [51], "Diversity" [52], [53], and "Similarity" [49], [50]. For the Coverage evaluation, we follow graph theory to estimate the coverage of a generated scene graph with respect to the ground-truth scene graphs. For the Diversity evaluation, we implement two evaluation processes comprising graph diversity and Graph Edit Distance (GED) [59]. For the Similarity evaluation, we adopt a simple contrastive learning framework for connecting scene graphs and images (GICON) [60], an evaluation technique that learns the similarity between an image and a scene graph with or without bounding boxes. Since the proposed method focuses on image-collection summarization, we evaluate it only without bounding boxes. Lastly, we employ the evaluation process proposed in Section IV, SGSim, which evaluates the similarity of a summarized scene graph to the ground truth.
2) Single-Image Scene-Graph Generation
For single-image scene-graph generation, we evaluate the performance on VG200 compared with the baselines to ensure that the proposed method still yields good results. We use three scene-graph evaluation metrics: Scene Graph Classification Recall (SGCls Recall), which measures subjects, objects, and predicates using ground-truth bounding boxes; Predicate Classification Recall (PredCls Recall), which measures relationship prediction using ground-truth bounding boxes, subjects, and objects; and Scene Graph Detection Recall (SGDet Recall), which measures the prediction of subjects, objects, and predicates without using the ground truth.
3) Benchmark for the Evaluation Process
Here, we discuss the evaluation metrics used to ablate the evaluation process. As the proposed process is designed for evaluating scene-graph generation, we benchmark it on single-image scene-graph generation with the VG200 dataset by comparing it with other scene-graph generation baselines. As we aim to evaluate false-negative generation as well, we assess it against two evaluation metrics. First, Scene Graph Detection Recall (SGDet Recall) is a popular scene-graph evaluation metric. Second, GICON is an evaluation metric that learns the similarity between a generated scene graph and an image with or without bounding boxes.
D. Baselines
As discussed in the previous section, the evaluation is divided into three tasks: multiple-images scene-graph summarization, single-image scene-graph generation, and the ablation study on the evaluation process. In this section, we introduce the baseline methods corresponding to each of them.
1) Baseline for Multiple-Images Scene-Graph Summarization
To evaluate the proposed method in the multiple-images scene-graph summarization setting, we choose three baseline methods: Semantic Image Summarization (SImS) [2], Image Collection Captioning (ICC) [8], [9], and
2) Baseline for Single-Image Scene-Graph Generation
To evaluate the proposed method for single-image scene-graph generation, we choose four baseline methods: Iterative Message Passing (IMP) [41], which uses standard Recurrent Neural Networks (RNNs) with a message-passing process; Neural Motif [35], which is implemented based on the Stacked Motifs architecture; Transformer [42], which is based on causal inference; and Visual Context Tree (VCTree) [40], which takes advantage of structured object representations. All of the baseline models are trained on the VG200 dataset. Then, we observe the best checkpoints on SGCls Recall, PredCls Recall, and SGDet Recall for comparison with the proposed method on the VG200 dataset.
3) Baseline for Ablation Study on the Evaluation Process
For the ablation study on the evaluation method, we aim to benchmark the proposed evaluation process against other evaluation metrics. We choose four state-of-the-art scene-graph generation methods on the Visual Genome dataset: Neural Motif, Transformer, Relationship Detection Network (RelDN) [61], and Relation Transformer (RelTR) [43].
E. Results
We report the results of the proposed method on the three evaluation tasks: multiple-images scene-graph summarization, single-image scene-graph generation, and the ablation study on the evaluation process.
1) Multiple-Images Scene-Graph Summarization
In this section, we discuss the results of the proposed method on the image-collection summarization task. For comparison, we select the top-10 scores in three aspects: Coverage, Diversity, and Similarity. The results are shown in Table 1.
a: Coverage
For the Coverage evaluation, the coverage of subjects and objects (nodes) and predicates (edges) is evaluated based on graph theory [30]. The results in Table 1 show that
b: Diversity
For the Diversity evaluation, we use two metrics, Diversity and GED. Diversity refers to the similarity distance between a summarized scene graph and a ground-truth scene graph. The results in Table 1 show that the proposed method achieves the best scores in both Diversity and GED, whereas SImS achieves second place for Diversity and
c: Similarity
For the Similarity evaluation, Table 1 shows that the proposed method achieves the best scores compared to the other methods on GICON, which estimates the similarity between a scene graph and images, and on the proposed evaluation process, SGSim, which evaluates based on scene-graph contents. Meanwhile,
d: Qualitative Results
Qualitative results are shown in Fig. 5. They demonstrate that the proposed method performs well in finding relationships and estimating commonly occurring information. For example, the first example shows how the proposed method can find all common object information, such as sheep and cow, and further estimate the commonsense relationship between them based on the location, hill. In contrast, SImS and
Comparison of the proposed method with baseline methods in three examples. (A) Proposed shows a summarized scene graph generated by the proposed method. (B) SImS demonstrates that by Semantic Image Collection Summarization [2]. (C)
From the overall evaluation scores, the proposed method achieves better scores in Diversity and Similarity perspectives, whereas
e: Limitations
There are two main limitations of the proposed method. First, the relationships are not grounded in visual information but rather built from the commonsense knowledge graph. As such, generated summaries might not be fully related to the actual image collection. Second, the method might not scale well to large image collections, as it aims to estimate all possible relationships among all images, which can lead to high memory requirements for summarizing large image collections.
2) Single-Image Scene-Graph Generation
The single-image scene-graph generation results are shown in Table 2. Since the objective of the proposed method mainly concerns Scene Graph Detection (SGDet), we focus on this result when assessing the single-image scene-graph generation task. The SGDet R@100 results show that Neural Motif [35] and Transformer [42] achieve better results than the proposed method, while the proposed method achieves better results than IMP [41] and VCTree [40]. In contrast, the R@20 and R@50 results show that the proposed method achieves better results only compared with IMP [41]. As the proposed method aims to enhance relation prediction toward unseen relationships, it is not restricted to the ground truth in single-image scene-graph generation, as reflected in the results. Since the proposed method shows better scores than IMP in the SGDet evaluation, than RelDN in SGCls, and than IMP and RelDN in PredCls, it remains viable for single-image scene-graph generation even if it cannot overcome the other scene-graph generation baselines.
However, since the proposed method targets multiple-image summarization, this out-of-task evaluation was performed purely to understand the limitations of the approach.
3) Ablation Study on the Evaluation Process
We benchmark our evaluation process against existing graph-oriented (graph structure) and graph-similarity-oriented (similarity of vertices and edges) evaluation methods. In the benchmark process, we construct a benchmark for single-image scene-graph generation on the VG200 dataset and analyze four models: Neural Motif, Transformer, VCTree, and RelTR. For the graph-oriented evaluation, we use Scene Graph Detection Recall (R@20, R@50, R@100) as the metric. For the graph-similarity-oriented evaluation, we use GICON, a learnable graph-similarity metric, for evaluation with bounding boxes (W/ Bounding Box) and without bounding boxes (Location Free). For each evaluation, we find the top-
The benchmark in Table 3 reports results by the number of retrieved triplets with confidence scores, which relates to the rise of the scores. However, a high number of triplets does not always increase the similarity in the evaluation process, as shown for Transformer and RelTR. As RelTR infers a fixed-size set of triplets, the accuracy is not significantly improved even when the number is increased from 30 to 50 triplets. Meanwhile, Transformer shows little improvement when increasing the number of retrieved triplets, while the other methods show significant improvement. Consequently, since the other evaluation metrics, GICON and SGDet, focus on precision and recall, a high number of retrieved triplets tends to result in high scores.
Conclusion
We introduced a scene-graph summarization method based on the idea of enhancing the relation predictor in the training process for an image collection by incorporating external knowledge. The results show that the proposed method can generate a summarized scene graph that performs well from the diversity and similarity perspectives compared with other baseline methods, while it still lacks accuracy in terms of coverage. Additionally, the experimental results showed the advantage of using external knowledge in grasping the overall context of an image collection and finding common relationships across images, which is beneficial for summarization tasks, especially photo album summarization. However, a limitation is the lack of actual ground truth in the evaluation process. In the future, we plan to build a more suitable dataset for the image-collection scene-graph summarization task.
ACKNOWLEDGMENT
The computation was carried out using the General Projects on the supercomputer “Flow” at Information Technology Center, Nagoya University.