
Image-Collection Summarization Using Scene-Graph Generation With External Knowledge




Abstract:

Summarization tasks aim to summarize multiple pieces of information into a short description or representative information. A text summarization task summarizes textual information into a short description, whereas an image collection summarization task summarizes an image collection into images or a textual representation, in which the challenge is to understand the relationships between images. In recent years, scene-graph generation has shown the advantage of describing the visual contexts of a single image, and incorporating external knowledge into the scene-graph generation model has also given effective directions for unseen single-image scene-graph generation. While external knowledge has been implemented in related work, it is still challenging to use this information efficiently for relationship estimation during the summarization. Following this trend, in this paper, we propose a novel scene-graph-based image-collection summarization model that aims to generate a summarized scene graph of an image collection. The key idea of the proposed method is to enhance the relation predictor toward relationships between images in an image collection, incorporating knowledge graphs as external knowledge for training the model. With this approach, we build an end-to-end framework that can generate a summarized scene graph of an image collection. To evaluate the proposed method, we also build an extended annotated MS-COCO dataset for this task and introduce an evaluation process that focuses on estimating the similarity between a summarized scene graph and ground-truth scene graphs. Traditional evaluation focuses on calculating precision and recall scores, which involve true positive predictions without balancing precision and recall. Meanwhile, the proposed evaluation process focuses on calculating the F-score of the similarity between a summarized scene graph and ground-truth scene graphs, which aims to balance both false positives and false negatives. Experimental results s...
An image-collection summarization using a scene-graph generation approach incorporating an external knowledge graph to generate a summarized scene graph.
Published in: IEEE Access ( Volume: 12)
Page(s): 17499 - 17512
Date of Publication: 30 January 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

With the increase of digital content in the real world, especially images, image understanding tasks such as classification and retrieval have become more important than ever to make the content easy to access. However, most existing research focuses on single-image understanding, whereas understanding an image collection is still challenging. In recent years, the task of understanding an image collection has been addressed in various applications [1], such as semantic image retrieval [2], [3], Web-image concept understanding [4], [5], and multiple-image summarization [6], [7], [8], [9]. Generating a summarized scene graph also shows advantages in visual storytelling [10], [11] and video summarization [12] applications. The typical first stage in understanding an image collection is to understand the overall context and find a representation of it, e.g., in the form of words, sentences, or scene graphs. Compared to other representations, a scene graph has the advantage of representing the contexts of images by describing objects and their relationships. Scene-graph generation is used in various tasks such as single-image captioning [13], [14], image retrieval [3], [15], [16], and multiple-image context summarization [7]. However, scene-graph generation is commonly introduced to generate a scene graph of a single image. In contrast, summarizing an image collection into a summarized scene graph shows advantages in understanding the overall context and using it in image querying applications [6]. A common challenge in scene-graph summarization is estimating the relationships between object category pairs detected in different images. To improve the summarization of an image collection, we aim to understand the relationships between objects detected in different images by employing external knowledge graphs. Figure 1 shows an example of summarizing an image collection into a combined scene-graph representation, which can describe the overall context by estimating similar concepts among the visual objects. For example, we humans can find the commonly occurring objects of an image collection, which are cow, sheep, hill, and street, and their relationships such as cow-on-hill and sheep-on-street. External knowledge tells us that both street and hill are places, whereas sheep and cow are animals. Then, based on the knowledge graph, we can understand that a hill is a common place for animals. Therefore, we can assume the contexts cow-on-hill, sheep-on-hill, and sheep-near-street. This example shows the advantage of using external knowledge in finding relationships across multiple images.

FIGURE 1. Example of generating a scene-graph representation of an image collection. The dotted lines represent the semantic relationships, and the solid lines show the inferred relations of the summarized scene graph.

To generate a summarized scene graph of an image collection, a naïve approach [18], [19] summarizes images by incorporating external knowledge. However, the challenges of incorporating external knowledge are defining reasonable knowledge and estimating appropriate relationships between object categories. Based on these motivations, we previously proposed a scene-graph summarization method using graph theory for generating a caption of an image collection [8], [9]. There, we needed a concept generalization process that finds common concept words in an image collection to refine the final caption, which was performed by generalizing words. However, generalization reduces detail in the final caption, e.g., replacing cow or sheep with animal instead of describing both as summarized information such as cow and sheep. Therefore, the common challenge is to find relationships between different objects without losing detail. For example, in the case of sheep on street and cow on hill, if we can utilize external knowledge stating that both street and hill are places for animals, we can conclude that both are in similar contexts, such as places for living. Thus, we can infer indirect relationships such as sheep-on-hill. Based on this idea, the proposed method enhances the relation predictor of the scene-graph generation process so that it can generate generalized relationships of objects, grasping the relationships between different objects of the same category across images. We follow the idea of using external knowledge to generate scene graphs for unseen images [20], [21].

To realize a scene-graph summarization method for an image collection using external knowledge, there are three hurdles. First, we need to model external knowledge for the training process, for which we incorporate ConceptNet [22], a knowledge graph of commonsense semantic information. Second, we need to integrate the knowledge graph into the relation predictor of the scene-graph generation model. Lastly, we need to construct a summarized scene graph by combining the information of all images and then generate a final scene graph. Figure 2 compares conventional methods, which apply a summarization process in the inference phase after scene-graph generation, with the proposed end-to-end method. It demonstrates the case of finding a relationship between two sub-graphs by joining their common location, hill.

FIGURE 2. Comparison between (A) other summarization methods [2], [8], [9], [17]: scene-graph generation with a summarization process, and (B) the proposed method: end-to-end scene-graph summarization.

Furthermore, whereas a typical scene-graph generation method obtains the final scene graph based only on confidence scores, estimating the confidence score of each relationship in a summarized scene graph is also challenging. To improve the estimation process, we employ PageRank [23] to re-calculate the node scores for selecting relationships. A remaining challenge is the lack of a dataset specific to the scene-graph summarization task. We hence construct a dataset for evaluating the proposed method based on the MS-COCO dataset [24], a popular image captioning dataset widely used across various tasks including image retrieval and image summarization. To evaluate a summarized scene graph, we introduce an evaluation process that computes a similarity score based on the F-score, whereas previous works focus only on precision. The proposed evaluation process can thereby account for false negatives.

Our contributions can be summarized as follows:

  • We propose a scene-graph summarization method that generates a summarized scene graph of an image collection, including indirect relationships inferred by integrating external knowledge graphs into the relation prediction process.

  • We introduce a sub-graph confidence score for estimating a summarized scene graph of an image collection.

  • We introduce an evaluation process for evaluating a summarized scene graph by calculating the F-score, which accounts for both false positives and false negatives of a generated scene graph.

SECTION II.

Related Work

In this section, we review related work on three topics; Image Collection Summarization which discusses work that aims to generate summarized information of an image collection, Scene-Graph Generation which discusses methods to generate image information in graph form, and Knowledge Graph which introduces the external knowledge that is used in the proposed method.

A. Image Collection Summarization

Image collection summarization is the task of generating a representative summary of an image collection. Traditionally, it aims to find representative information in the form of an image, textual, or scene-graph representation.

a: Image Representation

Summarizing an image collection is typically introduced in the photo album summarization task, which aims to find an image that represents an image album. Yu et al. [10] proposed a model composed of three hierarchically attentive Recurrent Neural Networks (RNNs) to encode album photos, select representative photos, and generate a story. Wang et al. [25] proposed a model with a hierarchical photo-scene encoder and reconstructor for generating an album story. Moreover, many works find a representative image of an image collection using a clustering algorithm, such as Self-Organizing Map (SOM) [26], [27] or $k$-Medoids [17], to cluster images and then treat some of them as an image representation of the collection.

b: Textual Representation

Textual information is a popular summarization form for an image collection summarization task, represented as keywords, tags, phrases, or sentences. In summarizing an image collection into keywords or tags, Samani and Moghaddam [28] proposed a semantic summarization method for an image collection that utilizes a domain ontology as an input of the system, providing knowledge about the concept domain, e.g., Colosseum and Trevi Fountain. Zhang et al. [29] proposed a model to analyze an image collection and generate appropriate visual summaries and textual topics, e.g., sunset, sky, and sun. For summarizing an image collection into phrases, Trieu et al. [7] proposed a new task named multi-image summarization, which aims to generate a descriptive summary of an image collection, such as styles of bags. They also introduced a new dataset for this task by collecting 2.1 million images from Web pages and then building collections of images, each consisting of at least five images. Li et al. [6] introduced a new task called context-aware captioning, which aims to describe an image collection in another context from different image collections. We [8], [9] introduced a method to generate a caption of an image collection from a summarized scene graph built on graph theory [30].

c: Scene-Graph Representation

As scene graphs are widely used for describing visual objects and their relationships in a single image [31], they are also used for describing multiple images. Pasini et al. [2] proposed an image-collection summarization method based on frequent subgraph mining that represents an image collection in sub-graph form on the MS-COCO dataset [24]. Yang et al. [32] introduced a challenging task, named Panoptic Video Scene Graph Generation (PVSG), which aims to generate a summarized scene graph of real-world data, and contributed a new panoptic video dataset for this task.

In the proposed method, we aim to describe an image collection by a scene graph, focusing on integrating external knowledge into the learning process.

B. Scene-Graph Generation

Scene-graph generation [31] is a popular technique for describing relationships between objects in an image. The relationships of objects are generally represented as triplets consisting of subject, predicate, and object. A common scene-graph generation architecture is divided into two main processes: object detection to detect the objects inside the image and relationship prediction to find the edges between the objects. In recent years, it has been widely introduced and implemented on the Visual Genome dataset [33] and the MS-COCO dataset [24]. In addition, scene-graph generation has been adapted to various applications, such as image captioning [34] and image retrieval [3], and has been shown to improve their results. Various techniques have been introduced for scene-graph generation. Neural Motif [35] is built with Faster R-CNN [36] and various backbones, such as ResNet-50 [37] and ResNeXt-101 [38], and propagates features through a Bidirectional Long Short-Term Memory (BiLSTM) [39] for predicting relations. VCTree [40] is a scene-graph generation technique composed of dynamic tree structures, which shows the advantage of using a binary tree for finding co-occurrences and common relationships between objects by allowing a dynamic structure. Iterative Message Passing (IMP) [41] is an end-to-end scene-graph model using standard RNNs that improves the prediction via message passing. To address the long-tail problem of scene-graph datasets, more recent work [42] introduces techniques to reduce dataset bias. Relation Transformer for Scene Graph Generation (RelTR) [43] is a one-stage end-to-end scene-graph generation technique that uses an attention mechanism and predicts a fixed-size set of subjects, objects, and relationships to generate a scene graph.

In the proposed method, we use scene graphs as a means to model the relationships between images in an image collection.

C. Knowledge-Graph

A knowledge-base is widely used to enrich models, especially text-generation models [44]. ConceptNet [22] and the Wikipedia dataset are popular knowledge-bases used in generation processes. ConceptNet is a knowledge-graph that represents general knowledge and commonsense information, while the Wikipedia dataset is structured knowledge data with detailed information on each topic. In recent years, knowledge-graphs have become a popular knowledge-base in various generation processes, mainly focusing on capturing commonsense reasoning during generation. To tackle the long-tail issue of scene-graph generation mentioned above, integrating knowledge-graphs into the generation is a widely introduced strategy, and results show its advantage. Moreover, knowledge-graphs have also been applied to image retrieval for reasoning about the semantic context and generalizing the concepts inside an image [29].

In the proposed method, we use ConceptNet, a knowledge-graph, to enhance the relation predictor for finding unseen relationships across images.

SECTION III.

Proposed Method: Scene-Graph Summarization Model

From the idea of enhancing the relation predictor with external knowledge for predicting unseen relationships, we build the proposed method by adapting an existing scene-graph generation method, Neural Motif [35]. The proposed method starts with extracting visual features from each image and then finds contextualized representations of each image following the Neural Motif approach. Next, we incorporate external knowledge into all contextualized representations. Lastly, we predict the relationship of each object in the contextualized representations and reconstruct them as a summarized scene graph as illustrated in Fig. 3.

FIGURE 3. Overview of the proposed method consisting of five components: (A) Object Detection detects features from each image in an image collection, (B) Object Context Construction constructs the contextualized representation of the estimated context of each image, (C) External Knowledge Integration finds the knowledge graphs based on the object contexts and integrates the knowledge graphs and contextualized representations, (D) Relation Prediction predicts relationships between each combination of object contexts and contextualized representations, and (E) Sub-Graph Confidence Score Calculation calculates scores of all objects from the relation prediction and then generates a summarized scene graph as an output.

The proposed method has five main components. The Object Detection component detects the visual features of images and is modified to handle all images in a collection. The Object Context Construction component finds the contextualized representations of the images. To generate a summarized scene graph from the contextualized representations of an image collection, we introduce the External Knowledge Integration component to find the indirect relationships between detected objects, together with an encoder that feeds them into the Relation Prediction component, which predicts the relationships between objects. Lastly, we introduce the Sub-Graph Confidence Score Calculation component that calculates the confidence scores of objects.

A. Object Detection

The first component is the Object Detection component that detects a set of region proposals; Faster R-CNN [45] with ResNet-101 [37] is used as a detector backbone which shows good performance in scene-graph generation [42] compared with other backbones [31]. Following the scene-graph generation, from each image, a set of region proposals $B = \left \{{ b_{1},\ldots,b_{n} }\right \}$ is predicted. Each region proposal $b_{i}$ consists of a feature vector $\textbf {f}_{i}$ and an object label probability $\textbf {l}_{i}$ for the training phase.

In the inference phase, we modify the object detector to parse multiple images into the Relation Prediction component that generates a summarized scene graph of an image collection. Based on a single-image scene-graph generation model, we build an object detector backbone to detect image features $\mathbf{f}_{n}$ and proposals $b_{n}$. Then, we combine all image features and all proposals as:
\begin{align*} F &= \left\{[\mathbf{f}_{n,1}, \ldots, \mathbf{f}_{n,M}]\right\}_{n=1,\ldots,N}, \tag{1}\\ B &= \left\{[b_{n,1}, \ldots, b_{n,M}]\right\}_{n=1,\ldots,N}, \tag{2}\end{align*}
where $F$ is the set of feature vectors of all images, $N$ is the number of images, $M$ is the number of region proposals of each image, and $B$ is the set of proposals of all images. Each region proposal consists of a feature vector $\mathbf{f}$ and an object label probability $\mathbf{l}$, which are used in the Object Context Construction component.
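As a concrete illustration of Eqs. (1)-(2), the following minimal PyTorch sketch concatenates per-image detector outputs into collection-level sets; the tensor shapes and toy inputs are assumptions for illustration, not the actual detector interface.

```python
import torch

def combine_image_features(per_image_feats, per_image_boxes):
    """Concatenate per-image detector outputs into collection-level sets (Eqs. (1)-(2)).

    per_image_feats: list of N tensors, each (M_i, D) feature vectors f_{n,m}
    per_image_boxes: list of N tensors, each (M_i, 4) region proposals b_{n,m}
    Returns F with shape (sum_i M_i, D) and B with shape (sum_i M_i, 4).
    """
    F = torch.cat(per_image_feats, dim=0)  # all feature vectors of the collection
    B = torch.cat(per_image_boxes, dim=0)  # all region proposals of the collection
    return F, B

# Toy usage: a collection of 3 images with 2, 3, and 1 proposals (D = 5).
feats = [torch.randn(2, 5), torch.randn(3, 5), torch.randn(1, 5)]
boxes = [torch.rand(2, 4), torch.rand(3, 4), torch.rand(1, 4)]
F, B = combine_image_features(feats, boxes)
print(F.shape, B.shape)  # torch.Size([6, 5]) torch.Size([6, 4])
```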

B. Object Context Construction

The second component is the Object Context Construction component that constructs a contextualized representation of a set of region proposals by concatenating them into a linear sequence sorted by detected locations, $[(b_{1}, \mathbf{f}_{1}, \mathbf{l}_{1}), \ldots, (b_{N}, \mathbf{f}_{N}, \mathbf{l}_{N})]$. Then, a bidirectional LSTM [39] is used as:
\begin{equation*} C = \textrm{biLSTM}([\mathbf{f}_{n}; W_{1}\mathbf{l}_{n}]_{n=1,\ldots,N}), \tag{3}\end{equation*}
where $C$ is a set of object contexts, in which each object context contains the hidden state of each element in the linearization of $B$, $W_{1}$ is a parameter matrix that maps the label distribution, and $\mathbf{l}_{n}$ is a probability vector of object labels. Each object context is used to decode a class label with an LSTM as:
\begin{align*} \mathbf{h}_{n} &= \textrm{LSTM}_{n}([\mathbf{c}_{n}; \widehat{\mathrm{o}}_{n-1}]), \tag{4}\\ \widehat{\mathrm{o}}_{n} &= \textrm{onehot}(\textrm{argmax}(W_{o}\mathbf{h}_{n})) \in \mathbb{R}^{|C|}, \tag{5}\end{align*}
where $\mathbf{c}_{n}$ is an object context vector in the set of object contexts $C$, $\mathbf{h}_{n}$ is a hidden state that is used in the relation predictor, onehot($\cdot$) embeds a scalar value into a one-hot vector, and $W_{o}$ is a parameter matrix that maps the hidden state.
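The following is a minimal sketch of the object-context encoder of Eq. (3) in PyTorch; all dimensions (feature size, number of classes, hidden size) are illustrative assumptions rather than the values used in the paper.

```python
import torch
import torch.nn as nn

class ObjectContextEncoder(nn.Module):
    """Sketch of Eq. (3): a biLSTM over [f_n ; W1 l_n] for a linearized proposal sequence."""

    def __init__(self, feat_dim=512, num_classes=151, label_dim=128, hidden=256):
        super().__init__()
        self.W1 = nn.Linear(num_classes, label_dim, bias=False)  # maps the label distribution l_n
        self.bilstm = nn.LSTM(feat_dim + label_dim, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, feats, label_probs):
        # feats: (1, N, feat_dim); label_probs: (1, N, num_classes)
        x = torch.cat([feats, self.W1(label_probs)], dim=-1)
        C, _ = self.bilstm(x)  # object contexts c_n, shape (1, N, 2 * hidden)
        return C

encoder = ObjectContextEncoder()
C = encoder(torch.randn(1, 6, 512), torch.softmax(torch.randn(1, 6, 151), dim=-1))
print(C.shape)  # torch.Size([1, 6, 512])
```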

C. External Knowledge Integration

Based on the idea of integrating external knowledge to enhance the relation predictor, there are two stages. First, we build a knowledge-graph based on the external knowledge from ConceptNet [22]. Then, we build an encoding layer that encodes the external knowledge for incorporation into the relation predictor; the knowledge-graphs are built from the class labels of the set of object contexts, $C$.

1) Knowledge-Graph Construction

The objective here is to build a word-embedding knowledge-graph from ConceptNet. Since ConceptNet provides various kinds of relation information, we build a knowledge-graph focusing on the semantic relations “relatedTo”, “similarTo”, and “synonym” to improve the relation prediction of similar objects. In the building process, we first initialize the word collection for retrieving semantic relations with the 150 labels of VG200 [46]. Given a class pair $(x, y)$, we take the corresponding nodes $(V_{x}, V_{y})$. Next, we gather all possible semantic paths from $V_{x}$ to $V_{y}$ as $P_{(x,y)}$. Lastly, we employ GloVe word embeddings [47] to encode all words.
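A minimal sketch of this construction step is shown below. It queries the public ConceptNet API for the three semantic relations and loads GloVe vectors from a local text file; the endpoint parameters, file path, and helper names are assumptions for illustration and may differ from the actual implementation.

```python
import requests
import numpy as np
import networkx as nx

SEMANTIC_RELS = {"/r/RelatedTo", "/r/SimilarTo", "/r/Synonym"}

def conceptnet_semantic_graph(x, y, limit=50):
    """Query the public ConceptNet API for semantic edges linking class labels x and y.
    The endpoint and response fields follow the documented API but may change over time."""
    resp = requests.get("https://api.conceptnet.io/query",
                        params={"node": f"/c/en/{x}", "other": f"/c/en/{y}",
                                "limit": limit}).json()
    G = nx.Graph()
    for edge in resp.get("edges", []):
        if edge["rel"]["@id"] in SEMANTIC_RELS:
            G.add_edge(edge["start"]["label"].lower(), edge["end"]["label"].lower())
    return G

def load_glove(path):
    """Load GloVe vectors from a plain-text file ('word v1 v2 ... vD' per line)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Example (file path and labels are placeholders):
# glove = load_glove("glove.6B.300d.txt")
# G_xy = conceptnet_semantic_graph("cow", "hill")
# node_embeddings = {w: glove[w] for w in G_xy.nodes if w in glove}
```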

2) Knowledge-Graph Integration

For the External Knowledge Integration component, we first build a Graph Convolutional Network (GCN) with the GlobalSortPool operator [48], which learns from nodes according to the graph topology instead of summing them up, as an encoder for a knowledge-graph. Then, the knowledge-graph of each class pair $(x, y)$ in vector form is encoded into a knowledge feature vector as:
\begin{equation*} \mathbf{e}_{\textrm{kb}}^{(x,y)} = \mathrm{GlobalSortPool}(\mathbf{N}^{(x,y)}), \tag{6}\end{equation*}
where $\mathbf{N}^{(x,y)}$ represents all embedded nodes from $P_{(x,y)}$.

In the training and evaluation processes, we first retrieve all predicted class pairs $(x, y)$ from the object contexts. Next, we retrieve all possible connection paths $P_{(x,y)}$ from the knowledge-graph. Lastly, all of them are encoded into $\mathbf{e}_{\textrm{kb}}^{(x,y)}$ and concatenated into each contextualized representation to estimate relationships, as discussed in the Relation Prediction component.
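The sketch below illustrates the encoding of Eq. (6) with a single graph-convolution layer followed by a sort-pooling readout that keeps the top-$k$ nodes, mimicking the GlobalSortPool idea; the layer sizes, $k$, and the plain-PyTorch implementation are assumptions, not the paper's exact GCN.

```python
import torch
import torch.nn as nn

class KnowledgeGraphEncoder(nn.Module):
    """Sketch of Eq. (6): one mean-aggregation graph convolution followed by a
    sort-pooling readout that keeps the top-k nodes (a GlobalSortPool-like step)."""

    def __init__(self, in_dim=300, hidden=128, k=8):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden)
        self.k = k

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, in_dim); adj: (num_nodes, num_nodes) incl. self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = torch.tanh(self.lin(adj @ node_feats / deg))         # graph convolution
        order = torch.argsort(h[:, -1], descending=True)[:self.k]
        pooled = h[order]                                        # sort nodes, keep top-k
        if pooled.size(0) < self.k:                              # pad small graphs
            pad = torch.zeros(self.k - pooled.size(0), h.size(1))
            pooled = torch.cat([pooled, pad], dim=0)
        return pooled.flatten()                                  # knowledge embedding e_kb^(x,y)

encoder = KnowledgeGraphEncoder()
n = 5
adj = (torch.rand(n, n) > 0.5).float()
adj = ((adj + adj.T + torch.eye(n)) > 0).float()                 # symmetrize, add self-loops
e_kb = encoder(torch.randn(n, 300), adj)
print(e_kb.shape)  # torch.Size([1024]) = k * hidden
```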

D. Relation Prediction

The object contexts obtained by the previous component are used in the Relation Prediction component, in which the set of regions $B$ and the objects are encoded by a bidirectional LSTM as:
\begin{equation*} E = \textrm{biLSTM}([\mathbf{c}_{n}; W_{2}\widehat{\mathrm{o}}_{n}]_{n=1,\ldots,N}), \tag{7}\end{equation*}
where $E$ is a set of edge contexts, in which each edge context contains the states of the bounding-box regions, and $W_{2}$ is a mapping parameter of $\widehat{\mathrm{o}}_{n}$. Each edge context is combined with the knowledge embedding to predict the relation of each pair as:
\begin{align*} \mathbf{g}_{i,j} &= (W_{h}\mathbf{e}_{i})(W_{t}\mathbf{e}_{j})\mathbf{f}_{i,j}, \tag{8}\\ \mathbf{r}_{i,j} &= \textrm{argmax}([\mathbf{g}_{i,j}; \mathbf{e}_{\textrm{kb}}^{(i,j)}]W_{r}), \tag{9}\end{align*}
where $\mathbf{e}_{i}$ and $\mathbf{e}_{j}$ are the edge context vectors of the head and tail, $W_{h}$ and $W_{t}$ are parameters of heads and tails, $\mathbf{f}_{i,j}$ is a feature vector for the union of the two bounding boxes, $W_{r}$ is a parameter that maps to the relation predictor, $\mathbf{e}_{\textrm{kb}}^{(i,j)}$ is the knowledge embedding vector, and $\mathbf{r}_{i,j}$ is a relation vector which is transformed into a relation and a probability score using softmax as the activation function.
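A minimal sketch of Eqs. (8)-(9) is given below; the fusion of head/tail projections, union-box feature, and knowledge embedding follows the equations, but all dimensions and the module layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationPredictor(nn.Module):
    """Sketch of Eqs. (8)-(9): head/tail projections of edge contexts are fused with
    the union-box feature and the knowledge embedding, then scored over predicates."""

    def __init__(self, edge_dim=512, union_dim=512, kb_dim=1024, num_predicates=51):
        super().__init__()
        self.W_h = nn.Linear(edge_dim, union_dim)   # head projection
        self.W_t = nn.Linear(edge_dim, union_dim)   # tail projection
        self.W_r = nn.Linear(union_dim + kb_dim, num_predicates)

    def forward(self, e_i, e_j, f_union, e_kb):
        g_ij = self.W_h(e_i) * self.W_t(e_j) * f_union            # Eq. (8), elementwise fusion
        logits = self.W_r(torch.cat([g_ij, e_kb], dim=-1))        # Eq. (9)
        probs = torch.softmax(logits, dim=-1)
        return probs.argmax(dim=-1), probs

predictor = RelationPredictor()
rel, probs = predictor(torch.randn(1, 512), torch.randn(1, 512),
                       torch.randn(1, 512), torch.randn(1, 1024))
print(rel.item(), probs.max().item())
```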

E. Sub-Graph Confidence Score Calculation

Because the proposed method takes multiple images and aims to generate all possible relationships across them, we need to re-estimate the relationship scores in the generated scene graph instead of relying only on confidence scores. The estimation computes triplet scores from subject, predicate, and object confidences by analogy with PageRank [23].

To calculate a score, we first compute a summarized score for each object by summing its confidence scores as:
\begin{equation*} \mathrm{obj\_score}_{i} = \sum_{j=0}^{N}\mathrm{obj\_confidence}_{i,j}, \tag{10}\end{equation*}
where $N$ is the number of occurrences of object $i$. From the object scores, we compute the mean object score $\mathrm{mean_{obj}}$ over all objects as:
\begin{equation*} \mathrm{mean_{obj}} = \frac{1}{M} \sum_{i=0}^{M}\mathrm{obj\_score}_{i}, \tag{11}\end{equation*}
where $M$ is the number of unique objects. The mean object score is used to filter out objects whose scores are lower than the mean.

Lastly, we collect the object pairs whose relation scores are greater than the mean score and employ PageRank to calculate the confidence score of each object; the process is detailed in Algorithm 1.

Algorithm 1: Sub-Graph Score
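The following sketch illustrates the sub-graph scoring of Eqs. (10)-(11) followed by PageRank re-scoring with networkx; the triplet tuple layout and the mean-based filtering step are assumptions made for illustration.

```python
from collections import defaultdict
import networkx as nx

def subgraph_scores(triplets):
    """Re-score a summarized scene graph with PageRank.

    triplets: list of (subject, predicate, object, subj_conf, obj_conf, rel_score).
    Returns PageRank scores per object label and the filtered relation graph.
    """
    obj_score = defaultdict(float)
    for s, _, o, cs, co, _ in triplets:        # Eq. (10): sum confidences per object label
        obj_score[s] += cs
        obj_score[o] += co
    mean_obj = sum(obj_score.values()) / len(obj_score)   # Eq. (11)

    G = nx.DiGraph()
    for s, p, o, _, _, rel in triplets:        # keep pairs whose objects pass the mean filter
        if obj_score[s] >= mean_obj and obj_score[o] >= mean_obj:
            G.add_edge(s, o, predicate=p, weight=rel)
    pagerank = nx.pagerank(G, weight="weight") if G.number_of_nodes() else {}
    return pagerank, G

triplets = [("cow", "on", "hill", 0.9, 0.8, 0.7),
            ("sheep", "on", "hill", 0.8, 0.8, 0.6),
            ("sheep", "near", "street", 0.8, 0.4, 0.3)]
scores, graph = subgraph_scores(triplets)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```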

SECTION IV.

Evaluation Process

Due to the lack of ground truth for this task, we use common metrics from image-collection scene-graph summarization [2], [19]: the similarity [16], [49], [50], coverage [28], [51], and diversity [52], [53] of a generated scene graph with respect to the ground-truth scene graph of each image. However, most evaluation techniques focus on precision, so the evaluation score tends to increase with the quantity of generated results. We therefore introduce an evaluation process that focuses on the quality of a summarized scene graph using the F-score, based on estimating the similarity between scene graphs. Among the various approaches for estimating the similarity between scene graphs, word-embedding-based techniques have shown better qualitative estimation for scene-graph generation [50].

Given a ground-truth scene graph $\mathbf {G} = \left \{{t_{1}, \ldots, t_{n}}\right \}$ consisting of ground-truth triplets, a generated scene graph $\widehat {\mathbf {G}} = \left \{{\widehat {\textrm {t}}_{1}, \ldots,\widehat {\textrm {t}}_{m}}\right \}$ consisting of generated triplets, and each triplet in a scene graph denoted as $\textrm {t} = \langle s, p, o \rangle $ , where $s$ is subject, $p$ is predicate, and $o$ is object, we first employ GloVe [47] word embedding to transform all words in each triplet into token representation in a vector form. Then, we compute the similarity score of each triplet of a generated scene graph and each triplet of a ground-truth scene graph. Figure 4 illustrates the evaluation process.

FIGURE 4. Overview of the evaluation process consisting of three components: from candidate triplets and reference triplets, Word Embedding encodes both into a vector form, (A) Triplet Score Calculation calculates all triplet similarities between candidates and references, (B) Maximum Value Selection finds the maximum value of the similarity scores of each triplet pair, and (C) Graph Similarity Score Calculation calculates the final score.

The calculation process is adapted from BERTScore [54]. In the BERTScore calculation, Bidirectional Encoder Representations from Transformers (BERT) [55] embeddings are first used to encode all words of the candidate and reference sentences into vectors. Next, the similarity scores between all words are calculated, and the maximum score of each word is selected based on greedy matching. Lastly, the precision score, recall score, and F-score are calculated as evaluation metrics of the candidate sentence. Through this process, we can also evaluate the false negatives of a candidate scene graph, whereas other evaluation techniques mainly focus on precision. Thus, in the proposed evaluation process, we first encode all candidate and reference triplets into vector forms. Next, we calculate the similarity score between each reference triplet and all candidate triplets. Then, we select the maximum similarity score for each calculation. Lastly, we calculate the precision score, the recall score, and consequently the F-score as the scene graph similarity score. Details of each step are described below.

A. Triplet Score Calculation

Given a generated triplet in a vector representation $\widehat{t}$ and a ground-truth triplet in a vector representation $t$, each triplet comprises tokens of a subject, a predicate, and an object. To calculate the similarity between token representations, we estimate the similarity between each ground-truth subject or object and the corresponding generated subject or object by calculating the cosine similarity $S$ as follows:
\begin{equation*} S(\mathbf{a},\mathbf{b})=\frac{\mathbf{a}\cdot\mathbf{b}}{\left\|\mathbf{a}\right\|\cdot\left\|\mathbf{b}\right\|}, \tag{12}\end{equation*}
where $\mathbf{a}$ and $\mathbf{b}$ are the corresponding embeddings of a subject or object pair, $\mathbf{a}\cdot\mathbf{b}$ is the dot product between vectors $\mathbf{a}$ and $\mathbf{b}$, and $\left\|\mathbf{a}\right\|$ and $\left\|\mathbf{b}\right\|$ are the L2 norms of vectors $\mathbf{a}$ and $\mathbf{b}$, respectively.

While the similarity between subjects or objects is calculated from word similarity, the similarity between predicates is estimated by exact matching. The calculation of the scene-graph similarity focuses on the relationships between objects, which can reduce redundant information [49]. We compute the similarity between predicates $S_{\textrm{pred}}(\mathbf{p}_{i},\widehat{\mathbf{p}}_{j})$ as:
\begin{align*} S_{\textrm{pred}}(\mathbf{p},\widehat{\mathbf{p}}) = \begin{cases} 1 & \mathbf{p} = \widehat{\mathbf{p}},\\ 0 & \mathbf{p} \neq \widehat{\mathbf{p}}. \end{cases} \tag{13}\end{align*}

In the following, given all similarity scores of a triplet pair, consisting of the subject similarity score $S_{\textrm{sub}}$, the predicate similarity score $S_{\textrm{pred}}$, and the object similarity score $S_{\textrm{obj}}$, we combine them into a single value by calculating the mean score $M_{\mathrm{sim}}$ as:
\begin{equation*} M_{\mathrm{sim}}(\mathrm{t}_{i},\widehat{\mathrm{t}}_{j}) = \textrm{mean}\left(\left\{S_{\textrm{sub}}(\mathbf{s}_{i},\widehat{\mathbf{s}}_{j}), S_{\textrm{pred}}(\mathbf{p}_{i},\widehat{\mathbf{p}}_{j}), S_{\textrm{obj}}(\mathbf{o}_{i},\widehat{\mathbf{o}}_{j})\right\}\right). \tag{14}\end{equation*}
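A minimal sketch of Eqs. (12)-(14) follows, using toy vectors in place of GloVe embeddings; the helper names and the tiny embedding table are assumptions for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Eq. (12): cosine similarity between two word embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_similarity(t, t_hat, emb):
    """Eqs. (12)-(14): mean of subject/object embedding similarity and exact predicate match.
    t and t_hat are (subject, predicate, object) word tuples; emb maps words to vectors."""
    s, p, o = t
    s_hat, p_hat, o_hat = t_hat
    s_sub = cosine_sim(emb[s], emb[s_hat])
    s_obj = cosine_sim(emb[o], emb[o_hat])
    s_pred = 1.0 if p == p_hat else 0.0        # Eq. (13)
    return (s_sub + s_pred + s_obj) / 3.0      # Eq. (14)

# Toy 2-d embeddings standing in for GloVe vectors.
emb = {"cow": np.array([1.0, 0.2]), "sheep": np.array([0.9, 0.3]),
       "hill": np.array([0.1, 1.0]), "street": np.array([0.2, 0.8])}
print(round(triplet_similarity(("cow", "on", "hill"), ("sheep", "on", "street"), emb), 3))
```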

B. Maximum Value Selection

From the similarity scores between triplets, we take the maximum matching score for each ground-truth triplet and then sum all maximum matching scores into $\mathrm{Max}_{\mathrm{SGSim}}$, where each ground-truth triplet $t$ is matched to its most similar generated triplet $\widehat{t}$ as:
\begin{equation*} \mathrm{Max}_{\mathrm{SGSim}} = \sum_{t_{i} \in G}\max_{\widehat{t}_{j} \in \widehat{G}} M_{\mathrm{sim}}(t_{i},\widehat{t}_{j}). \tag{15}\end{equation*}

C. Graph Similarity Score Calculation

To estimate the final similarity score between scene graphs, we first calculate the recall score as the ratio of the sum of the maximum similarity scores to the size of the ground-truth graph:
\begin{equation*} R_{\mathrm{SGSim}}=\frac{1}{\left|G\right|}\mathrm{Max}_{\mathrm{SGSim}}. \tag{16}\end{equation*}

Then, we calculate the precision score as the ratio of the sum of the maximum similarity scores to the size of the generated graph:
\begin{equation*} P_{\mathrm{SGSim}}=\frac{1}{|\widehat{G}|}\mathrm{Max}_{\mathrm{SGSim}}. \tag{17}\end{equation*}

Lastly, the harmonic mean of $R_{\mathrm{SGSim}}$ and $P_{\mathrm{SGSim}}$ is calculated as the F-score:
\begin{equation*} F_{\mathrm{SGSim}}=2\,\frac{P_{\mathrm{SGSim}} R_{\mathrm{SGSim}}}{P_{\mathrm{SGSim}} + R_{\mathrm{SGSim}}}. \tag{18}\end{equation*}

We demonstrate in Algorithm 2 the calculation process for all triplets of a summarized scene graph and a ground-truth scene graph.

Algorithm 2: Graph Similarity
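The sketch below implements the graph-similarity calculation of Eqs. (15)-(18) over triplet lists; the exact-match stand-in similarity used in the demo is an assumption, and in practice it would be replaced by the embedding-based triplet similarity above.

```python
def graph_similarity(gt_triplets, gen_triplets, sim_fn):
    """Eqs. (15)-(18): greedy max matching of triplet similarities, then P/R/F-score.
    sim_fn(t, t_hat) returns the similarity of a ground-truth and a generated triplet."""
    if not gt_triplets or not gen_triplets:
        return 0.0, 0.0, 0.0
    max_sum = sum(max(sim_fn(t, t_hat) for t_hat in gen_triplets)
                  for t in gt_triplets)                       # Eq. (15)
    recall = max_sum / len(gt_triplets)                       # Eq. (16)
    precision = max_sum / len(gen_triplets)                   # Eq. (17)
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0  # Eq. (18)
    return precision, recall, f_score

exact = lambda t, t_hat: 1.0 if t == t_hat else 0.0           # stand-in similarity
print(graph_similarity([("cow", "on", "hill"), ("sheep", "on", "hill")],
                       [("cow", "on", "hill")], exact))       # (1.0, 0.5, 0.666...)
```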

SECTION V.

Experiments

A. Dataset

Due to the lack of image summarization datasets, we adapt two datasets for the experiments: an image captioning dataset, MS-COCO [24], and a visual scene-graph dataset, Visual Genome [33]. For the training process and the preliminary evaluation, we use the VG200 dataset [46], which is based on the Visual Genome dataset, consists of 50 relationships, and is balanced in category frequency. It contains 101,174 images from the MS-COCO dataset. To experiment on the scene-graph image-collection summarization task, we build an annotated testing set of image collections by grouping images of the MS-COCO testing set using VSE++ [56], an image retrieval method that estimates the similarity between image contexts and image captions. Following the Karpathy split [57] of the MS-COCO dataset, the initial testing set was selected from the 5,000 images of the MS-COCO testing set. Then, for each image, annotated with 5 captions, we retrieved 5 images to build a collection, so that our testing set contains 5,000 collections with 6 images each. Lastly, we build the ground truth of each image collection for the evaluation process in scene-graph form.

Because image summarization aims to generate summarized information that describes the overall context of an image collection, and because ground truth is limited for the proposed task, we use Neural Motif [35], pre-trained on the VG200 dataset and selected by Scene-Graph Detection Recall (SGDet R@100), to generate a scene graph of each image in a collection for evaluation. We then consider these as the ground truth of each collection for evaluating the proposed method, so that each collection has 6 ground-truth scene graphs.

B. Training Strategy

Given the lack of ground truth in scene-graph summarization datasets, we first train and evaluate scene-graph generation on single images from the VG200 dataset. In the training phase, we train the model following the VG200 dataset, where the numbers of labels and predicates are 150 and 50, respectively. The learning rate is initialized to 0.12. We use Adam [58] for optimization and cross-entropy loss as the loss function. To pre-evaluate the model for multiple-image scene-graph summarization, we observe SGDet recall to select the best checkpoint for the proposed method.

C. Evaluation

As the proposed method is modified from a single-image scene-graph generation approach, we evaluate it in two aspects. First, Multiple-Images Scene-Graph Summarization evaluates the proposed method for image-collection scene-graph summarization. Second, Single-Image Scene-Graph Generation evaluates the proposed method to confirm that it remains viable for single-image scene-graph generation. In addition, we benchmark the evaluation process in Benchmark for the Evaluation Process to show its validity for scene-graph generation.

1) Multiple-Images Scene-Graph Summarization

For multiple-images scene-graph summarization, we evaluate the proposed method for image-collection scene-graph summarization on the MS-COCO dataset. Due to the lack of ground truth, we follow the common practice in the evaluation of scene graph generation from three perspectives: “Coverage” [28], [51], “Diversity” [52], [53], and “Similarity” [49], [50]. For the Coverage evaluation, we follow graph theory to estimate the coverage of a generated scene graph with respect to the ground-truth scene graphs. For the Diversity evaluation, we implement two evaluation processes comprising graph diversity and Graph Edit Distance (GED) [59]. For the Similarity evaluation, we adopt a simple contrastive learning framework for connecting scene-graphs and images (GICON) [60], an evaluation technique that learns the similarity between an image and a scene graph with or without bounding boxes. Since the proposed method focuses on image collection summarization, we evaluate the proposed method only without bounding boxes. Lastly, we employ the evaluation process proposed in Section IV, which evaluates the similarity of a summarized scene graph to the ground truth by SGSim.
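As a rough illustration of the graph-based metrics, the sketch below computes a simple node-coverage ratio and the Graph Edit Distance with networkx; the coverage definition here is a simplified stand-in for the graph-theoretic measure of [30], and exact GED is only feasible for small graphs.

```python
import networkx as nx

def to_graph(triplets):
    """Build a directed graph from (subject, predicate, object) triplets."""
    G = nx.DiGraph()
    for s, p, o in triplets:
        G.add_edge(s, o, predicate=p)
    return G

def node_coverage(summary, ground_truths):
    """Fraction of ground-truth nodes that also appear in the summarized graph."""
    gt_nodes = set().union(*(g.nodes for g in ground_truths))
    return len(gt_nodes & set(summary.nodes)) / len(gt_nodes)

summary = to_graph([("cow", "on", "hill"), ("sheep", "on", "hill")])
references = [to_graph([("cow", "on", "hill")]), to_graph([("sheep", "near", "street")])]
print(node_coverage(summary, references))        # 0.75
# Exact GED between the summary and one reference graph (exponential in graph size,
# so only practical for small graphs; networkx also provides approximate variants).
print(nx.graph_edit_distance(summary, references[1]))
```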

2) Single-Image Scene-Graph Generation

For single-image scene-graph generation, we evaluate the performance on VG200 compared with the baselines to ensure that the proposed method still achieves good results. We use three scene graph evaluation metrics: Scene Graph Classification Recall (SGCls Recall), which measures subjects, objects, and predicates using ground-truth bounding boxes; Predicate Classification Recall (PredCls Recall), which measures relationship prediction using ground-truth bounding boxes, subjects, and objects; and Scene Graph Detection Recall (SGDet Recall), which measures the prediction of subjects, objects, and predicates without using the ground truth.

3) Benchmark for the Evaluation Process

Here, we discuss the evaluation metrics used to benchmark the proposed evaluation process. As it is proposed for evaluating scene-graph generation, we benchmark it on single-image scene-graph generation with the VG200 dataset by comparing it with other scene-graph generation baselines. As we aim to additionally account for false negatives, we compare against two evaluation metrics. First, Scene Graph Detection Recall (SGDet Recall) is a popular scene graph evaluation metric. Second, GICON is an evaluation metric that learns the similarity between a generated scene graph and an image, with or without bounding boxes.

D. Baselines

As discussed in the previous section, the evaluation is divided into three tasks; multiple-images scene-graph summarization, single-image scene-graph generation, and the ablation study on the evaluation process. In this section, we introduce baseline methods corresponding to each of them.

1) Baseline for Multiple-Images Scene-Graph Summarization

To evaluate the proposed method in the multiple-images scene-graph summarization setting, we choose three baseline methods: Semantic Image Summarization (SImS) [2], Image Collection Captioning (ICC) [8], [9], and $k$-Medoids [17]. SImS is a scene graph summarization method on the MS-COCO dataset that finds frequent sub-graphs. ICC is a scene graph summarization method previously proposed by us for generating a caption based on graph theory. $k$-Medoids is a clustering method whose summarization implementation is the same as in SImS [2]. All of these baselines are evaluated on the testing set of the MS-COCO dataset, which consists of 6 images per image collection.

2) Baseline for Single-Image Scene-Graph Generation

To evaluate the proposed method for single-image scene graph generation, we choose four baseline methods: Iterative Message Passing (IMP) [41], which uses standard Recurrent Neural Networks (RNNs) with a message-passing process; Neural Motif [35], which is based on the Stacked Motifs architecture; Transformer [42], which is based on causal inference; and Visual Context Tree (VCTree) [40], which takes advantage of structured object representations. All of the baseline models are trained on the VG200 dataset. Then, we select the best checkpoint on SGCls Recall, PredCls Recall, and SGDet Recall for the comparison with the proposed method on the VG200 dataset.

3) Baseline for Ablation Study on the Evaluation Process

For the ablation study on the evaluation method, we benchmark the proposed evaluation process against other evaluation metrics. We choose four state-of-the-art scene-graph generation methods on the Visual Genome dataset: Neural Motif, Transformer, Relationship Detection Network (RelDN) [61], and Relation Transformer (RelTR) [43].

E. Results

We report the results of the proposed method in the three evaluation tasks; multiple-images scene-graph summarization, single-image scene-graph generation, and the ablation study on the evaluation process.

1) Multiple-Images Scene-Graph Summarization

In this section, we discuss the results of the proposed method for the image collection summarization task. For comparison, we select the top-10 scores in three aspects: Coverage, Diversity, and Similarity. The results are shown in Table 1.

TABLE 1. Evaluation of an image collection summarization compared with SImS [2], $k$-Medoids [17], ICC [9], and the proposed method by estimating Coverage [30], Diversity [30], GED [59], GICON [60], and the proposed evaluation process (SGSim). Results in bold indicate the highest scores whereas those underlined indicate the second highest scores.

a: Coverage

For the Coverage evaluation, the coverage of objects and subjects (nodes) and predicates (edges) is evaluated based on graph theory [30]. The results in Table 1 show that $k$-Medoids achieves the best score in generating a summarized scene graph, whereas the proposed method achieves second place.

b: Diversity

For the Diversity evaluation, we use two metrics: Diversity and GED. Diversity refers to the similarity distance between a summarized scene graph and a ground-truth scene graph. The results in Table 1 show that the proposed method achieves the best scores in both Diversity and GED, whereas SImS achieves second place for Diversity and $k$-Medoids achieves second place for GED.

c: Similarity

For the Similarity evaluation, Table 1 shows that the proposed method achieves the best score compared to the other methods on GICON, which estimates the similarity between a scene graph and images, and on the proposed evaluation process, SGSim, which evaluates scene-graph contents. Meanwhile, $k$-Medoids achieves second place on both GICON and SGSim.

d: Qualitative Results

Qualitative results are shown in Fig. 5. They demonstrate that the proposed method performs well in finding relationships and estimating commonly occurring information. For example, the first example shows how the proposed method can find all common object information, such as sheep and cow, and further estimate the commonsense relationship between them based on the location, hill. In contrast, SImS and $k$-Medoids generate a summarized scene graph based on the most frequent object, cow, neglecting the other common object, sheep. ICC can generate information about sheep but cannot infer common relationships between sheep and cow. The second example shows how the proposed method can handle the overall location context of an image collection, while SImS, $k$-Medoids, and ICC lose some of the overall location information in their results. The third example shows the performance in finding summarized information about bus and connecting its common environmental characteristics street, building, and people. In contrast, SImS, $k$-Medoids, and ICC fail to include the object people.

FIGURE 5. Comparison of the proposed method with baseline methods in three examples. (A) Proposed shows a summarized scene graph generated by the proposed method. (B) SImS shows that generated by Semantic Image Collection Summarization [2]. (C) $k$-Medoids shows that generated by the clustering technique [17]. (D) ICC shows that generated by our previous work [8], [9] using graph theory.

From the overall evaluation scores, the proposed method achieves better scores from the Diversity and Similarity perspectives, whereas $k$-Medoids achieves the best score in Coverage. Most $k$-Medoids scores reached second place, except for Diversity, where SImS [2] was second. Meanwhile, the qualitative results show that finding the common context is beneficial for summarization tasks such as photo album summarization.

e: Limitations

There are two main limitations of the proposed method. First, the relationships are not grounded in visual information but rather built from the commonsense knowledge graph. As such, generated summaries might not be fully related to the actual image collection. Second, the method might not scale well to large image collections, as it aims to estimate all possible relationships among all images. This can result in high memory requirements for summarizing large image collections.

2) Single-Image Scene-Graph Generation

The single-image scene-graph generation results are shown in Table 2. Since the objective of the proposed method mainly concerns Scene Graph Detection (SGDet), we focus on this result when assessing the single-image scene-graph generation task. The SGDet R@100 results show that Neural Motif [35] and Transformer [42] achieve better results than the proposed method, while the proposed method achieves better results than IMP [41] and VCTree [40]. In contrast, the results at R@20 and R@50 show that the proposed method achieves better results only compared with IMP [41]. As the proposed method aims to enhance relation prediction toward unseen relationships, it is not restricted to the ground truth of single-image scene-graph generation, as the results show. As the proposed method outperforms IMP in the SGDet evaluation, RelDN in SGCls, and IMP and RelDN in PredCls, it remains viable for single-image scene-graph generation even though it does not surpass the other scene-graph generation baselines.

TABLE 2. Evaluation of single-image scene-graph generation on the VG200 dataset compared with baseline methods; IMP [41], Neural Motif [35], Transformer [42], VCTree [40], RelDN [61], and the proposed method, observing recall scores of SGDet, SGCls, and PredCls. Results in bold indicate the highest scores whereas those underlined indicate the second highest scores.

However, since the proposed method targets multiple-images summarization, this out-of-task evaluation was purely performed to understand the limitations of this approach.

3) Ablation Study on the Evaluation Process

We benchmark our evaluation process against existing graph-oriented (graph structure) and graph-similarity-oriented (similarity of vertices and edges) evaluation methods. In the benchmark process, we construct a benchmark for single-image scene-graph generation on the VG200 dataset and analyze four models: Neural Motif, Transformer, VCTree, and RelTR. For the graph-oriented evaluation, we use Scene Graph Detection Recall (R@20, R@50, R@100) as the metric. For the graph-similarity-oriented evaluation, we use GICON, a learnable graph similarity metric, for evaluation with bounding boxes (W/ Bounding Box) and without bounding boxes (Location Free). For each evaluation, we take the top-$k$ triplets, with $k$ set to 10, 30, and 50. In the triplet selection, we rank triplets by their relationship scores to obtain the top-$k$ triplets for the benchmark, as sketched below.
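The top-$k$ triplet selection can be sketched as follows; the tuple layout of the scored triplets is an assumption for illustration.

```python
def top_k_triplets(scored_triplets, k):
    """Keep the k triplets with the highest relationship confidence.
    scored_triplets: list of ((subject, predicate, object), score)."""
    ranked = sorted(scored_triplets, key=lambda item: item[1], reverse=True)
    return [triplet for triplet, _ in ranked[:k]]

scored = [(("cow", "on", "hill"), 0.9), (("sheep", "near", "street"), 0.4),
          (("sheep", "on", "hill"), 0.7)]
print(top_k_triplets(scored, 2))  # [('cow', 'on', 'hill'), ('sheep', 'on', 'hill')]
```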

The benchmark results in Table 3 show that scores generally rise with the number of retrieved triplets and their confidence scores. However, a higher number of triplets does not always increase the similarity in the evaluation process, as shown for Transformer and RelTR. Since RelTR infers a fixed-size set of triplets, even when we increase the number from 30 to 50 triplets, its accuracy is not significantly improved. Meanwhile, Transformer shows little improvement when increasing the number of retrieved triplets, whereas the other methods show significant improvement. By contrast, the other evaluation metrics, GICON and SGDet, focus on evaluating precision and recall, so a high number of retrieved triplets tends to result in high scores.

TABLE 3. Benchmark of the evaluation methodology compared with SGDet (R@20, R@50, and R@100) and GICON [60] for both the location-free and with-bounding-box settings. For SGSim, $k$ is the number of triplets used in calculating the similarity score. Results in bold indicate the highest scores whereas those underlined indicate the second highest scores.

SECTION VI.

Conclusion

We introduced a scene-graph summarization method that enhances the relation predictor in the training process for an image collection by incorporating external knowledge. The results show that the proposed method can generate a summarized scene graph that performs well from the diversity and similarity perspectives compared with other baseline methods, while it still lacks accuracy in terms of coverage. Additionally, the experimental results show the advantage of using external knowledge in grasping the overall context of an image collection and finding the common relationships across images, which is beneficial for summarization tasks, especially photo album summarization. However, a limitation is the lack of actual ground truth in the evaluation process. In the future, we plan to build a more suitable dataset for the image-collection scene-graph summarization task.

ACKNOWLEDGMENT

The computation was carried out using the General Projects on the supercomputer “Flow” at Information Technology Center, Nagoya University.
