Towards Open-Set Scene Graph Generation with Unknown Objects

Scene graph generation (SGG) aims to detect objects and their relationships in an image, thereby enabling a detailed understanding of a complex scene for various real-world applications. In SGG applications such as robot vision, it is important to correctly detect all objects without recognizing any object as something different or ignoring it. However, previous studies on SGG do not consider unknown objects whose classes are unseen in training. Consequently, current SGG methods wrongly classify them as known object classes or overlook them. In this paper, we propose a new problem named “open-set SGG” with unknown objects, focusing on detecting even unknown objects and their relationships. Specifically, we formally define this new problem and propose an evaluation protocol, including an extended dataset with unknown objects and novel evaluation metrics designed for the open-set setting. We also build baseline methods by employing and extending existing SGG methods and compare them through experiments to establish the current baseline performance of open-set SGG. Finally, we discuss the limitations of the current SGG methodology in the open-set setting and point out future research directions.


I. INTRODUCTION
Scene graphs are detailed descriptions of scenes by graphs consisting of objects as vertices and their relationships as edges [1]. Recently, the prediction of such a graph from an image [2] has been studied, which enables automated scene graph generation (SGG) from an image captured in the real world and thereby facilitates a detailed understanding of complex scenes. SGG has a wide range of applications such as image retrieval [1], visual question answering [3], human-robot interaction [4], and robot navigation [5].
For obtaining the complex mapping from an image to its scene graph, SGG methods rely on deep learning driven by big data, which uses a deep neural network as a prediction model and estimates its parameters from a large number of training images with ground-truth scene graphs. However, due to the difficulty of high-quality manual annotation [6], existing SGG datasets are limited in terms of the variety of the object classes with usable labels for training [2]. The difficulty results in the presence of unknown objects, which are absent in the training data. If such an object is present in the testing phase, the current SGG methods either classify it into one of the known object classes or completely fail to detect it by treating it as a part of the background, as shown in Fig. 1a. In practice, misclassified or overlooked objects lead to incorrect scene understanding and cause serious problems in applications; for example, if a robot recognizes an object as something different, it may take an inappropriate action on the object, or if it is unaware of the existence of the object, it may even exhibit a dangerous behavior. Although such a problem setting of handling unseen and untargeted classes has been referred to as open-set [7] and addressed in tasks such as image recognition and object detection, it has never been tackled in the literature of SGG. It involves the detection of relationships as well as objects.
In this paper, we address the problem of open-set SGG with unknown objects. To the best of our knowledge, this is the first study on such a problem. This task predicts a scene graph where unknown objects are correctly localized and classified as "unknown," rather than classifying them as one of the known classes or ignoring them as background, as shown in Fig. 1b. In addition, this enables the detection of relationships involving unknown objects, which previous studies have completely ignored. Specifically, we first provide a formal definition of the open-set SGG problem. Then, we propose an evaluation protocol, including a scene graph dataset with unknown objects. We construct the dataset from an existing large-scale dataset by defining unknown object classes and splitting a sufficient number of training images without unknown objects. Also, we propose novel evaluation metrics to quantitatively measure the open-set performance of SGG, focusing on the effect of unknown objects in both object and relationship detection. We demonstrate the current performance of open-set SGG by comparing the baseline methods through extensive experiments based on the proposed evaluation protocol (Section IV-C). We also discuss the limitations of the current SGG methodology and point out future research directions (Section V). Finally, we will make our implementation (Section IV-B) for the dataset preparation, baseline methods, and experiments publicly available upon publication as a benchmark of open-set SGG to facilitate future research.

II. RELATED WORK
A. SCENE GRAPH GENERATION
Scene graphs provide a more detailed description of scenes than image recognition (image-wise object classification) and object detection (localization and region-wise classification), detecting not only individual objects but also their relationships [1]. First, we recall the definition of closed-set SGG, i.e., the previous problem setting that considers known objects only. Let $K$ be a set of known object classes. Given an RGB image $I \in \mathbb{R}^{W \times H \times 3}$, where $W, H \in \mathbb{N}_+$ are its width and height, respectively, object detection, which is a subproblem of SGG, aims to localize and classify each $i$-th object by predicting its bounding box $b_i = (x_i, y_i, w_i, h_i)$, whose elements are the horizontal and vertical center locations, width, and height of the bounding box, respectively, and its object class $o_i \in K$. SGG further detects each $k$-th relationship for object pair $(i_k, j_k)$ by predicting relationship class $r_k \in C$, where $C$ is the set of relationship classes. The goal of closed-set SGG is to build a model that can predict these bounding boxes and classes for all objects and relationships in the given image, i.e., a mapping from $I$ to label $T = (\{(b_i, o_i)\}_{i=1}^{n}, \{(i_k, j_k, r_k)\}_{k=1}^{m})$, where $n, m \in \mathbb{N}_+$ are the numbers of objects and relationships, respectively. This is typically achieved by data-driven learning using pairs of images and ground-truth labels as training data.
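The label T defined above can be pictured as a small data structure. The following Python sketch is only an illustration under our own naming assumptions (the classes `SceneObject`, `Relationship`, and `SceneGraph` and the method `is_closed_set_valid` are hypothetical and do not come from any SGG library): each object carries a bounding box in center format and a class, and each relationship refers to its two objects by index.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SceneObject:
    # Bounding box as (center x, center y, width, height).
    box: Tuple[float, float, float, float]
    # Object class label; in closed-set SGG it must be a known class.
    cls: str


@dataclass
class Relationship:
    subj: int  # index of the first object of the pair
    obj: int   # index of the second object of the pair
    cls: str   # relationship class label


@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def is_closed_set_valid(self, known: Set[str], rel_classes: Set[str]) -> bool:
        """Check the closed-set assumption: every object class is known and
        every relationship refers to existing objects and a valid class."""
        n = len(self.objects)
        objects_ok = all(o.cls in known for o in self.objects)
        rels_ok = all(
            0 <= r.subj < n and 0 <= r.obj < n and r.cls in rel_classes
            for r in self.relationships
        )
        return objects_ok and rels_ok


graph = SceneGraph(
    objects=[SceneObject((10, 10, 8, 8), "man"), SceneObject((30, 12, 12, 10), "horse")],
    relationships=[Relationship(0, 1, "riding")],
)
```

A graph containing an object whose class lies outside the known set fails this check, which is exactly the situation that the open-set setting is meant to handle.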
The most widely-used SGG dataset is Visual Genome (VG) [6], which is a large-scale dataset consisting of images from object detection datasets such as MS COCO [8] and labels made by crowdsourcing-based annotation. The majority of SGG studies [9]-[14] also employ the preprocessing proposed for the early SGG method named iterative message passing (IMP) [2], which removes noisy labels in VG and then randomly splits images into training and testing data.
As large-scale datasets such as VG have become available, many SGG methods have employed the modern deep-learning approach, and various models have been proposed [2], [9]-[12], [15]-[18]. These models make full use of the continuously evolving methodology of deep neural networks consisting of various components, e.g., convolution [17], graph convolution [10], long short-term memory [9], and transformers [18], resulting in quite different network architectures among models. Since the selection of the best model depends on the types of targeted scenes and individual applications, we do not aim to build a specific model for open-set SGG in this paper.
Apart from models, various SGG techniques have been proposed, e.g., losses [11], [14] and learning strategies [13], [19], aiming at improved performance regardless of model. These topics are orthogonal to our open-set SGG, whose focus is on dealing with unknown objects rather than improving closed-set performance. Applying these techniques to the open-set setting is out of the scope of this paper.
While we consider the open-set setting of SGG for the first time, previous SGG studies addressed related topics called few-shot and zero-shot learning [20], [21]. These settings in the context of SGG are different from our open-set SGG since they aim to detect relationships that involve rarely-seen or unseen combinations of seen object classes that are present in training data, e.g., predicting the "stand on" relationship between the "elephant" and "street" objects when all these classes appear in the training data but their combination does not [22]. Instead, we deal with unseen object classes themselves in our open-set SGG setting. We naturally handle unseen class combinations in this setting since any class combinations involving unknown classes are necessarily unseen in training. Meanwhile, we do not separately consider the previously-addressed case, i.e., unseen combinations of seen classes only, since it requires a specialized train-test data split. Although there is a recent study [23] that claims to address "open-set SGG", its problem setting is closer to zero-shot learning in the non-SGG literature, e.g., image recognition and object detection [24], [25]. It attempts to classify individual unseen classes by associating them with seen classes using external knowledge such as language information. In contrast, our problem setting of open-set SGG is consistent with those of open-set recognition and detection described in Section II-B, i.e., we do not distinguish among unknown objects but aim to separate them from known classes (technically by assigning their instances to a special single "unknown" class) without the need for additional information.
SGG has been further extended using additional data, e.g., language information such as captions [22], [26], temporal information from videos [27]-[29], and 3D spatial information from depth images or point clouds [4], [5], [30]-[33]. Although the ability to handle unknown objects is also important in these augmented problem settings, we focus on the open-set generalization of the standard single-image SGG problem, leaving these advanced topics for future research.

B. OPEN-SET OBJECT DETECTION
Open-set image recognition [34] is a relatively new research topic that aims to deal with unknown classes in image-wise classification [7]. It typically consists of a conventional closed-set recognition part and an unknown detection part, and the unknown detection part is technically similar to anomaly detection [35] and novelty detection [36] in rejecting unknown classes, although open-set recognition also classifies known objects in the closed-set recognition part. While early studies employed traditional learning techniques such as the support vector machine [37], [38], motivated by recent advancements in closed-set recognition using deep learning, deep neural networks have become popular in open-set recognition [39], [40]. Recently, the open-set methodology of image-wise classification has been extended to region-wise classification after localization, thereby initiating the problem of open-set object detection [41]-[44].
The main difference between the proposed open-set SGG problem and open-set object detection is that SGG classifies all objects and their relationships simultaneously, considering their contextual dependencies. Thus, we present novel experimental results for relationship-aware open-set object detection and relationship detection, neither of which has been evaluated by previous studies. In addition, unlike a recent evaluation study [43] on open-set detection, we compare several baseline methods, including unknown-aware extensions of existing methods. Although more sophisticated unknown detection techniques have been proposed for open-set recognition [39], [40] and object detection [44], we leave integration of such advanced techniques with state-of-the-art SGG methods as a future research topic.
Another important difference from object detection is that SGG needs a specialized dataset with ground-truth relationship labels for training and testing. Thus, we cannot reuse the open-set detection datasets with unknown objects used in the previous studies, nor follow their dataset construction scheme [43], [44], which relied on the availability of multiple large-scale datasets with mutually exclusive class definitions. Instead, we propose a frequency-based class selection scheme for defining unknown classes, which enables us to split training images without unknown classes while maintaining sufficient training data as part of our novel evaluation protocol for open-set SGG.
Open-set problems have been further extended to open-world problems [44]-[46], where unknown classes incrementally turn into new known classes. Extending open-set SGG to the open-world setting is an interesting but advanced topic and is thus out of the scope of this work.
The differences between the proposed open-set SGG and both closed-set SGG and open-set object detection are summarized in Table 1, which highlights the novelty of this work.

III. OPEN-SET SCENE GRAPH GENERATION
A. PROBLEM FORMULATION
In closed-set SGG, if the assumption $o_i \in K$ is violated, i.e., if an object that does not belong to any known class in $K$ is present in an image $I$, either (1) the model will classify it into one of the known classes, or (2) the model will treat it as background and not detect it as an object. This has not been regarded as a failure in previous studies. On the other hand, in this study, we consider that the prediction for an unknown object has failed if (1) all predicted objects overlapping with it are classified into known classes or if (2) no predicted objects overlap with it. Here, we assume that the ground-truth label of the unknown object is available. Such a failure in object detection also has a negative impact on relationship detection since the prediction of relationship classes is typically conditioned on predicted object classes [9].
Now, we provide the formal definition of open-set SGG by extending that of closed-set SGG. In open-set SGG, we also have a set of unknown object classes $U$, which is disjoint from the known classes, i.e., $K \cap U = \emptyset$, and each object class may be either known or unknown, i.e., $o_i \in K \cup U$. This is the essential difference from the closed-set setting. By the definition of the open-set recognition setting [7], no object of the unknown classes can appear in training images, i.e., the model cannot see objects of the unknown classes in training and thus cannot learn how to classify objects into these classes. Hence, we do not include the individual unknown classes in the target classes of object classification and aim to assign objects of any unknown classes to the special single class "unknown" in testing. This class assignment is typically achieved by introducing a training-free unknown detection mechanism to the model, which is only enabled in testing.
Moreover, in open-set SGG, the object pair of each relationship consists of two known objects (as in closed-set SGG), one known object and one unknown object, or two unknown objects. This new problem formulation allows us to tackle the issues of closed-set SGG in the presence of unknown objects, i.e., wrong classification and failure in detection.

B. DATASET
In order to bypass the difficulty of large-scale annotation from scratch and to naturally extend the previous methodology of closed-set SGG such as noise-label removal and known-class selection to the open-set setting, we make full use of the existing data by employing the combination of the VG dataset and the IMP preprocessing, introducing unknown objects to it. Specifically, we alter the IMP preprocessing to select unknown object classes and then extract images without unknown objects for training, which is a requirement of the open-set setting. To avoid using noisy class labels, we first discard low-frequency classes by selecting the most frequent 1,500 object classes in terms of the number of objects in VG (after removing small or overlapping objects in the original IMP preprocessing). This results in at least 100 objects per class. Among them, we use the same 150 object classes as in IMP as known object classes to facilitate comparison with the previous closed-set setting, and we select unknown classes from the other 1,350 classes.
The issue here is that an image must belong to testing data if it contains any objects belonging to unknown classes; otherwise, a model would learn to treat unknown objects as background since they have no bounding box labels in training data. Such violation of the open-set assumption leads to low performance for unknown objects in testing. However, a random selection of unknown classes may assign most images in the dataset to testing, leading to unsuccessful training due to insufficient data. Indeed, as shown in Fig. 2, the number of testing images rapidly grows as we randomly add classes to the unknown set, leaving almost no training images. Note that previous studies on open-set object detection did not face this issue, since they could combine multiple datasets with mutually exclusive class definitions to ensure that images from one dataset do not contain the classes from the others [43], [44], while we do not have other large-scale datasets like VG to be combined.
To overcome this issue, we propose to select unknown classes from low-frequency classes, which results in almost linear correspondence between the number of unknown object classes and that of testing images, as shown in Fig. 3. We name this the low-frequency-first unknown-class selection scheme. This enables us to easily control the ratio between the numbers of training and testing images by that of known and unknown classes. Here, we approximate the ratio of the original IMP split (the image splitting scheme of the VG dataset employed by the IMP preprocessing), i.e., 7:3 between training and testing, by selecting 30% of the lowest frequency classes for 406 unknown classes from the 1,350 classes.
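The low-frequency-first selection and the resulting image split can be sketched in a few lines of Python. This is a minimal illustration rather than the actual preprocessing code: the function names, the dictionary-based inputs, and the tie-breaking by class name are assumptions made for the example, and the number of unknown classes is passed in directly instead of being derived from a ratio. The sketch also assumes that every image without unknown objects goes to training.

```python
def select_unknown_classes(object_counts, known_classes, n_unknown):
    """Pick the n_unknown least frequent classes among the non-known candidates.

    object_counts: dict mapping class name -> number of objects in the dataset.
    """
    candidates = [c for c in object_counts if c not in known_classes]
    # Sort by frequency (ascending); ties are broken by name for determinism.
    candidates.sort(key=lambda c: (object_counts[c], c))
    return set(candidates[:n_unknown])


def split_images(image_classes, unknown_classes):
    """Send every image that contains an unknown object to the testing data.

    image_classes: dict mapping image id -> iterable of object class names.
    """
    train, test = [], []
    for image_id, classes in image_classes.items():
        if unknown_classes & set(classes):
            test.append(image_id)
        else:
            train.append(image_id)
    return train, test


counts = {"person": 1000, "tree": 800, "kite": 5, "lamp": 20, "vase": 50}
unknown = select_unknown_classes(counts, known_classes={"person", "tree"}, n_unknown=2)
images = {1: ["person", "tree"], 2: ["person", "kite"], 3: ["tree", "vase"], 4: ["lamp"]}
train_ids, test_ids = split_images(images, unknown)
```

Because rare classes appear in few images, each additional unknown class moves only a small number of images to the testing side, which is what keeps the training set large.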
After defining known and unknown classes, we remove the objects whose classes are neither known nor unknown by dropping their bounding boxes and also remove their relationships from each image. Then, we also remove the images that consequently have no objects or relationships, which would not contribute much to SGG training, following previous studies [10], [13]. Other parts of the IMP preprocessing, e.g., removing invalid images in VG and selecting 50 relationship classes, are unchanged.

C. METRICS
1) Closed-/open-set recalls
For quantitative SGG evaluation, recall-based metrics are often used. They count correctly-detected ground-truth relationships in each image. Specifically, we use the following types of commonly-used recall-based metrics that perform prediction differently [22]:
• SGCls (scene graph classification) is a recall of the prediction of object classes and relationship classes given ground-truth bounding boxes.
• SGDet (scene graph detection) is a recall of the prediction of bounding boxes along with the classes without using any ground-truth labels of bounding boxes or classes.
We simply refer to the collection of metrics of these two types as recalls in this paper. Note that we do not use PredCls (predicate classification), another common SGG metric, since it needs the support of ground-truth object class labels in relationship detection, which is generally nontrivial for unknown classes and requires model-dependent modifications (e.g., when class-wise embeddings are needed [9], [12]). Also, note that, following the previous SGG studies [2], we do not use precision metrics for SGG evaluation. This is because they may penalize the detection of unlabeled objects and relationships in VG, whose annotation is incomplete due to the limitation of crowdsourcing, and thus yield uninterpretable metric values.
To compute a specific recall metric, we count each ground-truth relationship that is correctly localized and classified. That is, it has at least one predicted relationship whose two corresponding bounding boxes overlap with the respective ground-truth boxes with intersection over union (IoU) over 0.5 and whose two object classes and one relationship class match the ground-truth classes. Note that in the case of SGCls, all predicted relationships are estimated using the ground-truth bounding boxes, and thus all ground-truth relationships are always correctly localized. In addition, we only consider top-K predicted relationships sorted by the product of the classification scores corresponding to the three classes in each relationship and denote each recall-based metric by its type suffixed with the K value, e.g., SGDet@100 when K = 100.
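The matching rule above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the evaluation code used in the experiments: the tuple layouts, the function names `iou` and `recall_at_k`, and the center-format boxes are choices made for this example, and the strict comparison follows the wording "IoU over 0.5" (some implementations use a non-strict comparison instead).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(gt_rels, pred_rels, k, iou_thresh=0.5):
    """Fraction of ground-truth relationships matched by a top-k prediction.

    gt_rels:   list of (subj_box, subj_cls, obj_box, obj_cls, rel_cls)
    pred_rels: list of the same fields followed by a score, assumed to be the
               product of the three classification scores.
    """
    top = sorted(pred_rels, key=lambda p: p[-1], reverse=True)[:k]
    hits = 0
    for s_box, s_cls, o_box, o_cls, r_cls in gt_rels:
        for ps_box, ps_cls, po_box, po_cls, pr_cls, _ in top:
            classes_match = (ps_cls, po_cls, pr_cls) == (s_cls, o_cls, r_cls)
            if classes_match and iou(ps_box, s_box) > iou_thresh and iou(po_box, o_box) > iou_thresh:
                hits += 1
                break  # each ground-truth relationship is counted at most once
    return hits / len(gt_rels) if gt_rels else 0.0


gt = [
    ((10, 10, 10, 10), "man", (30, 10, 10, 10), "horse", "riding"),
    ((50, 50, 10, 10), "dog", (70, 50, 10, 10), "ball", "near"),
]
pred = [
    ((10, 10, 10, 10), "man", (30, 10, 10, 10), "horse", "riding", 0.9),
    ((50, 50, 10, 10), "cat", (70, 50, 10, 10), "ball", "near", 0.8),
    ((50, 50, 10, 10), "dog", (70, 50, 10, 10), "ball", "near", 0.1),
]
```

In this toy example the low-scoring third prediction is the only one that matches the second ground-truth relationship, so the recall depends on whether K is large enough to include it.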
In our evaluation protocol, we adapt each recall metric to the open-set setting and consider the following two versions:
• The closed-set version ignores ground-truth unknown objects and does not count the relationships involving them. This is equivalent to the recall-based metric used in previous closed-set SGG studies.
• The open-set version regards all unknown classes as the single "unknown" class and treats it in the same manner as individual known classes when matching ground-truth and predicted object classes.
By comparing these two versions, we can see the effect of unknown objects in SGG and highlight problems in previous evaluation protocols. Note that, while the closed-set recalls have been extensively used in previous studies on SGG, the open-set recalls are a novel collection of evaluation metrics proposed in this paper, which is designed specifically for the new problem of open-set SGG.
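The two versions differ only in how the ground truth is prepared before matching, which the following Python sketch illustrates. The function names and the simplified triple format (subject class, relationship class, object class) are assumptions made for this example; bounding boxes are omitted for brevity.

```python
UNKNOWN = "unknown"  # the single label shared by all unknown classes


def closed_set_ground_truth(gt_triples, known):
    """Closed-set version: drop relationships that involve an unknown object."""
    return [(s, r, o) for s, r, o in gt_triples if s in known and o in known]


def open_set_ground_truth(gt_triples, known):
    """Open-set version: keep all relationships and merge unknown classes."""
    def merge(cls):
        return cls if cls in known else UNKNOWN
    return [(merge(s), r, merge(o)) for s, r, o in gt_triples]


known_classes = {"man", "horse"}
triples = [("man", "riding", "horse"), ("man", "holding", "kite")]
closed_gt = closed_set_ground_truth(triples, known_classes)
open_gt = open_set_ground_truth(triples, known_classes)
```

With the same predictions, a recall computed against the open-set ground truth can only be matched on the second relationship if the model actually outputs the "unknown" label, which is why the open-set recall of a purely closed-set model tends to be lower.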

2) Open-set object/relationship counts
In addition to these metrics as a natural extension of previous closed-set SGG, we also propose recall-like metrics designed for detailed analysis of object and relationship detection in open-set SGG with unknown objects, inspired by previous studies in open-set object detection [41], [43], [44]. As in the case of the recall-based metrics, we count ground-truth objects or relationships in each image. For the first of the two metric collections that we propose, we count ground-truth objects while distinguishing whether each of them belongs to (0) known or (1) unknown classes (where we enumerate these cases using the number of unknown objects to be consistent with the relationship-counting metrics described below) and whether its prediction is
(a) correct: the ground-truth object is correctly localized and classified (possibly into the single "unknown" class) by a predicted object. Here, we define the correct localization and classification for objects in the same manner as object detection. That is, for correct localization, the predicted object must overlap with the ground-truth object with IoU over 0.5. For correct classification, the overlapping object, or if multiple overlapping predicted objects exist, at least one of them must have the same class as the ground-truth object.
(b) wrong: it is correctly localized by one or more predicted objects but not correctly classified by any of them.
(c) background: it is not correctly localized by any predicted objects.
We also consider only top-K predicted objects sorted by object classification scores. By considering all possible combinations of the two ground-truth categories (0/1) and the three prediction categories (correct/wrong/background), we obtain six scores in total. We denote each count-based metric by the combination of two categories, e.g., "0-correct" for known and correctly-classified objects and "1-background" for unknown and undetected objects.
Also, we call the proposed collection of these six count-based metrics open-set object counts (OSOC), which are suffixed by an actual K value, e.g., OSOC@100. We note that the number of unknown objects classified as known ("1-wrong") coincides with the absolute open-set error proposed for open-set object detection [41], which was also used in the recent study proposing a state-of-the-art open-set object detection method [44]. Meanwhile, we do not use precision-like metrics such as another open-set detection metric called wilderness ratio [43] since they are not suitable for the sparsely annotated VG dataset as described above.
Similarly, we count the number of ground-truth relationships by distinguishing the number of unknown objects that are involved in each of them, i.e., (0) zero, (1) one, or (2) two, and whether its prediction is (a) correct (correctly localized and classified), (b) wrong (correctly localized but not correctly classified), or (c) background (not correctly localized). Here, the definition of the correct localization and classification of relationships, as well as the top-K selection, are the same as SGDet. We call the collection of the resulting nine metrics open-set relationship counts (OSRC).
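The six object counts can be computed per image as in the following Python sketch. It is a hedged illustration rather than the evaluation code of the paper: the function names, the tuple layouts, and the center-format boxes are assumptions for this example, and the strict comparison again follows the wording "IoU over 0.5".

```python
from collections import Counter

UNKNOWN = "unknown"  # the single label shared by all unknown classes


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    inter_w = max(0.0, min(box_a[0] + box_a[2] / 2, box_b[0] + box_b[2] / 2)
                  - max(box_a[0] - box_a[2] / 2, box_b[0] - box_b[2] / 2))
    inter_h = max(0.0, min(box_a[1] + box_a[3] / 2, box_b[1] + box_b[3] / 2)
                  - max(box_a[1] - box_a[3] / 2, box_b[1] - box_b[3] / 2))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def open_set_object_counts(gt_objects, pred_objects, known, k=100, iou_thresh=0.5):
    """Count ground-truth objects per (known/unknown, correct/wrong/background).

    gt_objects:   list of (box, cls), where cls may be a known or unknown class.
    pred_objects: list of (box, cls, score), where cls is known or UNKNOWN.
    """
    top = sorted(pred_objects, key=lambda p: p[2], reverse=True)[:k]
    counts = Counter()
    for gt_box, gt_cls in gt_objects:
        is_unknown = gt_cls not in known
        target = UNKNOWN if is_unknown else gt_cls
        # Classes of all top-k predictions that localize this object correctly.
        overlapping = [p_cls for p_box, p_cls, _ in top if iou(gt_box, p_box) > iou_thresh]
        if not overlapping:
            outcome = "background"
        elif target in overlapping:
            outcome = "correct"
        else:
            outcome = "wrong"
        counts[f"{int(is_unknown)}-{outcome}"] += 1
    return counts


known_classes = {"man", "horse"}
gt = [((10, 10, 10, 10), "man"), ((30, 10, 10, 10), "kite"),
      ((50, 10, 10, 10), "lamp"), ((70, 10, 10, 10), "horse")]
pred = [((10, 10, 10, 10), "man", 0.9), ((30, 10, 10, 10), "horse", 0.8),
        ((50, 10, 10, 10), UNKNOWN, 0.7)]
osoc = open_set_object_counts(gt, pred, known_classes)
```

In this toy image, the kite is localized but labeled as a known class ("1-wrong"), the lamp is labeled as unknown ("1-correct"), and the horse is missed entirely ("0-background").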

A. BASELINE METHODS
To establish the baseline for the new problem of open-set SGG, we evaluated several different SGG methods in our experiments. More specifically, we compared the following representative models originally proposed for closed-set SGG:
• Freq [9] is a simple model that predicts relationship classes by using their frequencies given object classes. It takes the object classes of each pair predicted by object detection and returns the most probable relationship class given them by referring to the object-conditioned relationship-class distribution learned from training data. It is called a strong baseline [10] because it often achieves surprisingly high performance without using other information to classify relationships.
• IMP [2] is the model of one of the earliest SGG methods. It uses iterative message passing on image-wise graphs to predict both object and relationship classes in consideration of their context. Though it is relatively simple, its performance is reportedly comparable to more recent models [29].
• VCTree [12] is a recent model that uses dynamic tree structures to perform context-dependent message passing while considering hierarchical relationships of objects. This model has been employed in more recent studies that focus on SGG techniques other than models, including losses and learning strategies, to achieve state-of-the-art performance [13], [14].
For a fair comparison, we fixed the network architecture other than the relationship detection part of these models. In particular, for the object detection part, which precedes the relationship detection part and predicts bounding boxes and object classes, we employed Faster R-CNN [47] as in the majority of SGG studies [9], [13]. It has been known that the two-stage design of Faster R-CNN is advantageous for open-set object detection [43] since its first region-proposal stage relies only on class-agnostic objectness, thereby being able to localize unknown objects similar to the objects in the training data. The relationship detection part may further update the outputs of the object detection part, depending on models, and yields final outputs, i.e., bounding boxes, object classes, and relationship classes. The objects and relationships are ordered by classification scores for these classes.
Furthermore, we built an unknown-aware version of each previous method by introducing a simple technique for unknown object detection, thereby enabling the measurement of the baseline performance of unknown-aware SGG. This also enables detection of the relationships of unknown objects without further modification to models, exploiting the similarity of known and unknown objects. By noticing that classification scores represent the confidence of being known classes, we applied thresholding to the class-wise scores of each predicted object, and if the scores for all classes were below a threshold, we updated the object class to the single "unknown" class. Note that this thresholding was not applied in training since the open-set setting assumes that unknown objects are only present in testing. Also, note that similar thresholding techniques have been widely used in open-set image recognition and object detection [34], [37], [38], [40], [44], although they relied on more complicated strategies for score calculation, etc., which are out of the scope of this paper. We denote the new unknown-detecting version of each previous method with suffix "+", e.g., Freq+.
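The thresholding rule described above amounts to a few lines of post-processing on the class-wise scores. The Python sketch below is an illustration only: the function name `label_with_unknown` and the input layout are assumptions for this example, and the scores are assumed to cover the known (foreground) classes and to lie in [0, 1].

```python
UNKNOWN = "unknown"  # the single label assigned to rejected objects


def label_with_unknown(class_scores, class_names, threshold):
    """Assign each predicted object its best known class, or UNKNOWN.

    class_scores: one list of per-class scores for each predicted object.
    An object becomes UNKNOWN when the scores for all classes are below the
    threshold, i.e., when its maximum score is below the threshold.
    """
    labels = []
    for scores in class_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best] if scores[best] >= threshold else UNKNOWN)
    return labels


names = ["man", "horse", "tree"]
scores = [[0.90, 0.05, 0.05], [0.30, 0.40, 0.30], [0.10, 0.10, 0.80]]
labels = label_with_unknown(scores, names, threshold=0.5)
```

Since the rule only inspects the outputs of an already trained classifier, it requires no change to training and can be attached to any model that exposes per-class scores.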
Given the VG dataset with unknown objects described in Section III-B, we trained each model using the training data and evaluated it quantitatively by computing the metrics in Section III-C over the testing data. Here, we performed prediction by the two versions for each of the three models, thereby comparing six baseline methods. In addition, we performed qualitative evaluation by visualizing predicted scene graphs on several images.

B. IMPLEMENTATION
We employed the publicly-available implementation of a previous study on closed-set SGG [13], which supports multiple SGG models, including the above-mentioned ones, and several metrics such as SGCls and SGDet. Note that the unbiasing technique, which was the main focus of the previous study [13], was not used since it is orthogonal to our study. We modified the code of this implementation to support the thresholding-based unknown detection described in Section IV-A and the proposed open-set metrics in Section III-C. This implementation also depends on the VG dataset, which is publicly available. Additionally, to introduce unknown objects to the dataset as described in Section III-B, we modified the code of the implementation of IMP [2], whose preprocessing is also assumed by the main implementation [13]. We also modified the IMP code for the visualization mentioned in Section IV-A.
For the hyperparameters of the previous methods, we reused the default values in the implementation [13]. Meanwhile, we tuned the only hyperparameter introduced in this study, i.e., the threshold value for each unknown-detecting method, by optimizing open-set SGDet@100 (the closed-set version of which is used for validation for early stopping, etc. in the implementation [13]) over validation data, which were split from the training images by selecting additional unknown classes in the same manner as the testing data in Section III-B. Here, we set the ratio between the numbers of training and validation images to 6:1 and performed a grid search over the threshold values from 0.1 to 0.9 with stride 0.1, based on the fact that scores are bounded in [0, 1]. After validation, we retrained the method using the original training data, including the validation data, to maximize the amount of training data. Then, we tested it with the best threshold value.
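The threshold tuning described above is a one-dimensional grid search, sketched below in Python. The function `grid_search_threshold` and the `evaluate` callable are hypothetical names for this illustration; in practice, `evaluate` would run the unknown-detecting method on the validation data and return open-set SGDet@100, whereas a toy objective stands in for it here.

```python
def grid_search_threshold(evaluate, start=0.1, stop=0.9, stride=0.1):
    """Return the threshold with the highest validation score and the grid.

    evaluate: callable mapping a threshold to a validation score
              (higher is better); ties are resolved in favor of the
              smallest threshold.
    """
    n_steps = round((stop - start) / stride)
    # Rounding avoids floating-point drift such as 0.30000000000000004.
    candidates = [round(start + i * stride, 10) for i in range(n_steps + 1)]
    best = max(candidates, key=evaluate)
    return best, candidates


def toy_score(threshold):
    """Stand-in for a validation metric; peaks at a threshold of 0.3."""
    return -(threshold - 0.3) ** 2


best_threshold, grid = grid_search_threshold(toy_score)
```

After the best value is found on the validation split, the method is retrained on the full training data and tested with that fixed threshold, as described above.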

1) Closed-/open-set recalls
First, we show the quantitative results of the previous metrics, i.e., closed-set recalls, computed over the testing data of our new training-testing data split of the VG dataset for each previous closed-set method, in Table 2. Here, we computed each metric for each image and averaged over all images in the testing data while using the same K values 20, 50, and 100 as previous studies [13], [14]. We can see that the metric values are close to previously-reported results [13] for the original split defined by the IMP preprocessing, where unknown objects were not considered, indicating that our new split of the VG dataset itself does not affect the closedset performance so much.
Next, we show the results measured by the proposed new metrics, i.e., open-set recalls, in Table 3. Here, we compare  both the original versions of the previous methods and their unknown-detecting versions (denoted by the suffix "+"). We first observe that the scores of the original methods were significantly lower than those in the closed-set setting in Table 2. This result reveals the limitation of the closed-set SGG evaluation protocols used in previous studies for open-set SGG, i.e., they can yield unrealistically high-performance scores in the presence of unknown objects, which is often the case in practice and thus problematic in applications. We believe that our new open-set evaluation protocol better reflects the realworld performance of SGG. Another observation is that each method's performance could be improved consistently for all metrics when the thresholding-based unknown detection was enabled. Thus, despite being simple, the thresholding technique can be effectively used to build baseline methods of open-set SGG by turning any closed-set method into an unknown-aware version.
Overall, the methods based on the VCTree model achieved the highest performance in both the closed- and open-set settings, although the open-set methodology proposed in this paper is orthogonal to models and can be applied to any newer methods.

2) Open-set object/relationship counts
To perform a quantitative analysis of open-set SGG in more detail, we invoke another collection of new metrics, i.e., OSOC. Similarly to recalls, each count-based metric was computed for each image and averaged over all testing images. We show a plot of the results only where K is equal to 100 in Fig. 4, as we observed that other K values yielded similar results. From this plot, we can clearly see that each unknown-aware version successfully recovered a significant proportion of the unknown objects (areas of "1-correct" of the "+"-suffixed methods) that were wrongly classified into known classes by their original versions ("1-wrong" areas of the non-suffixed methods). This demonstrates the effectiveness of the simple thresholding-based unknown detection in dealing with unknown objects in SGG. Meanwhile, the unknown-aware versions slightly reduced the number of correctly-classified known objects ("0-correct"). This result can be considered a side effect of the unknown detection, suggesting room for improvement. We also observe that the simple thresholding could not recover undetected objects, which were treated as background ("1-background"). Overcoming this limitation requires redesigning the object detection part of each model, which is another future research direction.
FIGURE 4: Open-set object counts@100. The plot shows the number of objects per image in the categories 0-correct, 0-wrong, 0-background, 1-correct, 1-wrong, and 1-background. The "+" suffix indicates the methods with thresholding-based unknown detection. The "0-" and "1-" prefixes indicate known and unknown ground-truth objects, respectively.
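The breakdown of object counts discussed above can be sketched as follows. This is an interpretation based on the category names in Fig. 4, not the formal definition of the metric: each ground-truth object is assumed to be already matched to at most one prediction among the top-K results, with `None` meaning that no detection matched (i.e., the object was treated as background). The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

UNKNOWN = "unknown"

# (ground-truth label, matched predicted label or None if undetected)
Match = Tuple[str, Optional[str]]


def object_outcome(gt_label: str, pred_label: Optional[str]) -> str:
    """Map one ground-truth object to a category such as "1-wrong"."""
    prefix = "1" if gt_label == UNKNOWN else "0"  # 1: unknown, 0: known
    if pred_label is None:
        outcome = "background"   # no matched detection
    elif pred_label == gt_label:
        outcome = "correct"      # includes unknown predicted as "unknown"
    else:
        outcome = "wrong"        # detected but assigned another class
    return f"{prefix}-{outcome}"


def mean_object_counts(images: Sequence[List[Match]]) -> Dict[str, float]:
    """Count each category per image and average over all images."""
    totals: Counter = Counter()
    for matches in images:
        totals.update(object_outcome(g, p) for g, p in matches)
    return {cat: n / len(images) for cat, n in totals.items()}
```

Under this reading, relabeling a low-score prediction as "unknown" can move an object from "1-wrong" to "1-correct" or from "0-correct" to "0-wrong", but it cannot change any "background" entry, which is consistent with the observations above.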
We also plot the results of OSRC@100 in Fig. 5. We first observe that all the original methods without unknown detection could not detect most ground-truth relationships and treated them as background ("0/1/2-background"), confirming the well-known difficulty of SGG compared with object detection. Consequently, their unknown-detecting versions, which can only change the object classes in detected relationships, yielded only limited improvement. Still, these methods, especially VCTree+, managed to fix some wrongly-classified unknown objects ("1-correct" and "2-correct"). Expanding these areas of successful predictions is a major future issue for open-set SGG to go beyond the baseline established in this paper.
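The grouping of relationships by the number of unknown objects can be sketched as follows. Only the prefix rule is taken from the legend of Fig. 5; the outcome ("correct", "wrong", or "background") is assumed to be determined elsewhere and is passed in as a string. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

UNKNOWN = "unknown"


def relationship_bucket(subject_label: str, object_label: str,
                        outcome: str) -> str:
    """Prefix the outcome with the number of unknown objects (0, 1, or 2)
    in the ground-truth relationship."""
    n_unknown = int(subject_label == UNKNOWN) + int(object_label == UNKNOWN)
    return f"{n_unknown}-{outcome}"


def count_relationships(rels: Iterable[Tuple[str, str, str]]) -> Counter:
    """Tally buckets for (subject label, object label, outcome) records."""
    return Counter(relationship_bucket(s, o, out) for s, o, out in rels)
```

For example, a ground-truth relationship between a "man" and an unknown object that was not detected at all would fall into "1-background".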

3) Visualization
For qualitative evaluation, we visualize the ground-truth and predicted scene graphs on examples of testing images in Figs. 6 to 8. Here, we show the predictions by the best-performing model in Section IV-C(1), i.e., VCTree+. Following the IMP visualization [2], we show each predicted relationship only if both of its objects overlap with any ground-truth relationships, similarly to recall metrics such as SGDet. Here, to avoid cluttered visualization while focusing on open-set-specific factors, we consider only ground-truth relationships with any unknown object. Note that the object indices (e.g., "1" of "unknown1") in these figures are added just for the purpose of explanation and did not exist in either the ground-truth or the predicted graphs.
FIGURE 5: Open-set relationship counts@100. The plot shows the number of relationships per image in the categories 0-correct, 0-wrong, 0-background, 1-correct, 1-wrong, 1-background, 2-correct, 2-wrong, and 2-background. The "+" suffix indicates the methods with thresholding-based unknown detection. The "0-", "1-", and "2-" prefixes indicate the number of unknown objects in each ground-truth relationship.
In Fig. 6, the unknown-aware method successfully detected the keyboard object, along with its "on" relationship with the "desk" object. Meanwhile, the PC object ("unknown4" in the ground truth) under the desk could not be detected, which is the limitation of the current thresholding-based unknown detection that cannot recover the objects treated as background, as discussed in Section IV-C(2). In Fig. 7, the method succeeded in detecting relationships involving an unknown object ("unknown0" and "unknown1" in the ground truth and prediction, respectively), where the "man" and his "hand" are "holding" it. Here, the method could also find the reversed relationship that the object is "in" the "hand". Meanwhile, the "woman" was wrongly classified as "unknown", which illustrates the side effect of the unknown detection, i.e., the decreased number of correctly-classified known objects (the "0-correct" metric) observed in Section IV-C(2). In Fig. 8, the unknown object ("unknown5" and "unknown0", respectively) on the "building" wall was detected, but the relationship between the unknown object and the "building" object was predicted as "on" instead of the true class "in front of", which is not critically wrong semantically but still affects the performance measured by metrics such as SGDet. This kind of class ambiguity is also an issue in closed-set SGG and has been addressed in recent studies [13], [14], [48], and may also be of interest in the future research of open-set SGG.

V. CONCLUSION
Previous SGG studies have ignored the existence of unknown objects, and thus the real-world performance of SGG has been limited. In this paper, we addressed the new problem of open-set SGG, which allows us to detect unknown objects and also relationships involving them. Specifically, we formalized the problem and proposed an evaluation protocol including a dataset and metrics. We also presented the first experimental results on open-set SGG by comparing original and modified versions of previous methods to establish the baseline of open-set SGG. We believe that these contributions facilitate future research on this unexplored yet important problem and also extend the applicability of SGG to various real-world scenarios. Finally, we point out several future research directions of open-set SGG. While we employed the simple thresholding technique to build the baseline of unknown-aware versions, various unknown detection techniques have been proposed for open-set image recognition and object detection [38], [40], [44]. By appropriately combining these techniques with open-set SGG, we will be able to enhance unknown object detection and thereby relationship detection, hopefully dealing with the issues observed in Section IV, i.e., known objects classified into the "unknown" class and completely undetected objects. Meanwhile, importing the actively developed techniques of conventional closed-set SGG into open-set SGG, e.g., network architectures, losses, and learning techniques, which are orthogonal to this study, is also important to enhance the performance against both known and unknown objects. Among them, we believe that techniques to deal with the ambiguity of relationship classes [13], [14], [48], which we observed in Section IV-C(3), are particularly beneficial for open-set SGG.
Although relationships have relatively less variety compared with objects, allowing unknown relationship classes as a further generalization of open-set SGG may also help mitigate difficulties due to ambiguous annotations inherent in large-scale SGG datasets. Inspired by attribute-based zero-shot learning [24], the use of object attributes, which are already available in datasets like VG [6] but currently unexploited for SGG, may be useful in the classification of unknown objects, e.g., by distinguishing unknown classes using their common visual attributes. Recent advancements in the use of 3D-spatial and temporal information [31]-[33] may further benefit open-set SGG targeted at real-world applications such as robot navigation.