Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

Visual relationship detection aims to reason over relationships among salient objects in images and has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning about visual relationships of objects in images, yet it is rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code is available at https://github.com/coldmanck/RVL-BERT.


I. INTRODUCTION
Visual relationship detection (VRD) aims to detect objects and classify triplets of subject-predicate-object in a query image. It is a crucial task for enabling an intelligent system to understand the content of images, and has received much attention over the past few years [1]-[18]. Based on VRD, Xu et al. [19] proposed scene graph generation (SGG) [20]-[29], which aims to extract a comprehensive and symbolic graph representation of an image, with vertices and edges denoting instances and visual relationships, respectively. We focus on and use the term VRD throughout this paper for consistency. VRD is beneficial to various downstream tasks, including but not limited to image captioning [30], [31] and visual question answering [32], [33].
The associate editor coordinating the review of this manuscript and approving it for publication was Eduardo Rosa-Molinar.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To enhance the performance of VRD systems, some recent works incorporate external linguistic commonsense knowledge from pre-trained word vectors [3], structured knowledge bases [15], raw language corpora [9], etc., as priors, taking inspiration from human reasoning mechanisms. For instance, for the relationship triplet person-ride-bike shown in Figure 1, linguistic commonsense indicates that the predicate ride describes the relationship between person and bike more accurately than other relational descriptions like on or above, which are rather abstract. In addition, we argue that external visual commonsense knowledge is also beneficial to the detection performance of VRD models, yet it has rarely been considered previously. Take the same person-ride-bike in Figure 1 as an example. If the pixels inside the bounding box of person are masked (zeroed) out, humans can still predict them as a person since we have seen many examples and have plenty of visual commonsense regarding such cases. This reasoning process would be helpful for VRD systems since it incorporates relationships of the basic visual elements; however, most previous approaches learn visual knowledge only from target datasets and neglect external visual commonsense knowledge in abundant unlabeled data. Inspired by the recent successful visual-linguistic pre-training methods (BERT-like models) [34], [35], we propose to exploit both linguistic and visual commonsense knowledge from Conceptual Captions [36], a large-scale dataset containing 3.3M images with coarsely-annotated descriptions (alt-text) crawled from the web, to boost VRD performance.
We first pre-train our backbone model (multimodal BERT) on Conceptual Captions with different pretext tasks to learn the visual and linguistic commonsense knowledge. Specifically, our model mines visual prior information by learning to predict labels for an image's subregions that are randomly masked out. The model also acquires linguistic commonsense knowledge by learning to predict randomly masked out words of sentences in image captions. The pre-trained weights are then used to initialize the backbone model, which is trained together with additional modules (detailed below) on visual relationship datasets. Besides visual and linguistic knowledge, spatial features are also important cues for reasoning over object relationships in images. For instance, for A-on-B, the bounding box (or its center point) of A is often above that of B. However, such spatial information is not explicitly considered in BERT-like visual-linguistic models [34], [35], [37]. We thus design two additional modules to help our model better utilize such information: a mask attention module and a spatial module. The former predicts soft attention maps of target objects, which are then used to enhance visual features by focusing on target regions while suppressing unrelated areas; the latter augments the final features with bounding box coordinates to explicitly take spatial information into account.
We integrate the aforementioned designs into a novel VRD model, named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT).
RVL-BERT makes use of the pre-trained visual-linguistic representations as the source of visual and language knowledge to facilitate the learning and reasoning process on the downstream VRD task. It also incorporates a novel mask attention module to actively focus on the object locations in the input images and a spatial module to capture spatial relationships more accurately. Moreover, RVL-BERT is flexible in that it can be placed on top of any object detection model.
Our contribution in this paper is three-fold. Firstly, we are among the first to identify the benefit of visual-linguistic commonsense knowledge to visual relationship detection, especially when objects are occluded. Secondly, we propose RVL-BERT, a multimodal VRD model pre-trained on visual-linguistic commonsense knowledge bases that learns to predict visual relationships with the attentions among visual and linguistic elements, with the aid of the spatial and mask attention modules. Finally, we show through extensive experiments that the commonsense knowledge and the proposed modules effectively improve the model performance, and our RVL-BERT achieves competitive results on two VRD datasets.

II. RELATED WORK

A. VISUAL RELATIONSHIP DETECTION
Visual relationship detection (VRD) is the task of reasoning over the relationships between salient objects in images. Understanding relationships between objects is essential for other vision tasks such as action detection [38], [39] and image retrieval [40], [41]. Recently, linguistic knowledge has been incorporated as a guidance signal for VRD systems. For instance, [3] proposed to detect objects and predicates individually with language priors and fuse them into a higher-level representation for classification. [6] exploited statistical dependency between object categories and predicates to infer their subtle relationships. Going one step further, [15] proposed a dedicated module utilizing a bi-directional Gated Recurrent Unit to encode external language knowledge and a Dynamic Memory Network [42] to pick out the most relevant facts.
However, none of the aforementioned works consider external visual commonsense knowledge, which is also beneficial to relationship recognition. By contrast, we propose to exploit the abundant visual commonsense knowledge from multimodal Transformers [43] learned in pre-training tasks to facilitate the relationship detection in addition to the linguistic prior.

B. REPRESENTATION PRE-TRAINING
In the past few years, self-supervised learning which utilizes its own unlabeled data for supervision has been widely applied in representation pre-training. BERT, ELMo [44] and GPT-3 [45] are representative language models that perform self-supervised pre-training on various pretext tasks with either Transformer blocks or bidirectional LSTM. More recently, increasing attention has been drawn to multimodal (especially visual and linguistic) pre-training.
Based on BERT, Visual-Linguistic BERT (VL-BERT) [37] pre-trains a single stream of cross-modality transformer layers from not only image captioning datasets but also language corpora. It is trained on BooksCorpus [46] and English Wikipedia in addition to Conceptual Captions [36]. We refer interested readers to [37] for more details of VL-BERT.
In this work, we utilize both visual and linguistic commonsense knowledge learned in the pretext tasks. While VL-BERT can be applied to VRD without much modification, we show experimentally that it does not perform well due to a lack of attention to spatial features. By contrast, we propose to enable knowledge transfer for boosting detection accuracy and use two novel modules to explicitly exploit spatial features.

III. METHODOLOGY
A. REVISITING BERT AND VL-BERT
Let a sequence of N embeddings $x = \{x_1, x_2, \ldots, x_N\}$ be the features of input sentence words, which are the summation of token, segment and position embeddings as defined in BERT [47]. The BERT model takes in $x$ and utilizes a sequence of $n$ multi-layer bidirectional Transformers [43] to learn contextual relations between words. Let the input feature at layer $l$ be denoted as $x^l = \{x^l_1, x^l_2, \ldots, x^l_N\}$. The feature of $x$ at layer $(l+1)$, denoted as $x^{l+1}$, is computed through a Transformer layer which consists of two sub-layers: 1) a multi-head self-attention layer plus a residual connection, where $A^m_{i,j} \propto (Q^{l+1}_m x^l_i)^T (K^{l+1}_m x^l_j)$ represents a normalized dot product attention mechanism between the i-th and the j-th feature at the m-th head, and 2) a position-wise fully connected network plus a residual connection, where GELU is an activation function named Gaussian Error Linear Unit [48]. Note that Q (Query), K (Key), V (Value) are learnable embeddings for the attention mechanism, and W and b are learnable weights and biases, respectively.
Based on BERT, VL-BERT [37] adds O more multi-layer Transformers to take in O additional visual features. The input embedding becomes $x = \{x_1, \ldots, x_N, x_{N+1}, \ldots, x_{N+O}\}$, which is computed by the summation of not only the token, segment and position embeddings but also an additional visual feature embedding which is generated from the bounding box of each corresponding word. The model is then pre-trained on two types of pretext tasks to learn the visual-linguistic knowledge: 1) masked language modeling with visual clues, which predicts a randomly masked word in a sentence with image features, and 2) masked RoI classification with linguistic clues, which predicts the category of a randomly masked region of interest (RoI) with linguistic information.

B. OVERVIEW OF PROPOSED MODEL
Figure 2 shows the overall architecture of our proposed RVL-BERT. For the backbone BERT model, we adopt a 12-layer Transformer and initialize it with the pre-trained weights of VL-BERT for visual and linguistic commonsense knowledge. Note that while our model is inspired by VL-BERT, it differs in several important aspects: 1) RVL-BERT explicitly arranges query object pairs in sequences of subject-predicate-object (instead of sentences in the original design) and receives an extra answer segment for relationship prediction. 2) Our model is equipped with a novel mask attention module that learns attention-guided visual feature embeddings for the model to attend to target object-related areas. 3) A simple yet effective spatial module is added to capture the spatial representation of subjects and objects, which is important in spatial relationship detection.
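Each Transformer layer of the backbone applies the two sub-layers revisited in Section III-A. The following is a minimal single-head sketch of one such layer, given only as an illustration: the softmax normalization of the attention scores, the omission of layer normalization and of multiple heads, and all function names are assumptions of this sketch rather than details taken from the released implementation.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Normalize raw attention scores so that they sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(z):
    """Gaussian Error Linear Unit, exact form based on the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def attention_weights(xs, Q, K):
    """A[i][j] is proportional to exp((Q x_i)^T (K x_j)), normalized over j."""
    qs = [matvec(Q, x) for x in xs]
    ks = [matvec(K, x) for x in xs]
    return [softmax([sum(a * b for a, b in zip(q, k)) for k in ks]) for q in qs]


def transformer_layer(xs, Q, K, V, W1, b1, W2, b2):
    """Self-attention plus residual, then position-wise network plus residual."""
    A = attention_weights(xs, Q, K)
    vs = [matvec(V, x) for x in xs]
    hidden = []
    for i, x in enumerate(xs):
        mixed = [sum(A[i][j] * vs[j][d] for j in range(len(xs)))
                 for d in range(len(x))]
        hidden.append([a + b for a, b in zip(x, mixed)])  # residual connection
    out = []
    for h in hidden:
        inner = [gelu(z + b) for z, b in zip(matvec(W1, h), b1)]
        ffn = [z + b for z, b in zip(matvec(W2, inner), b2)]
        out.append([a + b for a, b in zip(h, ffn)])  # residual connection
    return out
```

Because every element attends to every other element, linguistic and visual elements placed in the same sequence can exchange information in each layer.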
Let N, A and O denote the number of elements for the relationship linguistic segment, the answer segment, and the relationship visual segment, respectively. Our model consists of N + A + O multi-layer Transformers, which take in a sequence of linguistic and visual elements, including the output from the mask attention module, and learn the context of each element from all of its surrounding elements. For instance, as shown in Figure 2, to learn the representation of the linguistic element goose, the model looks at not only the other linguistic elements (e.g., to the right of and window) but also all visual elements (e.g., goose, window). Along with the multi-layer Transformers, the spatial module extracts the location information of subjects and objects using their bounding box coordinates. Finally, the output representation of the element in the answer segment, $h_{so}$, is augmented with the output of the spatial module, $C_{so}$, followed by classification with a 2-layer fully connected network.
The input to the model can be divided into three groups by the type of segment, or four groups by the type of embedding. We explain our model below from the segment-view and the embedding-view, respectively.

1) INPUT SEGMENTS
For each input example, RVL-BERT receives a relationship linguistic segment, an answer segment, and a relationship visual segment as input.
a) Relationship linguistic segment (light blue elements in Figure 2) is the linguistic information in triplet form subject-predicate-object, like the input form of the SpatialSense dataset [49], or in doublet form subject-object, like the input of the VRD dataset [3]. Note that each term in the triplet or doublet may have more than one element, such as to the right of. This segment starts with a special element "[CLS]".
c) Relationship visual segment (tangerine elements in Figure 2) is the visual information of a relationship instance, also taking the form of triplets or doublets but with each component term corresponding to only one element, even if the corresponding label contains more than one word.
FIGURE 2. Architecture illustration of the proposed RVL-BERT for the SpatialSense dataset [49]. It can be easily adapted to the VRD dataset [3] by replacing triplets subject-predicate-object with doublets subject-object and performing predicate classification instead of binary classification on the output feature of "[MASK]".
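The arrangement of one relationship instance into segments can be sketched as follows. This is an illustration under stated assumptions: whitespace splitting stands in for the actual tokenizer, the function name `build_sequence` is hypothetical, only the "[CLS]" and "[MASK]" elements are taken from the text and the Figure 2 caption, and any separator elements are left out.

```python
def build_sequence(subject, obj, predicate=None):
    """Arrange one relationship instance into (element, segment) pairs.

    A triplet input (predicate given) mirrors SpatialSense; a doublet input
    (predicate omitted) mirrors VRD. Whitespace splitting is an assumption
    standing in for the real tokenizer.
    """
    terms = [subject] + ([predicate] if predicate else []) + [obj]
    sequence = [("[CLS]", "linguistic")]
    for term in terms:
        # A multi-word term such as "to the right of" yields several elements.
        sequence += [(word, "linguistic") for word in term.split()]
    # The answer segment holds the element whose output feature is classified.
    sequence.append(("[MASK]", "answer"))
    # Each term maps to exactly one visual element, however many words it has.
    sequence += [(term, "visual") for term in terms]
    return sequence


triplet = build_sequence("goose", "window", predicate="to the right of")
doublet = build_sequence("person", "bike")
```

In the triplet example, the predicate contributes four linguistic elements but a single visual element, which matches the asymmetry between the two segments described above.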

2) INPUT EMBEDDINGS
There are four types of input embeddings: token embedding $t$, segment embedding $s$, position embedding $p$, and (attention-guided) visual feature embedding $v$. Among them, the attention-guided visual feature embedding is newly introduced, while the others follow the original design of VL-BERT. We denote the input of RVL-BERT as $x$.
a) Token Embedding. We transform each of the input words into a d-dimensional feature vector using WordPiece embeddings [50] comprising 30,000 distinct words. In this sense, our model is flexible since it can take in any object label with any combination of words available in WordPiece. Note that for object/predicate names with more than one word, the exact same number of embeddings is used. For the i-th object/predicate name in an input image, we denote the token embedding as $t = \{t_1, \ldots, t_N, t_{N+1}, \ldots\}$.
d) Visual Feature Embedding. These embeddings inform the model of the internal visual knowledge of each input word. Given an input image and a set of RoIs, a CNN backbone is used to extract the feature map prior to the output layer, followed by RoI Align [51] to produce fixed-size feature vectors $z = \{z_0, z_1, \ldots, z_K\}$, $z_i \in \mathbb{R}^d$ for K RoIs, where $z_0$ denotes the feature of the whole image. For triplet inputs, we additionally generate $K(K-1)$ features for all possible union bounding boxes: $u = \{u_1, \ldots, u_{K(K-1)}\}$, $u_i \in \mathbb{R}^d$. We denote the input visual feature embedding as $v$. To better capture distinct visual information for elements in the relationship linguistic segment, we propose a mask attention module that learns to generate attention-guided visual feature embeddings attending to important (related) regions, as detailed below.
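The count of $K(K-1)$ union bounding boxes corresponds to one box per ordered pair of distinct RoIs. A minimal sketch of that enumeration is given below, assuming boxes in `(x1, y1, x2, y2)` corner format; the function names are hypothetical.

```python
from itertools import permutations


def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def all_union_boxes(rois):
    """Union box for every ordered (subject, object) pair of distinct RoIs."""
    return {(i, j): union_box(rois[i], rois[j])
            for i, j in permutations(range(len(rois)), 2)}


rois = [(0, 0, 2, 2), (1, 1, 3, 4), (5, 5, 6, 6)]
unions = all_union_boxes(rois)  # K = 3 RoIs give K * (K - 1) = 6 entries
```

The union box of a pair is the same in both orders; keeping ordered pairs simply reflects that subject-object and object-subject are distinct relationship queries.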

C. MASK ATTENTION MODULE
An illustration of the mask attention module is shown in Figure 3. Denote the visual feature (the feature map before average pooling) used by the mask attention module as $v_s \in \mathbb{R}^{d_c \times d_w \times d_h}$, where $d_c$, $d_w$, $d_h$ stand for the dimension of the channel, width, and height, respectively. To generate the feature for an object $s$ (e.g., goose in Figure 3), the mask attention module takes in the visual feature $v_s$ and the word embedding $w_s$ and projects them into the same dimension using a standard CNN and a replication process, respectively, where Replication(·) replicates the input vector of size $d$ into a feature map of dimension $d \times d_w \times d_h$. (For object labels with more than one word, the embeddings of each word are element-wise summed in advance.) This is followed by element-wise addition to fuse the features, then two convolutional layers and a re-scaling process to generate the attention mask $m_s$, where the min-max Norm(·) applied to each element is defined by $\mathrm{Norm}(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)}$. The weights W and biases b of the convolutional layers are learnable. The attention-guided visual feature $v^{att}_s$ is then obtained by performing a Hadamard product between the visual feature and the attention mask: $v^{att}_s = v_s \circ m_s$. Finally, $v^{att}_s$ is pooled into a vector in $\mathbb{R}^d$ to be used in $\{v_2, \ldots, v_{N-1}\}$. To learn to predict the attention masks, we train the module against the Mean Squared Error (MSE) loss between the mask $m_s$ and the resized ground truth mask $b_s$, which consists of all ones inside the bounding box and all zeros outside: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{d_w d_h} \sum_{i=1}^{d_w} \sum_{j=1}^{d_h} \left(m_s^{ij} - b_s^{ij}\right)^2$, where $d_w$, $d_h$ denote the width and height of the attention mask.
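The re-scaling, masking, and supervision steps of the mask attention module can be sketched on a single-channel grid as follows. This is an illustrative sketch only: the convolutional layers are left out, the rule that a grid cell counts as inside the box when its center falls inside is an assumption about how the ground truth mask is resized, and all function names are hypothetical.

```python
def min_max_norm(grid):
    """Rescale a 2-D grid to [0, 1] using (x - min) / (max - min)."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:  # constant input; returning zeros is an assumption
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / span for v in row] for row in grid]


def box_mask(box, image_w, image_h, d_w=14, d_h=14):
    """Ground truth mask: one where a cell center lies inside the box."""
    x1, y1, x2, y2 = box
    mask = []
    for r in range(d_h):
        cy = (r + 0.5) * image_h / d_h
        row = []
        for c in range(d_w):
            cx = (c + 0.5) * image_w / d_w
            row.append(1.0 if x1 <= cx <= x2 and y1 <= cy <= y2 else 0.0)
        mask.append(row)
    return mask


def mse(pred, target):
    """Mean squared error over all cells of two equally sized grids."""
    n = sum(len(row) for row in pred)
    return sum((p - t) ** 2
               for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / n


def apply_mask(feature, mask):
    """Hadamard product of one feature-map channel with the attention mask."""
    return [[f * m for f, m in zip(fr, mr)] for fr, mr in zip(feature, mask)]
```

With a mask close to the ground truth box, `apply_mask` keeps the activations inside the object region and suppresses those outside it, which is the intended effect of the attention-guided feature.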

D. SPATIAL MODULE
The spatial module aims to augment the output representation with spatial knowledge by paying attention to bounding box coordinates. See the top part of Figure 2 for its pipeline.
The spatial module takes in the coordinate vectors of a subject $s$ and an object $o$, and encodes them using linear layers followed by element-wise addition fusion and a two-layer fully connected network. The output feature $C_{so}$ is then concatenated with the multimodal feature $h_{so}$ to produce $f_{so}$ for answer classification: $f_{so} = [C_{so}; h_{so}]$ (12).
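The pipeline described for the spatial module can be sketched as below. The exact coordinate encoding, the layer sizes, and the position of the nonlinearity are not recoverable from the text, so the parameter names (`W_s`, `W_o`, `W_1`, `W_2` and their biases), the single ReLU between the two fully connected layers, and the function names are assumptions of this sketch.

```python
def linear(W, b, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def spatial_module(c_s, c_o, params):
    """Encode subject/object coordinates, fuse by addition, apply two layers."""
    enc_s = linear(params["W_s"], params["b_s"], c_s)
    enc_o = linear(params["W_o"], params["b_o"], c_o)
    fused = [a + b for a, b in zip(enc_s, enc_o)]  # element-wise addition
    hidden = relu(linear(params["W_1"], params["b_1"], fused))
    return linear(params["W_2"], params["b_2"], hidden)


def concat_features(C_so, h_so):
    """Concatenate the spatial feature with the multimodal feature."""
    return list(C_so) + list(h_so)
```

The concatenated vector is what the final two-layer classifier receives, so the spatial cue reaches the classifier without passing through the Transformer layers.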

IV. EXPERIMENTS

A. DATASETS
We first ablate our proposed model on the VRD dataset [3], which is the most widely used benchmark. For comparison with previous methods, we also evaluate on the SpatialSense dataset [49]. Compared with the Visual Genome (VG) dataset [52], SpatialSense suffers less from the dataset language bias problem, which is considered a distractor for performance evaluation: in VG, the visual relationship can be "guessed" even without looking at the input image [21], [25].

1) VRD
The VRD dataset consists of 5,000 images with 37,993 visual relationships. We follow [3] to divide the dataset into a training set of 4,000 images and a test set of 1,000 images, of which only 3,780 and 955 images, respectively, are annotated with relations. For all possible pairs of objects in an image, our model chooses the best-scoring predicate and records its score; the scores are then used to rank all predictions in descending order. Since the visual relationship annotations in this dataset are far from exhaustive, we cannot use precision or average precision, as they would penalize correct detections without corresponding ground truth. Traditionally, Recall@K is adopted to bypass this problem, and we follow this practice throughout our experiments. For VRD, the task named Predicate Detection/Classification measures the accuracy of predicate prediction given ground truth classes and bounding boxes of subjects and objects, independent of the object detection accuracy. Following [3], [7], we use Recall@K, i.e., the fraction of ground truth relations that are recalled in the top K candidates. K is usually set to 50 or 100 in the literature.
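Recall@K as described above can be sketched for a single image as follows, keeping the K highest-scoring candidates. The data layout (a list of `(score, triplet)` pairs and a set of ground truth triplets) and the function name are assumptions for illustration; how hits are aggregated over the whole test set is not covered here.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground truth triplets found among the top-k predictions.

    predictions: list of (score, triplet) pairs; ground_truth: set of triplets.
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)


predictions = [
    (0.9, ("person", "ride", "bike")),
    (0.2, ("person", "wear", "helmet")),
    (0.6, ("bike", "on", "street")),
]
ground_truth = {("person", "ride", "bike"), ("person", "wear", "helmet")}
score = recall_at_k(predictions, ground_truth, k=2)
```

Note that the unannotated but plausible prediction bike-on-street does not lower the score by itself; it only matters insofar as it displaces an annotated triplet from the top K, which is why recall is preferred over precision here.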

2) SPATIALSENSE
SpatialSense is a relatively new visual relationship dataset focusing especially on spatial relations. Different from Visual Genome [52], SpatialSense is dedicated to reducing dataset bias via a novel data annotation approach called Adversarial Crowdsourcing, which prompts annotators to choose relation instances that are hard to guess by only looking at object names and bounding box coordinates. SpatialSense defines nine spatial relationships (above, behind, in, in front of, next to, on, to the left of, to the right of, and under) and contains 17,498 visual relationships in 11,569 images. The task on SpatialSense is binary classification on given visual relationship triplets of images, namely judging whether a triplet subject-predicate-object holds for the input image. Since in SpatialSense the number of "True" examples equals that of "False" examples, the classification accuracy can be used as a fair measure. We follow the original split in [49] to divide them into 13,876 and 3,622 relations for training and test purposes, respectively.

B. IMPLEMENTATION
For the backbone model, we use $\text{BERT}_{\text{BASE}}$ pre-trained on three datasets: Conceptual Captions [36], BooksCorpus [46] and English Wikipedia. For extracting visual embedding features, we adopt Fast R-CNN [53] (the detection branch of Faster R-CNN [54]). We randomly initialize the final two fully connected layers and the newly proposed modules (i.e., the mask attention module and the spatial module). During training, we find that our model empirically gives the best performance when the parameters of the backbone model are frozen and only the newly introduced modules are trained. We thus obtain a lightweight model compared to the original VL-BERT, as the number of trainable parameters is reduced by around 96%, i.e., down from 161.5M to 6.9M and from 160.9M to 6.4M when trained on the SpatialSense dataset and the VRD dataset, respectively.
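The "around 96%" figure follows directly from the reported parameter counts, as the short check below shows. The helper name is hypothetical; the four numbers are the ones quoted above.

```python
def reduction(total_millions, trainable_millions):
    """Fraction of parameters that no longer need to be trained."""
    return 1.0 - trainable_millions / total_millions


spatialsense = reduction(161.5, 6.9)  # roughly 0.957
vrd = reduction(160.9, 6.4)           # roughly 0.960
```

Both settings round to a reduction of about 96% in trainable parameters.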
ReLU is used as the nonlinear activation function $\sigma$. We use $d = 768$ for all types of input embeddings, $d_c = 2048$ for the channel dimension of the input feature map and $d_w = d_h = 14$ for the attention mask in the mask attention module. The training loss is the sum of the softmax cross-entropy loss for answer classification and the MSE loss for the mask attention module. The experiments were conducted on a single NVIDIA Quadro RTX 6000 GPU in an end-to-end manner using the Adam [55] optimizer with an initial learning rate of $1 \times 10^{-4}$ after linear warm-up over the first 500 steps, a weight decay of $1 \times 10^{-4}$, and exponential decay rates of 0.9 and 0.999 for the first- and second-moment estimates, respectively. We trained our model for 60 and 45 epochs for the VRD and SpatialSense datasets, respectively, as there are more images in the training split of SpatialSense. For experiments on the VRD dataset, we followed the training practice in [14] to train with an additional "no relationship" predicate, and for each image we sample 32 relationships with the ratio of ground truth relations to negative relations being 1 : 3.
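Two of the training details above, the linear warm-up and the 1 : 3 sampling of ground truth to negative relations, can be sketched as follows. This is an illustration under assumptions: the learning rate is held constant after warm-up because no later schedule is stated, images with too few candidates simply yield fewer samples, and the function names are hypothetical.

```python
import random


def warmup_lr(step, base_lr=1e-4, warmup_steps=500):
    """Linearly ramp the learning rate to base_lr over the warm-up steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # assumption: constant afterwards


def sample_relationships(positives, negatives, batch=32, rng=None):
    """Sample up to batch relationships with a 1 : 3 positive/negative ratio."""
    rng = rng or random.Random(0)
    n_pos = min(batch // 4, len(positives))
    n_neg = min(batch - batch // 4, len(negatives))
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)


positives = [("gt", i) for i in range(20)]
negatives = [("no relationship", i) for i in range(100)]
pos, neg = sample_relationships(positives, negatives)
```

With 32 samples per image, the ratio corresponds to 8 ground truth relations and 24 "no relationship" pairs whenever enough candidates are available.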

1) TRAINING OBJECTIVE FOR MASK ATTENTION MODULE
We first compare the performance of training the mask attention module (MAM) against the MSE loss and against the binary cross entropy (BCE) loss. The first two rows of Table 1 show that MSE outperforms BCE by a relative 3.8% on Recall@50. We also observe that training with BCE is relatively unstable, as it is prone to gradient explosion under the same setting.

2) FEATURE COMBINATION
We also experiment with different ways of feature combination, namely element-wise addition and concatenation of the features. To perform the experiments, we modify Eqn. 12 as $f_{so} = \alpha C_{so} + (1 - \alpha) h_{so}$, and we experiment with different $\alpha$ values (0.3, 0.5 and 0.7). The last four rows of Table 1 show that concatenation performs slightly better than addition under all $\alpha$ values.
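The two combination schemes compared here can be written side by side as a small sketch. The function names are hypothetical, and the requirement that both features have the same length for addition is made explicit as a check.

```python
def combine_add(C_so, h_so, alpha):
    """Weighted element-wise addition: alpha * C_so + (1 - alpha) * h_so."""
    if len(C_so) != len(h_so):
        raise ValueError("addition requires features of equal dimension")
    return [alpha * c + (1 - alpha) * h for c, h in zip(C_so, h_so)]


def combine_concat(C_so, h_so):
    """Concatenation keeps both features intact and doubles the dimension."""
    return list(C_so) + list(h_so)


C_so, h_so = [2.0, 4.0], [0.0, 0.0]
added = combine_add(C_so, h_so, alpha=0.5)
stacked = combine_concat(C_so, h_so)
```

Addition forces the spatial and multimodal features into a shared space with a fixed mixing weight, whereas concatenation lets the classifier learn how to weigh each dimension; this is one plausible reading of why concatenation performs slightly better in Table 1.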
The setting in the second row of Table 1 empirically gives the best performance, and thus we stick to this setting for the following experiments.

3) MODULE EFFECTIVENESS
We ablate the training strategy and the modules in our model to study their effectiveness. VL indicates that RVL-BERT utilizes the external multimodal knowledge learned in the pretext tasks via weight initialization. Spatial (S) denotes the spatial module, while Mask Att. (M) stands for the mask attention module. Table 2 shows that each module effectively helps boost the performance. The visual-linguistic commonsense knowledge lifts the Basic model by 12% (or absolute 5%) of Recall@50 on the VRD dataset, while the spatial module further boosts the model by more than 23% (or absolute 10%). As the effect of the mask attention module is not apparent on the VRD dataset (0.2% improvement), we also experiment on the SpatialSense dataset (Overall Accuracy) and find that the mask attention module provides a relative 1% boost in accuracy.

D. QUANTITATIVE RESULTS ON VRD DATASET
We conduct experiments on the VRD dataset to compare our method with existing approaches. Visual Phrase [1] represents visual relationships as visual phrases and learns appearance vectors for each category for classification. Joint CNN [3] classifies the objects and predicates using only visual features from bounding boxes. VTransE [7] projects objects and predicates into a low-dimensional space and models visual relationships as a vector translation. PPR-FCN [10] uses fully convolutional layers to perform relationship detection. Language Priors [3] utilizes individual detectors for objects and predicates and combines the results for classification. Zoom-Net [11] introduces new RoI Pooling cells to perform message passing between local objects and global predicate features. TFR [12] performs a factorization process on the training data and derives relational priors to be used in VRD. Weakly [8] adopts a weakly-supervised clustering model to learn relations from image-level labels. LK Distillation [9] introduces external knowledge with a teacher-student knowledge distillation framework. Jung et al. [13] propose a new spatial vector with element-wise feature combination to improve the performance. UVTransE [14] extends the idea of vector translation in VTransE with the contextual information of the bounding boxes. MF-URLN [16] uses external linguistic knowledge and internal statistics to explore undetermined relationships. HGAT [18] proposes a TransE-based multi-head attention approach performed on a fully-connected graph.
Table 3 shows the performance comparison on the VRD dataset. It can be seen that our RVL-BERT achieves competitive Recall@50/100 (53.07/55.55) compared to most of the existing methods, while it lags behind the latest state-of-the-art methods, such as MF-URLN and HGAT. We note that the additional graph attention network in HGAT and the confidence-weighting module in MF-URLN could possibly be incorporated into our design, which we leave for future work.

E. QUANTITATIVE RESULTS ON SPATIALSENSE DATASET
We compare our model with various recent methods, including some methods that have been compared in the VRD experiments. Note that L-baseline, S-baseline and L+S-baseline are baselines in [49] that take in simple language and/or spatial features and classify with fully-connected layers. ViP-CNN [4] utilizes a phrase-guided message passing structure to model relationship triplets. DR-Net [6] exploits statistical dependency between object classes and predicates. DSRR [17] is a concurrent work (published in August 2020) that exploits depth information for relationship detection with an additional depth estimation model. The Human Performance result is extracted from [49] for reference. Table 4 shows that our full model outperforms almost all existing approaches in terms of the overall accuracy and obtains the highest (e.g., in and in front of) or second-highest accuracy for several relationships. While the concurrent work DSRR achieves a slightly higher overall recall, we expect our model to gain another performance boost with the additional depth information introduced in their work.
Figure 4 shows qualitative comparisons between predicting visual relationships with and without the visual-linguistic commonsense knowledge in our model. In particular, example (a) in the figure shows that, with linguistic commonsense knowledge, a person is more likely than pants to wear a shoe. That is, the conditional probability p(wear|(person-shoe)) becomes higher, while p(wear|(pants-shoe)) becomes lower and does not show up in the top 100 confident triplets after observing the linguistic fact. The same applies to example (b) (where person-wear-pants is more appropriate than person-in front of-pants) and example (c) (where tower-has-clock is semantically better than tower-above-clock).
On the other hand, as the person in example (e) is visually occluded, the model without visual commonsense knowledge prefers dog-wear-shoe rather than person-wear-shoe; however, our model with the visual knowledge knows that the occluded part is likely to be a person and is able to make correct predictions. The same applies to example (d) (where both pillow and sofa are not clear) and example (f) (where person is obscure). These examples demonstrate the effectiveness of our training strategy of exploiting rich visual and linguistic commonsense knowledge by pre-training on unlabeled visual-linguistic datasets.
FIGURE 5. Attention map visualization of SpatialSense (the first two rows) and VRD dataset (the last two rows). For each example, the first row shows predicted attention maps while the second shows ground truth bounding boxes.

G. QUALITATIVE RESULTS OF MASK ATTENTION MODULE
The mask attention module aims to teach the model to learn and predict the attention maps emphasizing the locations of the given object labels. To study its effectiveness, we visualize the attention maps that are generated by the mask attention module during testing on both datasets in Figure 5. The first two rows show two examples from SpatialSense, while the last two rows show three examples from the VRD dataset. Since the input embeddings of the model differ between the SpatialSense dataset (triplets subject-predicate-object) and the VRD dataset (doublets subject-object), three and two attention maps are generated for each example, respectively.
For both datasets, the model actively attends to the region that contains the target object. For the triplet data from SpatialSense in particular, the model also looks at the union bounding boxes, which cover both subjects and objects. For example, for the top-left example head-in front of-man in Figure 5, mask attention first looks at the head of the person who is getting a haircut, then attends to the joint region of the head and the barber (man), and finally focuses on the barber. We also observe that, for objects (classes) of larger size, mask attention tends to look more widely at the whole image. For instance, water in the top-right example and sky in the bottom-right example attend to almost the whole image. We conjecture that this is because the larger size of objects or regions makes it harder to learn to focus on the specific target areas. In addition, we find that the mask attention module learns better with triplet inputs than with doublet inputs, presumably because the additional examples of union boxes provide more context and facilitate the learning process.

V. CONCLUSION
In this paper, we proposed a novel visual relationship detection system named RVL-BERT, which exploits visual commonsense knowledge in addition to linguistic knowledge learned during self-supervised pre-training. A novel mask attention module is designed to help the model learn to capture the distinct spatial information, and a spatial module is utilized to emphasize the bounding box coordinates. Our model is flexible in the sense that it can be used solely for predicate classification or cascaded with any state-of-the-art object detector. We have shown the effectiveness of the proposed modules and the competitive performance of our model through ablation studies and quantitative and qualitative experiments on two challenging visual relationship detection datasets. For future work, we propose that the underlying model (i.e., BERT) be re-designed to accommodate all proposals so that it can forward all pairs together instead of making pairwise predictions. We also anticipate that more visual and/or linguistic commonsense databases could be utilized for pre-training our model.

A. RESULTS ON VISUAL GENOME
We provide additional experimental results on Visual Genome (VG) dataset [52]. We follow [19], [21], [29] to adopt the most widely-used dataset split which consists of 108K images and includes the most frequent 150 object classes and 50 predicates. When evaluating visual relationship detection/scene graph generation on VG, there are three common evaluation modes including (1) Predicate Classification (PredCls): ground truth bounding boxes and object labels are given, (2) Scene Graph Classification (SGCls): only ground truth boxes given, and (3) Scene Graph Detection (SGDet): nothing other than input images is given. We experiment with PredCls, which is a similar setting to what we perform on VRD dataset [3] and SpatialSense [49] dataset.
Comparison results on VG are presented in Table 5, where our proposed RVL-BERT achieves competitive results. Note that, as mentioned in Section IV-A, visual relationship detection can be biased, as predicates could be "guessed" accurately given explicit correlations between object labels and predicates. Both SMN [21] and KERN [25] exploit this property and use the frequency bias and object co-occurrence, respectively. However, relying on this bias could in turn undermine the capability of generalization, as has been demonstrated by comparing mean recall in recent works (e.g., [29]).