Video Relationship Detection Using Mixture of Experts

Machine comprehension of visual information from images and videos by neural networks suffers from two limitations: (1) the computational and inference gap between vision and language in accurately determining which object a given agent acts on and then representing it by language, and (2) the shortcoming in stability and generalization of a classifier trained as a single, monolithic neural network. To address these limitations, we propose MoE-VRD, a novel approach to visual relationship detection via a mixture of experts. MoE-VRD recognizes language triplets in the form of a < subject, predicate, object > tuple to extract the relationship between subject, predicate, and object from visual processing. Since detecting a relationship between a subject (acting) and the object(s) (being acted upon) requires that the action be recognized, we base our network on recent work in visual relationship detection. To address the limitations associated with single monolithic networks, our mixture of experts is based on multiple small models, whose outputs are aggregated. That is, each expert in MoE-VRD is a visual relationship learner capable of detecting and tagging objects. MoE-VRD employs an ensemble of networks while preserving the complexity and computational cost of the original underlying visual relationship model by applying a sparsely-gated mixture of experts, which allows for conditional computation and a significant gain in neural network capacity. We show that the conditional computation capabilities and massive ability to scale of the mixture of experts lead to an approach to the visual relationship detection problem which outperforms the state-of-the-art.


Introduction
In the last decade, there has been a surge in research on the machine comprehension of visual information from images and video sequences. In particular, the application of large neural networks has allowed problems to be tackled such as video object segmentation [1][2][3][4], object recognition and classification [5][6][7][8][9], and action recognition [10][11][12][13][14]. This unprecedented progress in the comprehension of visual information, however, suffers from the computational and inference gap between vision and language [15] to accurately determine which object a given agent acts on and how it might represent it by language.
We can begin by drawing inspiration from the two-streams hypothesis of the brain and how it processes visual information, such that the brain distinguishes between a ventral stream (the "what" pathway) and a dorsal stream (the "where" or "how" pathway) [16]. In parallel with this distinction, natural languages contain two classes of verbs describing actions: manner verbs, describing how an action is performed by expressing cause, such as waving arms (implying cheering) or nodding head (for assent); as opposed to result verbs, that describe the result of an action by expressing their effect, such as move, heat, clean, enter etc. [17].
When an action results in nearby changes, the visual information processing problem consists of detecting the three inter-related entities of subject, predicate (action), and the object(s) involved; that is, to recognize language triplets in the form of a < subject, predicate, object > tuple. To detect the relationship between a subject (acting) and the object(s) (acted upon), the action must be recognized. Some recent approaches to visual relationship detection have focused on static images [24]; however, static relationship detection clearly has limitations in understanding the temporal constraints inherent in video sequences, which offer significant richness regarding relationships [24,25,15]. Therefore, there has been a significant research emphasis on detecting visual relationships in video sequences [25,26,24,27,28,29,15].
The main challenges associated with this problem stem from the very large datasets, high ambiguity, and huge amount of background clutter. Moreover, the objects involved may only barely be recognizable due to pose, motion blur, occlusion and lighting. On the other hand, large variance in predicate representations [25] also makes it difficult to learn latent patterns, thus it is essential to consider visual and spatial features, and language ambiguity with synonyms. There is also a combinatorial effect, in that the number of unique tuple classes can be exceptionally large (the product of the vocabulary of subjects, objects, and predicates).
In this article, we propose an approach to video visual relationship detection (VidVRD), implemented by a multi-expert framework, where each expert is trained using the same model, and where the outputs of all experts are gated based on a separate neural network. Our performance results show that our novel architecture substantially outperforms known state-of-the-art methods. The contributions of this work are thus two-fold: 1. We construct a novel multi-expert detection framework. 2. We capture recent developments in video visual relationship detection as experts in our proposed multi-expert architecture. Our proposed approach is not tuned to a particular choice of expert, and other choices of expert should be equally valid and applicable.
The rest of this paper is organized as follows: Section 2 further develops the VidVRD problem and overviews past work, Section 3 describes the encapsulation of an existing VidVRD approach into an expert [15], Section 4 describes the experiments and results which are discussed in Section 5.

Related work
Cascading failures caused by low-level misclassifications make a monolithic solution, such as a single very large neural network, not an ideal strategy to resolve large visual relationship detection problems. To address such limitations, earlier approaches proposed solutions such as dividing a given problem based on pre-processing and post-processing heuristics [25,15,24,30], to attempt to correctly classify predicates.
The capacity of a neural network to learn is limited by its degrees of freedom (number of parameters) and further limited by the available data. When datasets are, in fact, large enough, then increasing the number of parameters can lead to significant improvements in performance. However, for a typical deep learning model, where the entire network is activated on each training sample, the computational cost is roughly quadratic in the number of parameters, as both the model size and the number of training samples increase together [31][32][33]. This phenomenon is particularly and uniquely exacerbated in visual information processing since the input layer is very large. The statistical relationships of pixels and objects detected in images and videos are subtle, and the networks are often expected to perform multiple tasks like object segmentation and action recognition, all of which lead to significant increases in network size.
Limitations in computing power will eventually fall short of meeting the training demand, and the networks trained in this fashion tend to be brittle and sensitive to slight changes in the data distribution [34] and task specification [35].That is, current systems are better characterized as narrow experts rather than as resilient generalists [36].
With the above context in mind, we make the following observation about the VidVRD task: A training approach based on a single, monolithic network is not sufficient to have stable and generalizable classifiers, at least in certain problem contexts [34,36]. Therefore, applying divide-and-conquer by developing multiple small models and aggregating their outputs could be a promising solution [37,38] to create more compact and/or resilient networks. As a consequence, the mixture-of-experts architecture suggests a strategy to achieve the larger network capacity needed to solve large numbers of sub-problems, which is typical of the VidVRD domain.

Video Relationship Detection
Video visual relationship detection (VidVRD) is made up of multiple problems that must be resolved simultaneously: object recognition, subject recognition, and action recognition. Hence, the detection often takes the form of classifying a triplet in the form of a < subject, predicate, object > tuple for each detected subject-object pair in a video. When dealing with static images, this problem is comparatively straightforward as it involves detecting such a triplet only once, using language and spatial features [39][40][41][42][43]. In contrast, in video the problem becomes significantly more complicated as spatio-temporal features come into play, and relationships can change over time, cascading the difficulty of the problem [25,44,15].
In 2017, Shang et al. [25] proposed the first approach for VidVRD, decomposing a given video into segments and stitching relationship predictions in preceding and succeeding segments through a greedy association algorithm. They also introduced the first fully annotated video visual relationship dataset [25]. Later, Qian et al. [26] proposed tackling the problem using a fully connected spatio-temporal graph. Tsai et al. [24] also constructed a spatio-temporal graph, but utilized Conditional Random Fields to exploit the statistical dependency between objects. Liu et al. [27] proposed a sliding-window scheme to simultaneously predict short-term and long-term relationships using different kernel sizes on object tracklets to generate sub-tracklet proposals with different durations.
Xiao et al. [29] proposed to spatio-temporally localize the relations (predicates) in a video sequence. Their proposed approach, the Visual Relation Grounding in Videos (vRGV), first produces region proposals from each frame of a video and then learns to ground a pre-defined relation from two trajectories. A trajectory is created by connecting the consecutive bounding boxes linked to a visual entity (subject or object) across a video segment.
Wu et al. [45] proposed using two graph-based networks to predict the spatial-temporal relations (actions) between subjects and objects in videos. They applied a gated graph network together with a long short-term graph network to, respectively, extract spatial relations within video frames and multi-scale temporal relations between consecutive frames.
Gao et al. [46] proposed a tracklet-based visual transformer composed of a temporal-aware decoder, which performs feature interactions between tracklets and predicate embeddings for relationship detection. Zheng et al. [47] also proposed the VRDFormer, in which a first module encodes a video into a sequential frame-level feature map, and a second one processes the sequential feature map in order to generate the relation instances.
Li et al. [48] proposed a method to address the long-tailed bias in VidVRD datasets, which results in poor generalization. Their approach, the Interventional Video Relation Detection (IVRD), applies causality-inspired intervention on the model input to decrease the effect of the spurious correlation in the training data, and therefore to enhance the robustness of the output prediction.
Cao et al. [49] proposed using comprehensive semantic representations that are useful for knowledge transfer across relationships to solve the VidVRD problem. Their approach, the Concept-Enhanced Relation Network (CKERN), produces conceptually richer semantic representations of the detected object pairs, and then predicts the relationship based on the integration of multi-modal features.
Gao et al. [50] proposed a classification-then-grounding approach based on the temporal bipartite graphs of the videos, where the nodes are entities and predicates, and the edges denote different semantic roles between the nodes. Their proposed approach, the Bipartite Graph model (BIG), first classifies all of the nodes and edges of the graph (classification), and then localizes the temporal location of each relation instance (grounding).
Chen et al. [51] introduced a compositional encoding for VidVRD. Their proposed approach, the Social Fabric Encoding (SFE), encodes a pair of object "tubelets" as a composition of interaction primitives. Once these primitives are learned, the resulting representation is used to localize and classify relationships from co-occurring objects.
More recently, Shang et al. [15] modified their earlier approach [25], proposing an iterative relation inference that exploits the inter-dependency of relation components (subject/objects and predicates) for better visual relation recognition. To achieve this, they created three preferential predictors with learnable tensors alongside the normal visual predictors to model the inter-dependency relationship between subjects/objects and predicate classes [15]. Hence, each relational component has three classifiers, and each of them consists of a visual predictor and a preferential predictor. The visual predictor is a deep neural network for recognizing the visual patterns of subject/object and predicate, whereas the preferential predictor refines the prediction of one variable (subject, object, or predicate) conditioned on the values of the other two [15]. Following a similar architecture to their initial work [25], the authors take a sliding time window and generate object tracklet proposals as the detected entities, and then predict associated relation triplets.
Due to its lightweight network architecture, modularity, and state-of-the-art performance, we have chosen the work by Shang et al. [15] as the basis for the expert in the architecture proposed in this paper.

Mixture of Experts
Proposed more than three decades ago by Jacobs et al. [52], the mixture of experts (MoE) architecture has been applied to problems including the modeling of task relationships [53], increasing network breadth and depth [54], multi-modal generative models [55], and volunteer computing [56].
In 2017, Shazeer et al. proposed a new general purpose neural network component: the Sparsely-Gated Mixture-of-Experts (MoE) Layer [33], consisting of a number of experts, each a simple feed-forward neural network, together with a trainable gating network which selects a sparse subset of the experts to be trained on each given input [33]. The gating network essentially determines which of the experts are best suited to a given type of input. All parts of the network, both gating and experts, are trained jointly by back-propagation [33]. In their paper, Shazeer et al. applied their technique, illustrated in Figure 1, to language modeling and machine translation.
Riquelme et al. also proposed a Vision Mixture of Experts (V-MoE) [57] for image classification. V-MoE replaces a subset of the feed-forward layers in a vision transformer with sparse MoE layers, where each image patch is "routed" to a subset of "experts". This halves the computation required at inference while matching the performance of the state-of-the-art.
It has been observed [33,58,54] that a gating network is inclined to converge to a state where it produces large weights for the same few experts regardless of input, a phenomenon very much analogous to the problems encountered with self-organized maps [59] (essentially a very flat single-layer network) from pattern recognition. This imbalance becomes self-reinforcing / self-perpetuating, as the favored experts are trained more frequently / more rapidly and thus are even more likely to be selected by the gating network [33]. To address this problem, Shazeer et al. [33] defined the importance of an expert relative to a batch of training samples to be the batch-wise sum of the gate values for that expert.
In the following section we will present our proposed architecture, which applies a sparsely gated mixture of experts [33] to video visual relationship detection [15].

Architecture
Figure 3 illustrates our proposed architecture and its main components.

Sparsely-Gated Mixture of Experts
The MoE consists of a set of $N$ expert networks $E_1, \ldots, E_N$ and one gating network, $G$, whose output is a sparse binary $N$-dimensional vector. The experts are themselves identical feed-forward neural networks, each with their own parameters. Since our interest in this paper is the MoE concept, for the individual experts we adopt the baseline state-of-the-art approach of [15] shown in Figure 1.

MoE-VRD
Given an input $x$, the output of the $i$th expert's function is denoted as $E_i(x)$. These $N$ outputs are combined in the MoE layer as

$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x),$$

where $G(x)_i$ represents the output of the gating network. The sparsity in computation, one of the key strengths of the MoE approach, is realized by the explicit sparsity of the gating output. That is, whenever $G(x)_i = 0$ the corresponding expert $E_i$ is not invoked (the particular input $x$ is not fed forward into the network representing expert $E_i$).
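The conditional computation described above can be illustrated with a minimal Python sketch. It assumes each expert is a plain function from a feature vector to a score vector; the names `moe_combine` and `make_expert`, and the toy scaling experts, are hypothetical stand-ins and not part of the proposed implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def moe_combine(x: Vector, experts: Sequence[Expert], gates: Sequence[float]) -> Vector:
    """Return sum_i gates[i] * experts[i](x), skipping experts whose gate is zero."""
    if len(experts) != len(gates):
        raise ValueError("one gate value is required per expert")
    output: Vector = []
    for expert, gate in zip(experts, gates):
        if gate == 0.0:
            continue  # conditional computation: x is never fed to this expert
        scores = expert(x)
        if not output:
            output = [0.0] * len(scores)
        for j, score in enumerate(scores):
            output[j] += gate * score
    return output


calls: List[str] = []  # records which toy experts were actually invoked


def make_expert(name: str, scale: float) -> Expert:
    """Hypothetical stand-in for a relation predictor: it just scales its input."""
    def expert(x: Vector) -> Vector:
        calls.append(name)
        return [scale * v for v in x]
    return expert


experts = [make_expert("E1", 1.0), make_expert("E2", 2.0), make_expert("E3", 3.0)]
y = moe_combine([1.0, 2.0], experts, gates=[0.25, 0.0, 0.75])
print(y, calls)  # E2 has a zero gate, so it does not appear in calls
```

Only the experts with non-zero gate values are evaluated, which is why the cost per input depends on the number of selected experts rather than on the total number of experts.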
There is significant flexibility in the choice of gating function. In this paper we adopt the absolute simplest case, a single-layer gating function, with more capable multi-layer generalizations to be considered as future work. Following [33], the gating output $G(x)$ is computed as

$$G(x) = \mathrm{Softmax}\Big(\mathrm{top}_K\big(\{\, x \cdot W_g^i + N_g(x)_i \,\}_{i=1}^{N}\big)\Big),$$

where $\mathrm{top}_K$ selects the $K$ largest values (the best experts), $N_g(x)$ is a noise term, and $W_g^i$ and $W_n^i$ are trainable gating and noise weight matrices, respectively, which are parametrized for each expert $i$.
The number of samples routed to each expert is discrete, and therefore not amenable to back-propagation; however, the inclusion of the noise term $N_g(x)$ allows for a smooth estimate of the number of samples used for each expert in each batch, thus allowing for the back-propagation of gradients. Following [33], the noise function is defined as

$$N_g(x)_i = \epsilon_i \cdot \mathrm{Softplus}(x \cdot W_n^i), \qquad \epsilon_i \sim \mathcal{N}(0, 1),$$

where $\mathrm{Softplus}(x) = \frac{1}{\beta} \log(1 + e^{\beta x})$ is a smooth approximation of the ReLU function to constrain the output to be positive.
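As an illustration, noisy top-K gating in the style of Shazeer et al. [33] might be sketched as follows. This is a minimal pure-Python sketch under stated assumptions: each expert has one gating weight vector and one noise weight vector, the noise is standard normal scaled by a Softplus term, and the function name `noisy_top_k_gate` and the toy weights are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def softplus(z: float, beta: float = 1.0) -> float:
    """(1/beta) * log(1 + exp(beta * z)), computed in a numerically stable way."""
    bz = beta * z
    return (max(bz, 0.0) + math.log1p(math.exp(-abs(bz)))) / beta


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def noisy_top_k_gate(x: Sequence[float],
                     w_gate: Sequence[Sequence[float]],
                     w_noise: Sequence[Sequence[float]],
                     k: int,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Return a sparse gate vector: softmax over the k largest noisy logits.

    w_gate[i] and w_noise[i] are the (assumed) weight vectors of expert i.
    Passing rng=None disables the noise, e.g. at evaluation time.
    """
    n = len(w_gate)
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and the number of experts")
    logits = []
    for i in range(n):
        noise = 0.0
        if rng is not None:
            noise = rng.gauss(0.0, 1.0) * softplus(dot(x, w_noise[i]))
        logits.append(dot(x, w_gate[i]) + noise)
    top = sorted(range(n), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    gates = [0.0] * n  # experts outside the top k keep a gate of exactly zero
    for i in top:
        gates[i] = exps[i] / total
    return gates


x = [1.0, 0.0]
w_gate = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
w_noise = [[0.0, 0.0] for _ in range(4)]
gates_eval = noisy_top_k_gate(x, w_gate, w_noise, k=2)
gates_train = noisy_top_k_gate(x, w_gate, w_noise, k=2, rng=random.Random(0))
print(gates_eval, gates_train)
```

Without noise, the two experts with the largest logits share all of the gate mass; with noise, the selection can vary between calls, which is what spreads training across experts.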
An importance term is considered in the overall loss to address imbalances resulting from the self-reinforcing effect [60], which occurs when certain favoured experts are trained more rapidly and thus are selected even more by the gating network. The importance loss is

$$L_{\text{importance}}(x) = \alpha \, \mathrm{CV}(g) + \mathrm{CV}(l), \tag{5}$$

where $\alpha$ is a hand-tuned scaling factor, $g = \sum_{x \in B} G(x)$ is the batch-wise sum of gate values over batch $B$, and $l$ represents the load, summed over the positive gate values.
We applied CV(•), the coefficient of variation, as additional loss terms in (5) which encourage experts to have a more balanced (equal) importance [33].
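The balancing terms can be illustrated with a short sketch. It assumes the coefficient of variation is the population standard deviation divided by the mean, takes the load of an expert to be the number of samples in the batch with a positive gate value for that expert (an assumption; the text only states that the load is summed over the positive gate values), and combines the two terms as the expression in the text is written. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the mean (0 if the mean is 0)."""
    mu = mean(values)
    return pstdev(values) / mu if mu > 0 else 0.0


def balance_loss(batch_gates: Sequence[Sequence[float]], alpha: float) -> float:
    """alpha * CV(importance) + CV(load) for one batch of gate vectors."""
    n_experts = len(batch_gates[0])
    # Importance: batch-wise sum of the gate values of each expert.
    importance: List[float] = [sum(row[i] for row in batch_gates) for i in range(n_experts)]
    # Load (assumed definition): number of samples routed to each expert.
    load: List[int] = [sum(1 for row in batch_gates if row[i] > 0) for i in range(n_experts)]
    return alpha * coefficient_of_variation(importance) + coefficient_of_variation(load)


# Two samples, four experts, K = 2 experts selected per sample.
balanced = balance_loss([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]], alpha=0.1)
skewed = balance_loss([[0.7, 0.3, 0.0, 0.0], [0.6, 0.4, 0.0, 0.0]], alpha=0.1)
print(balanced, skewed)
```

When every expert receives the same total gate value and the same number of samples, both terms vanish; when the gate keeps choosing the same two experts, the loss grows and pushes the gating network toward a more even allocation.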

Visual relationship detection
To address the problem of visual relationship detection in video sequences, we take inspiration from VidVRD-II [15], which learns relationship detection using iterative inference shown in Figure 2.
VidVRD-II assumes a set of three entities $E = [e_1, e_2, e_3]$ representing subject $e_1$, predicate $e_2$ and object $e_3$, and their corresponding features $f_{e_1}, f_{e_2}, f_{e_3}$, which build the language triplet < subject, predicate, object >, and models the problem of video visual relationship detection as the joint probability

$$P(e_1, e_2, e_3 \mid f_{e_1}, f_{e_2}, f_{e_3}). \tag{9}$$

This joint probability can be factorized as three conditional probabilities

$$P(e_1 \mid f_{e_1}, e_2, e_3) \, P(e_2 \mid f_{e_2}, e_1, e_3) \, P(e_3 \mid f_{e_3}, e_1, e_2), \tag{10}$$

which aids in inference when there is ambiguous visual information, since the classes of any two components imply a preference over the class of the third.

Figure 2: Visual relationship detection framework proposed by [15], which is used as the basis of our expert.
Each conditional probability of (10) is modelled by a classifier consisting of a visual predictor and a preferential predictor, also shown in Figure 2. The visual predictor simply learns visual patterns of the subject $e_1$, predicate $e_2$, and object $e_3$, using a deep neural network. The preferential predictor applies learnable dependency tensors to refine the prediction of one variable conditioned on the values of the other two, as expressed in (11). Here $e^{pr}$ is the conditional probability vector of three entities, and $V = [V_{e_1}, V_{e_2}, V_{e_3}]$ are the learnable weights of the visual predictors. In our case study, the weights of the subject and object classifiers are shared, thus $V_{e_1} = V_{e_3}$. The paper of [15] applied the nonlinearity $\Phi$ to the entire expression of (11). However, actual implementations applied $\Phi$ to the first term only, as in (11), a convention which we have preserved for consistency. The weights $[W_{e_1}, W_{e_2}, W_{e_3}]$ model the dependency of one class over the other two, separately parametrized for each classifier. $\Phi$ represents the nonlinear activation, here implemented by a Softmax function for the subject and object classes, and a Sigmoid function for the predicate class.
To design our architecture, we apply the MoE concept of Section 3.1 by incorporating VidVRD-II into the MoE framework, giving rise to the combined MoE-VRD shown in Figure 3. The object tracklet proposals are extracted and fed into the relation prediction experts. Based on the gating layer, the top $K$ experts are chosen. Upon the completion of the forward pass, back-propagation is applied so the gradients back-propagate through the gating network and the selected experts.

Object Tracklet Proposals
We use Seq-NMS [7] to generate object tracklet proposals as a pre-processing step to use as inputs to the relational classifier experts.For frame-level object detection, a Faster-RCNN with an Inception-ResNet foundation [61] is pretrained on the Open Images dataset [43].
The model serves as a suitably generic object detector [61,15].The bounding boxes and corresponding region features are extracted, after which Seq-NMS generates a compact set of object tracklets, which form the inputs to the expert neural networks.

Feature Extraction
Applying the object tracklet proposals, we generate two types of features: Visual Features and Relative Positional Features, shown in Figure 3.
To generate the visual features $f$ of (9)-(10), the bounding boxes are applied to extract the pretrained deep visual features of the subject and object entities, and the predicate's visual feature is computed through a concatenation of the subject and object visual feature vectors.

In addition to the visual features, we extract a relative positional feature to represent the spatio-temporal relationship between the entities. For each pair of object tracklets, the algorithm computes the relative distance between the subject and object by encoding the spatial and temporal relative positional feature $f_r^p$, where $p \in \{b, e\}$ represents the beginning or ending bounding box, characterized by coordinates $(x, y)$, width $w$, height $h$, and time $t$ for subject $e_1$ and object $e_3$. A feed-forward network is used to fuse the subject's and object's visual features $f_{e_1}, f_{e_3}$ with the relative positional features of the beginning and ending bounding boxes $f_r^b, f_r^e$, where the relative positional feature $f_r^p$ provides the expert with additional information to recognize visual relationships. In summary, each encapsulated expert consists of an object predictor, a subject predictor, and a predicate predictor, each of which is a basic feed-forward network, allowing for a set of modestly-sized, nimble experts to speed up training and inference, when compared to an equivalent single monolithic network.

Experiments and Results
In order to properly assess the improvements offered by our proposed framework and to make a fair comparison with the state-of-the-art, we conduct our experiments using a similar experimental setup and the same datasets as used by the iterative inference approach proposed by Shang et al. [15].

Evaluation metrics
In object detection there are two related problems to be solved: localization and classification.Localization determines the location of an object (e.g., its bounding box), whereas classification infers the object's identity.
For object detection tasks, it is standard to calculate precision and recall based on a given threshold on the IoU (Intersection over Union), which measures the fractional overlap between predicted and true bounding boxes.If the IoU result for a predicted bounding box exceeds the threshold, then the prediction is classified as a true positive, otherwise it is a false positive.
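The thresholding rule above can be sketched in a few lines of Python. The sketch assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function names and the example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(predicted: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a true positive when its IoU exceeds the threshold."""
    return box_iou(predicted, truth) > threshold


truth = (0.0, 0.0, 10.0, 10.0)
loose = (5.0, 0.0, 15.0, 10.0)   # overlaps half of the true box
tight = (1.0, 0.0, 11.0, 10.0)   # overlaps most of the true box
print(box_iou(loose, truth), is_true_positive(loose, truth))
print(box_iou(tight, truth), is_true_positive(tight, truth))
```

The first prediction shares half of its area with the ground truth but reaches an IoU of only one third, so it is a false positive at a 50% threshold; the second exceeds the threshold and is a true positive.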
Metrics such as Recall@50 are then defined over the ranked list of predictions: Recall@50 is the fraction of ground-truth instances recovered among the 50 highest-scoring predictions, with a prediction counted as correct under the overlap criterion above. Other metrics, such as Recall@100, Precision@10, etc., are defined equivalently.
We designed and implemented our evaluation metrics consistent with those in previous works [25,62,15].To this end, we calculate how many ground truth relation instances are detected by the mixture of experts in each testing video.Metrics are divided into two categories: relation tagging and relation detection.
Relation tagging focuses on the precision of the relation triplet without considering the precision of its spatio-temporal location in the video.In other words, relation tagging simply checks whether the relationship was detected properly, but not whether it was detected with any accuracy in time or space.The tagging performance is evaluated by Precision@1 (P@1), Precision@5 (P@5), and Precision@10 (P@10).
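A minimal sketch of the tagging metric follows, assuming that Precision@K is the fraction of the K highest-scoring predicted triplets that appear among the ground-truth triplets of a video, with spatio-temporal location ignored. The function name, the choice to divide by K even when fewer than K predictions exist, and the example triplets are assumptions for illustration.

```python
from typing import Sequence, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def precision_at_k(ranked: Sequence[Triplet], ground_truth: Set[Triplet], k: int) -> float:
    """Fraction of the top-k ranked triplets found in the ground truth."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for triplet in ranked[:k] if triplet in ground_truth)
    return hits / k


# Predictions sorted by decreasing confidence (hypothetical example).
ranked = [
    ("dog", "chase", "frisbee"),
    ("dog", "bite", "person"),
    ("person", "watch", "dog"),
    ("person", "throw", "frisbee"),
    ("dog", "sit_on", "sofa"),
]
ground_truth = {
    ("dog", "chase", "frisbee"),
    ("person", "watch", "dog"),
    ("person", "throw", "frisbee"),
}
p_at_1 = precision_at_k(ranked, ground_truth, 1)
p_at_5 = precision_at_k(ranked, ground_truth, 5)
print(p_at_1, p_at_5)
```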
In contrast, relation detection measures the precision of both the relation triplet and the corresponding subject/object trajectories for every detected relation instance [25]. A detected relation instance is considered to be correct only if it matches the ground truth relation instance, and the voluminal Intersection-over-Union (vIoU) of the trajectories of subject and object are both larger than a pre-defined threshold. The vIoU threshold is set to 50% in order to address the objectives of our experiments. The detection performance is measured using the Mean Average Precision (mAP), Recall@50 (R@50), and Recall@100 (R@100).
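The detection criterion can be sketched as follows. The sketch assumes a trajectory is a mapping from frame index to an axis-aligned box, and that the vIoU is the summed per-frame intersection area divided by the summed union (the two trajectory volumes minus the intersection); the names `Relation`, `trajectory_viou` and `relation_is_correct` are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Trajectory = Dict[int, Box]               # frame index -> bounding box


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h


def trajectory_viou(a: Trajectory, b: Trajectory) -> float:
    """Voluminal IoU: intersection and union are accumulated over frames."""
    shared_frames = a.keys() & b.keys()
    inter = sum(intersection_area(a[t], b[t]) for t in shared_frames)
    union = sum(map(area, a.values())) + sum(map(area, b.values())) - inter
    return inter / union if union > 0 else 0.0


class Relation(NamedTuple):
    triplet: Tuple[str, str, str]
    subject: Trajectory
    object: Trajectory


def relation_is_correct(pred: Relation, truth: Relation, threshold: float = 0.5) -> bool:
    """Triplet must match and both trajectories must exceed the vIoU threshold."""
    return (pred.triplet == truth.triplet
            and trajectory_viou(pred.subject, truth.subject) > threshold
            and trajectory_viou(pred.object, truth.object) > threshold)


box = (0.0, 0.0, 10.0, 10.0)
early = {t: box for t in range(0, 4)}   # frames 0..3
late = {t: box for t in range(2, 6)}    # frames 2..5, overlapping on frames 2..3
viou = trajectory_viou(early, late)

truth = Relation(("person", "ride", "bicycle"), early, early)
shifted = Relation(("person", "ride", "bicycle"), late, early)
print(viou, relation_is_correct(truth, truth), relation_is_correct(shifted, truth))
```

Two trajectories that coincide spatially but overlap in only half of their frames reach a vIoU of one third, so a relation built on the temporally shifted subject trajectory is rejected at the 50% threshold even though its triplet is right.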
All experiments are run ten times with different random seeds for each expert, and the mean plus/minus standard deviation scores are reported.

Single Expert Performance
To ensure the proper functioning of the proposed mixture of experts, clearly we must first validate the performance of a single expert. The performance of the MoE framework consisting of only a single expert ($N = 1$, and thus necessarily $K = 1$) during training and testing should be essentially unchanged from the published performance of the underlying expert [25,15].

For each given image frame, the subject and object tracklets are extracted and given to the feature extraction network together with bounding box information in order to generate visual and relative positional features representing all three entities: subject, predicate and object. The visual and positional features are applied as the input to our experts and gating networks. Every expert outputs a score corresponding to each entity, which represents both visual and preferential predictions. The gating network outputs a sparsely gated vector, which evaluates each expert's learning. Selecting the top $K$ experts, the sum-product of the sparsely gated expert scores is calculated and represented as the output of our MoE-VRD architecture.

Table 1: Comparison of a single expert with the method of Shang et al. [15] on the ImageNet-VidVRD Dataset [25] (top) and on the VidOR Dataset [62] (bottom). In both cases, the expert performs essentially equivalently to that of [15].

Table 1 columns (ImageNet-VidVRD Dataset): Relation detection (mAP, R@50, R@100); Relation tagging (P@1, P@5, P@10).

In the multi-expert architecture, at every iteration the gating layer outputs a sparse vector selecting $K$ experts. If there is only a single expert, then the gating layer is essentially irrelevant, simply selecting the same one expert every time, and the resulting performance should be the same as if we had simply run the model built by Shang et al. without any adjustments [15].
Table 1 compares the relation detection and relation tagging results of Shang et al.'s VidVRD-II [15] with those of our single-expert ($N = 1$) MoE-VRD architecture on the ImageNet-VidVRD and VidOR datasets. Both approaches perform quite similarly, validating that the MoE-VRD framework is not interfering with the operation of the underlying expert, and allowing us to generalize to multiple experts next.
Table 2: Performance of our proposed MoE-VRD with $K = 2$ and a total of $N = 10$ experts, in comparison with state-of-the-art approaches on the ImageNet-VidVRD dataset [25]. For every criterion the proposed MoE-VRD outperforms all other approaches. The substantial increase in performance stems unambiguously from the mixture-of-experts approach, since our expert on its own is no better than the method in VidVRD-II, as was shown in Table 1. "−" indicates that no corresponding results were reported.

Multi-expert Performance
We now wish to evaluate the performance of our proposed MoE-VRD architecture when more than one expert ($N > 1$) is at play. We evaluate our approach on the VidOR dataset [62] and the ImageNet-VidVRD dataset [25].
We evaluate our work against recent representative approaches:
• VidVRD-II [15], which builds upon the same authors' work in [25]. It uses an iterative inference approach to video relationship detection.
• GSTEG [24], which constructs a fully-connected spatio-temporal graph for relation inference.
• VRD-GCN [26], which builds a model that can take advantage of spatial-temporal contextual cues to make better predictions on objects as well as their dynamic relationships.
• VRD-STGC [27], which proposes a novel sliding-window scheme to simultaneously predict short-term and long-term relationships [27], and extracts spatio-temporal features.
• 3DRN [65], which develops a 3-D CNN to learn the visual features for relation recognition in an end-to-end manner.
• IVRD [48], which proposes a causality-inspired intervention on the model input to improve prediction robustness.
• CKERN [49], which generates comprehensive semantic representations by incorporating retrieved concepts with local semantics.
• BIG [50], which proposes a classification-then-grounding approach based on temporal bipartite graphs.
• SFE [51], which proposes encoding the representation of a pair of objects as a composition of interaction primitives.

Note that performance drops after $K = 2$, due to the averaging nature of the architecture before the final output: well-performing experts may become drowned out by more poorly performing peers if $K$ is set too large.
For the ImageNet-VidVRD dataset [25] we compare to all ten of these methods; for the VidOR dataset [62] we compare to eight of the preceding methods, due to the choice of results reported in the respective papers.
Table 2 shows the results, comparing our proposed MoE architecture with all ten other methods on the ImageNet-VidVRD dataset [25]. The proposed MoE-VRD performs significantly better, in every metric, than any method tested, including the most recent state of the art. The large margin of improvement stems from having a gating function that allows experts to be trained quite separately on different sorts of inputs, leading to a degree of robustness due to heterogeneity, which is very difficult for single large monolithic networks to achieve.
Similar to Table 2, Table 3 now shows the comparative results on the VidOR dataset [62]. Our proposed MoE-VRD still exhibits superior performance in every metric when compared to most of the state-of-the-art approaches, although by a lesser margin than in Table 2, likely due to the increased diversity of the VidOR dataset [62], and the related naivety or limitation of the MoE-VRD in using a set of identical experts. The creation of heterogeneous or differently-specialized experts is a subject for future research.
BIG [50], Ens-5 [46], and SFE [51] do outperform the proposed MoE-VRD for one or more metrics in Table 3, although for the Relation Detection assessment the MoE-VRD is highly competitive, outperforming BIG and Ens-5. In any event, a universal improvement on every possible dataset and/or metric is not to be expected; for example, the impressive results of SFE in Table 3 contrast with its weaker performance, relative to MoE-VRD, in Table 2.

Ablation study
The only aspect of the proposed architecture which can meaningfully be removed, via ablation, is the collection of experts.
That is, our ablation study assesses performance as a function of the number of "top experts" K chosen by the gating function for each input. The total number of experts was fixed to N = 10: N > K needs to be large enough to test a meaningful range of K, while increasing N far past 10 leads to challenges in network memory requirements and training reliability. The results of the multi-expert experiments are presented in Figure 4, plotting mAP as a function of K, ablating K from 6 down to 1.
The best MoE results are achieved when we select the two top experts (K = 2) for each input, with performance dropping as K increases for both datasets. Note that a low optimum value of K is an asset, not a liability, in that a small K implies a modest computational complexity, since only K experts are actually engaged for any given input.
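The selection of K out of N experts can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-layer linear gate (as described in the concluding section), a softmax restricted to the K highest gate logits, and toy linear experts standing in for the full relationship-detection networks. All names (`top_k_gate`, `moe_forward`, `make_linear_expert`) and dimensions are hypothetical.

```python
import math
import random


def top_k_gate(x, w_gate, k):
    """Pick the k experts with the largest gate logits and normalize their weights.

    w_gate holds one weight vector per expert (assumed single-layer linear gate).
    """
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_gate]
    top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only; every other expert gets weight zero.
    peak = max(logits[i] for i in top_idx)
    exps = [math.exp(logits[i] - peak) for i in top_idx]
    total = sum(exps)
    return top_idx, [e / total for e in exps]


def moe_forward(x, w_gate, experts, k):
    """Evaluate only the selected experts and return their gate-weighted sum."""
    top_idx, weights = top_k_gate(x, w_gate, k)
    outputs = [experts[i](x) for i in top_idx]  # N - k experts are never called
    n_out = len(outputs[0])
    y = [sum(w * out[j] for w, out in zip(weights, outputs)) for j in range(n_out)]
    return y, top_idx, weights


def make_linear_expert(rng, d, n_out):
    """Toy stand-in for an expert: a random linear map from features to scores."""
    w = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_out)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


rng = random.Random(0)
d, n_experts, n_out, k = 8, 10, 3, 2  # N = 10 and K = 2 as in the ablation
w_gate = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
experts = [make_linear_expert(rng, d, n_out) for _ in range(n_experts)]
x = [rng.gauss(0, 1) for _ in range(d)]

scores, chosen, gate_weights = moe_forward(x, w_gate, experts, k)
print("selected experts:", chosen)
print("gate weights:", gate_weights)
```

In this sketch the per-input cost scales with K rather than N, because the list comprehension over `top_idx` is the only place an expert is evaluated.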

Discussion & Conclusions
The problem of video-based visual relationship detection (Vid-VRD) is relatively new compared to static image-based visual relationship detection. The spatio-temporal dimensions in the video domain compound the difficulty of the problem, given the far greater data volumes and the ability for relationships to change over time. There have been a few approaches to address this problem; however, they uniformly rely on monolithic neural networks [24,26,44].
In this work, we developed a new framework, the MoE-VRD, based on a mixture of experts approach. MoE-VRD is developed by encapsulating a Vid-VRD framework [15] into an expert within a sparsely gated mixture of experts architecture.
Our proposed approach to video visual relationship detection also addresses limitations in computing power and distributed computation, which arise from the limited capacity of neural networks to absorb information, itself a consequence of the limited network size (number of parameters) of comparatively blunt architectures based on a single, monolithic network.
We have observed that the performance of the network drops when we select more than two experts (K > 2). We believe that this stems from the averaging operation, which acts prior to the final output, resulting in well-performing experts being increasingly drowned out by those experts having inferior performance. Studying this effect more carefully, and in other settings, is one subject for future work.
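The dilution argument can be written out under a simplifying assumption. The notation is introduced here for illustration only and is not taken from the paper: $\mathcal{T}_K(x)$ denotes the set of K experts selected for input $x$, $s_i(x)$ the score of expert $i$, and $g_i(x)$ its normalized gate weight.

```latex
\begin{align}
\hat{s}(x) &= \sum_{i \in \mathcal{T}_K(x)} g_i(x)\, s_i(x),
  & \sum_{i \in \mathcal{T}_K(x)} g_i(x) &= 1, \\
\hat{s}(x) &\approx \frac{1}{K} \sum_{i \in \mathcal{T}_K(x)} s_i(x)
  & \text{when } g_i(x) &\approx \tfrac{1}{K}.
\end{align}
```

If the gate weights are close to uniform, each additional lower-ranked expert receives roughly the same 1/K share as the best one, so the combined score drifts toward the plain mean of the selected experts as K grows. Whether the gate weights are in fact near-uniform in MoE-VRD is not established here.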
We achieved highly promising results from MoE-VRD based on experiments on two different datasets, ImageNet-VRD [25] and VidOR [62]. Our MoE-VRD outperforms nearly all state-of-the-art approaches in most metrics, and outperforms every tested approach on the ImageNet-VRD dataset.
The proposed approach in this paper is perhaps still naive, as all of the experts are tackling the same problem. In principle one could imagine dividing the problem into smaller subproblems (with certain experts only aimed at subject/object recognition, for example), addressing multi-modal datasets, or exploring a hierarchical MoE architecture, in which a primary gating network chooses a combination of experts, each of which is itself a secondary or tertiary MoE with its own respective set of experts and gating network [33].
Finally, almost certainly the gating network itself would benefit from further scrutiny. The gating network of this paper is the simplest possible choice, a single-layer feed-forward network, taking as input the same spatio-temporal object-tracklet features as are being provided to the experts. In a sense, it would seem that too much is being asked of the gating function, to go all the way from low-level input to expert-assessment output, such that the gating function tacitly must emulate or reproduce certain elements of expert behaviour. It would seem preferable to have the gating function operate at a higher, more abstracted level, and to have certain aspects of expert assessment made the responsibility of the expert networks themselves.

Figure 3: An illustration of the MoE-VRD architecture proposed in this article. Raw RGB images are taken as input; for each given image frame the subject and object tracklets are extracted and given to the feature extraction network together with bounding box information in order to generate visual and relative positional features representing all three entities: subject, predicate and object. The visual and positional features are applied as the input to our experts and gating networks. Every expert outputs a score corresponding to each entity, which represents both visual and preferential predictions. The gating network outputs a sparsely gated vector, which evaluates each expert's learning. Selecting the top K experts, the sum-product of the sparsely gated expert scores is calculated and represented as the output of our MoE-VRD architecture.

Figure 4: mAP of the MoE-VRD approach having N = 10 experts, as a function of K during training. Note that performance drops after K = 2, due to the averaging nature of the architecture before the final output, such that well-performing experts may become drowned out by more poorly performing peers if K is set too large.