Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning

Traditional video captioning produces a holistic description of a video, so detailed descriptions of specific objects may not be available. Moreover, most methods train on frame-level features that entangle all objects, together with ambiguous descriptions, which makes the vision-language relationships difficult to learn. Without identifying the classes and locations of individual objects, or associating the transition trajectories among them, these data-driven image-based video captioning methods cannot reason about activities from visual features alone, let alone perform well with small samples. We propose a novel task, object-oriented video captioning, which focuses on understanding videos at the object level. We re-annotate an object-oriented video captioning dataset (Object-Oriented Captions) with object-sentence pairs to facilitate more effective cross-modal learning. We then design a video-based structured trajectory network trained via adversarial learning (STraNet) to effectively analyze activities along the time domain and proactively capture vision-language connections on small datasets. The proposed STraNet consists of four components: the structured trajectory representation, the attribute explorer, the attribute-enhanced caption generator, and the adversarial discriminator. The high-level structured trajectory representation usefully supplements previous image-based approaches, allowing the model to reason about activities from the temporal evolution of visual features and the dynamic movement of spatial locations. The attribute explorer captures discriminative features among different objects, with which the subsequent caption generator can produce more informative and accurate descriptions. Finally, by adding an adversarial discriminator on top of the caption generation task, we improve the learning of the relationships between visual contents and the corresponding visual words.
To demonstrate its effectiveness, we evaluate the proposed method on the new dataset and compare it with state-of-the-art video captioning methods. The experimental results show that STraNet can precisely describe concurrent objects and their activities in detail.


I. INTRODUCTION
With the rapid growth of videos on the Internet, automatically understanding them has become increasingly important. Video captioning, which calls for a systematic description of a video, poses an intriguing challenge of learning the connections between vision and natural language [1]-[8], [10], [12]-[18], [28]-[34].
Previously, video captioning offered a holistic description of the entire video with little detailed information about individual objects; this kind of video captioning cannot provide information on concurrent objects and activities either. Actually, when humans watch videos, instead of focusing on the entire video or getting a description for an uncertain object, we prefer to reason about detailed information of specific objects and their associated attributes or actions, depending on our interests. Therefore, compared to giving a coarse description of the entire video or of an arbitrary object, it is more meaningful to distinguish different objects and provide detailed descriptions for each object in the video.

FIGURE 1. In natural videos, there are multiple concurrent objects and activities. We propose object-oriented video captioning aimed at understanding videos at the object level, which can describe concurrent objects and activities in more detail.
So far, most video captioning works are frame-level approaches built on the encoder-decoder structure and use video-sentence pairs for training [1]-[7], [12], [14]-[17], [19]. Each video is annotated with multiple sentences covering various objects. However, with frame-level features that are entangled among all the objects, and with ambiguous descriptions, it is difficult for the captioning system to find the paired objects and sentences. Therefore, when multiple objects exist, most previous methods can neither accurately learn the connections between vision and text nor generate precise, detailed descriptions for different objects. These methods rely on large amounts of labeled data to learn the vision-language relationships. In reality, sufficient training data are not always available in many application scenarios, so the applicability of these data-driven methods is quite limited.
Moreover, in previous works the features of each frame are extracted independently, without explicit temporal associations of objects across consecutive frames. Therefore, besides the visual features, most works adopt models from action recognition [20]-[22] to extract spatio-temporal 3D convolutional (C3D) features as temporal cues [12], [15], [16], [18], [23]. Nonetheless, most action recognition datasets contain only human actions, whereas in video captioning datasets the main objects are sometimes not humans, e.g., animals or vehicles. Moreover, videos for action recognition usually contain a single acting object, whereas most videos for video captioning contain multiple concurrent objects, each with individual or interactive activities. Hence, using pre-trained action recognition models to extract temporal information from video captioning data brings two problems: first, they cannot extract effective features when multiple objects are present; second, they cannot capture the actions of non-human objects. Besides, the C3D features and the visual features are usually extracted independently, so the two kinds of features fed into a recurrent neural network (RNN) at the same time step often represent information from different frames, which confuses the training of the network.
To deepen the research, we propose a novel task, named object-oriented video captioning, which transforms frame-level video captioning into object-level video captioning. We further design an object-level structured trajectory network (STraNet) via adversarial learning, which achieves promising performance with small samples and adapts to more application scenarios. The main contributions of our work are three-fold.

A. OBJECT-ORIENTED VIDEO CAPTIONING
We shift holistic frame-level video captioning to object-level video captioning. Instead of a coarse, holistic description of the entire video or of an arbitrary object, we aim at understanding the video at the object level, which is closer to the human experience of watching videos. Understanding activities at the object level leads to a greater understanding of the video contents and deserves more attention.

B. VIDEO-BASED STRUCTURED TRAJECTORY NETWORK VIA ADVERSARIAL LEARNING
We design the object-oriented structured trajectory network (STraNet) via adversarial learning to replace previous data-driven image-based methods. Since actions are continuous in the time domain, we focus on understanding the rules and the temporal evolution within trajectories, avoiding the need to train on large amounts of data. We first design the structured trajectory representation, with which we can reason about activities without supplementary cues from other tasks. The structured trajectory representation improves the model's ability to understand activities from the temporal evolution of visual features, allowing us to move toward real video understanding. Second, to capture more discriminative features among different objects and explore precise attributes, we design the attribute explorer and the attribute-enhanced caption generator. Finally, we adapt the adversarial learning scheme to the caption generation task to help bridge the visual contents and the corresponding visual words.

C. OBJECT-ORIENTED VIDEO CAPTIONING DATASET
All available datasets for classical video captioning have only video-sentence pairs. With such uncertain one-to-many video-sentence pairs, it is difficult to learn the vision-language connections. We construct a new dataset, Object-Oriented Captions, with consistent object-sentence pairs. With this kind of data, the vision-language functional and translational relationships can be learned more effectively.
We perform experiments on the new dataset and compare with state-of-the-art video captioning methods.
The experimental results demonstrate that our approach achieves superior performance in terms of the BLEU@4, METEOR, CIDEr and ROUGE-L metrics [24]-[27]. More importantly, our method can describe most concurrent activities in detail rather than giving a single holistic description.
The paper is organized as follows. Section 2 reviews recent related works. Section 3 describes our STraNet in detail. Section 4 first introduces the re-annotated dataset, then reports the experimental results and the ablation study. Finally, Section 5 presents the conclusions and future work.

II. RELATED WORKS
In this section, we review related works in the context of video captioning, as well as adversarial learning.

A. VIDEO CAPTIONING
Video captioning, which bridges two modalities, vision and language, poses a great challenge for artificial intelligence. Recently, a large number of methods have been proposed for video captioning, where encoder-decoder architectures have been widely adopted [6]-[8], [10], [12]-[18], [28]-[34]. The encoder learns a high-level representation of the video, then the decoder generates the description word-by-word from that representation. The encoder is commonly a pre-trained convolutional neural network (CNN), e.g., VGG19 [35], Inception v4 [36] or DenseNet [37]. The decoder usually adopts a recurrent neural network (RNN), e.g., a long short-term memory (LSTM) [31] or gated recurrent unit (GRU) [38]. With the development of attention mechanisms, Xu et al. and Liu et al. adopt spatial attention to automatically exploit the impact of different regions in each frame [6], [8], [10]. Many works focus on exploring temporal information to select the key frames for the current word generation [15], [30], [39]. Considering that most works rely on the forward flow (video-to-sentence), Wang et al. draw on the idea of dual learning and propose RecNet [7] to exploit the backward flow (sentence-to-video). On top of the captioning system, RecNet stacks another module that reconstructs the visual features.
So far, most image-based captioning methods utilize entangled frame-level features, contributed by all objects, extracted via models pre-trained for image classification [1]-[7], [12], [14]-[18], [23], [39], [40]. The holistic features of different frames are extracted independently, without temporal association of objects, making it difficult to understand activities over time. To explore moving trajectories, some works introduce tracking into video captioning. Zhang et al. build object-aware aggregation with a bidirectional temporal graph (OA-BTG) [19] to track salient objects in the video. Nonetheless, the tracking is not precise enough, since OA-BTG only tracks a limited set of pre-selected salient objects in each frame, which may produce incorrect trajectories. In addition, it performs tracking on sampled frames rather than the entire video, causing further mistakes. Moreover, it merges all the objects by mixing their visual features, without associating the trajectories, before feeding them to the caption generator. Therefore, the mixed features still act only as a complement to the frame-level features in OA-BTG.
In addition, natural videos usually contain several concurrent events, and it is difficult to convey much information with only one sentence. To capture multiple activities, Krishna et al. [18] propose dense video captioning, which bridges two separate tasks: temporal action localization and video captioning. Dense video captioning locates a set of clips where events happen and describes the predicted clips. Although it can describe multiple events, it is still inadequate for analyzing specific object instances. On short clips, dense video captioning behaves the same as traditional video captioning.

B. ADVERSARIAL LEARNING
The generative adversarial network (GAN) was proposed in [41] as a scheme for improving the performance of generative models. In a GAN, two models are trained simultaneously: a generative model G that captures the data distribution, and a discriminative model D that distinguishes whether a sample comes from the training data or from G. The training procedure for G maximizes the probability of D making a mistake, which amounts to a minimax two-player game. To further control the mode of the generated data, conditional generative adversarial nets (CGANs) were proposed in [42]. With conditional labels added, the scheme has been applied to more tasks, e.g., image classification and multi-modal learning. Since then, adversarial learning has drawn significant attention, with applications in image-to-image translation [43]-[47], text-to-image synthesis [48], domain adaptation [49], medical image generation [32], [50], and so on.
Yang et al. [33] propose LSTM-GAN, which introduces adversarial learning into video captioning. LSTM-GAN directly adopts the general adversarial framework to decide whether generated sentences are real or fake, without tailoring the discriminator to the caption generation task. However, the grammar of a generated caption is hard to judge because of the diversity of natural language. Moreover, generated language lives in a discrete space, unlike image pixels in a continuous space, which makes training the discriminator difficult.
Overall, previous video captioning methods fail to provide sufficient, detailed information for object-level analyses. Although significant improvements have been achieved, most methods are data-driven and image-based, and cannot reason about the step-by-step, object-level activities over time.

III. ARCHITECTURE
As illustrated in Fig. 2, our STraNet consists of four modules. (1) Structured trajectory representing: for each input video, we first obtain a set of object trajectories, then extract the global and local features of each object at every frame where the object occurs; next, to represent each object, we construct a high-level structured trajectory representation along the time domain. (2) Attribute exploring: the local features from module (1) are fed to the attribute explorer, where the attribute scores of the objects are used to automatically select more discriminative features of different objects during backpropagation. (3) Attribute-enhanced caption generating: based on the attribute scores from module (2), we generate attribute-enhanced masks to dynamically reduce the vocabulary space; then, with the object-oriented structured trajectory representation from module (1), the final word-by-word descriptions of the objects are generated. (4) Adversarial learning: we further design an adversarial discriminator that helps decide whether the generated descriptions are accurate for the given visual contents. In the following subsections, we address these four modules in detail.

FIGURE 2. STraNet consists of four modules: the first builds the high-level structured trajectory representation; the second explores the detailed attributes of the current objects; the third yields the descriptions word-by-word via the attribute-enhanced caption generator; finally, the fourth feeds the obtained descriptions together with the object visual features into the adversarial discriminator to further bridge the vision-language connections and improve the overall performance. The discriminator is only used during training; the other three modules are used in both training and inference.

A. STRUCTURED TRAJECTORY REPRESENTING
In this subsection, we present how to build the object-oriented structured trajectory representation. With the advances of object-level visual analysis, many excellent works have emerged, such as Faster R-CNN, YOLO and Mask R-CNN for object detection, and Tracktor and TrackletNet Tracker (TNT) for multi-object tracking [51]-[55]. Recently, Yang et al. proposed MaskTrackRCNN, built upon Mask R-CNN, to perform object detection, instance segmentation and object tracking at the same time. MaskTrackRCNN is trained on natural videos from YouTube, which are similar to our training data, whereas most other tracking methods are mainly trained on video surveillance or autonomous driving scenes. Therefore, adopting MaskTrackRCNN as a combined detector-tracker in our work has two salient advantages: (1) it solves the three tasks above at the same time, so we do not need to handle each task separately; (2) it is trained on videos whose scenes are similar to our scenarios. This module consists of three steps: first, the detection and tracking results are obtained by MaskTrackRCNN; second, with these results, the local and global features of the corresponding objects at different frames are extracted; third, the high-level representation via the structured trajectory is built for each object in the video. We introduce the three steps in turn below.
First, given a T-frame video V = {v_1, ..., v_T}, MaskTrackRCNN is adopted to obtain the trajectories of all objects:

  {B_o}_{o=1}^N = MaskTrackRCNN(V),  B_o = {b_o^t | t ∈ T_o},   (1)

where N is the number of detected objects, b_o^t is the bounding box of object o at frame t, and T_o is the set of frames in which o appears. Second, for each object trajectory, we extract its local and global features with F(·), a pre-trained neural network for semantic visual features. We feed each cropped object region into F(·) to obtain its local visual features:

  ψ_o^t = F(crop(v_t, b_o^t)).   (2)

The features from the top layers of a neural network contain rich semantic information but little detailed appearance information, such as color and texture. However, color is important for distinguishing different individuals; therefore, we further combine color histograms into the local features. Finally, the local feature φ_{l,o}^t of object o at time t consists of two components, the visual features ψ_o^t from the neural network and the color histogram vector c_o^t:

  φ_{l,o}^t = [ψ_o^t, c_o^t].   (3)

To incorporate the interactions of a tracked object with other objects and the background, we again adopt F(·) to extract the visual features of every frame where the object occurs as the global features:

  φ_{g,o}^t = F(v_t).   (4)

Our method is quite different from works that directly adopt frame-level features. In previous works, the activities are learned mainly from frame-level features, which entangle all objects and the background. In our work, with foreground objects and background separated, the activities are learned by analyzing the local features of each object over time, while the global features supplement the interactions and the background. Some works, e.g., the Fine-Grained Spatial Temporal Attention model (FSTA) [6], also use the results from Mask R-CNN [54] to extract foreground objects, but completely ignore the background.
In fact, background makes up the majority of our visual surroundings, e.g., road, sky, grass, beach, buildings, and is useful for inferring the positions and orientations of objects, object-object interactions and object-background interactions. Therefore, background is crucial for scene understanding. Besides, some works adopt C3D features as a supplement of local information for learning activities [12], [15], [16], [18], [23]. However, the global features and the C3D features fed jointly into an RNN at the same time step usually represent information from different periods of the video. For this reason, the model cannot effectively discover the relationships between the global information and the C3D features. In our work, we extract the features of each frame where the object exists as global features, which are paired with the local features of the detected objects, so the network can learn the inherent relationships between local and global features more effectively. To better exploit the connections between objects and the background, we combine the spatial locations b_o^t of the objects with the joint local and global features φ_o^t = [φ_{l,o}^t, φ_{g,o}^t]. As shown in Fig. 3, our paired local and global features are more informative for learning the interactions between different objects, and between objects and the background.
Based on the results of the previous step, we further build the high-level structured trajectory representation H_o^t for each object at time t, which consists of the combined feature φ_o^t and the spatial location b_o^t, i.e., H_o^t = [φ_o^t, b_o^t]. FSTA [6] and OA-BTG [19] also use a detector to obtain effective object features, but without incorporating the spatial information. In fact, the temporal evolution of the spatial locations associated with an object helps greatly in understanding activities such as jumping, walking or squatting. Moreover, combining the spatial transition also helps in learning activities under limited data.
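As a minimal NumPy sketch, the per-frame structured trajectory representation can be assembled by simple concatenation. The feature dimensions follow the paper's implementation details (4144-dim local and 4096-dim global features); normalizing the box coordinates to [0, 1] is an assumption, since the paper does not specify how the spatial locations are encoded.

```python
import numpy as np

def structured_trajectory_step(local_feat, global_feat, box, frame_size):
    """Build H_o^t = [phi_{l,o}^t, phi_{g,o}^t, b_o^t] for one frame.

    local_feat : 4144-dim local feature (CNN features + color histogram)
    global_feat: 4096-dim full-frame feature
    box        : (x1, y1, x2, y2) pixel coordinates of the tracked object
    frame_size : (width, height), used to normalize the box (an assumption)
    """
    w, h = frame_size
    b = np.array([box[0] / w, box[1] / h, box[2] / w, box[3] / h])
    return np.concatenate([local_feat, global_feat, b])

# One trajectory is the stack of per-frame representations along time.
H_t = structured_trajectory_step(np.zeros(4144), np.zeros(4096),
                                 (10, 20, 110, 220), (640, 360))
```

The full trajectory H_o is then the sequence of such vectors over the sampled frames where the object appears.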
By adopting the structured trajectory representation, we can directly use trajectory-sentence pairs for training, which is more effective and specific for learning activities from the transition of visual features and the evolution of spatial locations. This structured trajectory representation achieves the objective of reasoning about activities from visual features alone, without extra temporal cues from other tasks. Fig. 4 shows an example of our trajectory-sentence pairs. We have one trajectory-sentence pair for each object in the video; thus, more complex videos yield more pairs, covering all concurrent activities.

B. ATTRIBUTE EXPLORING
Furthermore, to provide more informative and accurate descriptions of different objects and identities, we add an attribute explorer. The objects that act in most videos can be roughly grouped into three super-classes: human, animal and vehicle. Different from traditional video captioning, we also incorporate detailed information, such as gender, which is very important in object-level descriptions. Nonetheless, this kind of detailed information cannot be obtained with a standard detector or tracker. To learn more effective and discriminative features for the detailed information of the different super-classes, we design the attribute explorer (see Fig. 2).
The attribute explorer F_p, which consists of fully connected layers, produces the attribute score vector

  γ_o = F_p(φ_{l,o}; W_p, b_p),   (5)

where W_p and b_p denote the weights and biases to be learned. Each dimension of γ_o represents the probability of o belonging to one of the predefined classes. The categorical cross-entropy is adopted as the loss function L_att. Finally, the attribute scores are concatenated with the high-level structured trajectory representation as the input for the subsequent caption generating module; thus, for object o, the final input of the caption generator at each time step is the concatenation of H_o^t and γ_o. With the attribute explorer, the network learns more discriminative features for the different super-classes, so the subsequent captioning module can better learn the connections between vision and natural language.
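A toy sketch of the attribute explorer: the real F_p has three fully connected layers, but a single softmax layer with categorical cross-entropy suffices to illustrate (5) and L_att. The feature dimension and weights below are illustrative, not the paper's values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attribute_explorer(phi, W_p, b_p):
    """One-layer stand-in for F_p: gamma_o = softmax(W_p @ phi + b_p)."""
    return softmax(W_p @ phi + b_p)

def attribute_loss(gamma, target_idx):
    """Categorical cross-entropy L_att against the true super-class."""
    return -np.log(gamma[target_idx] + 1e-12)

rng = np.random.default_rng(0)
phi = rng.standard_normal(16)        # toy local feature
W_p = rng.standard_normal((3, 16))   # 3 super-classes: human, animal, vehicle
gamma = attribute_explorer(phi, W_p, np.zeros(3))
loss = attribute_loss(gamma, target_idx=0)
```

The score vector gamma is then concatenated with the structured trajectory representation before caption generation.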

C. ATTRIBUTE-ENHANCED CAPTION GENERATING
With the results of the previous step, we can train a model on the obtained trajectory-sentence pairs shown in Fig. 4. Given an object-oriented trajectory H, the caption generator is required to understand the activities and automatically generate a sentence s = {s_1, s_2, ..., s_K} word by word:

  p(s | H; θ) = ∏_{k=1}^{K} p(s_k | s_1, ..., s_{k−1}, H; θ),   (6)

where θ represents the parameters to be learned, K is the length of the sentence, and s_1, ..., s_{k−1} denote the previously generated words. During training, the parameters are learned by maximizing the log-likelihood:

  θ* = argmax_θ Σ log p(s | H; θ).   (7)
An RNN has the capability to decode video contents into sentences. Most works adopt LSTMs or GRUs to generate descriptions [31], [38]. In our scheme, we choose the GRU, a good alternative to the LSTM, since it has fewer parameters and converges more easily on less data. Instead of the three gates of an LSTM (input, forget and output gates), a GRU has only two: an update gate and a reset gate. Unlike an LSTM, which transfers information through memory cells, a GRU uses the hidden states directly. The reset gate decides how much past information to forget, and the update gate controls what information to discard and what to carry over. In brief, the GRU is updated by (8), where φ̃_k and h_{k−1} denote the current input and the previous hidden state, respectively. We also adopt the temporal attention mechanism to help decide which frames are key for the current word generation and to mitigate the negative impact of incorrect tracking results (e.g., ID switches).
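For reference, the standard GRU update that (8) denotes can be written as follows, with the current input φ̃_k and previous hidden state h_{k−1}; the paper's exact parameterization may differ slightly:

```latex
\begin{aligned}
z_k &= \sigma\!\left(W_z \tilde{\phi}_k + U_z h_{k-1} + b_z\right) && \text{(update gate)}\\
r_k &= \sigma\!\left(W_r \tilde{\phi}_k + U_r h_{k-1} + b_r\right) && \text{(reset gate)}\\
\tilde{h}_k &= \tanh\!\left(W_h \tilde{\phi}_k + U_h \left(r_k \odot h_{k-1}\right) + b_h\right)\\
h_k &= (1 - z_k) \odot h_{k-1} + z_k \odot \tilde{h}_k
\end{aligned}
```

Here σ is the sigmoid function and ⊙ is the element-wise product.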
Most previous methods end by outputting the generated words obtained by feeding the GRU outputs to a softmax layer. In our STraNet, we further process the outputs of the softmax layer with the attribute-enhanced masks to revise the generated details of the objects. The words in the vocabulary that do not belong to the super-class of the current object are masked, as shown in Fig. 2. In the example, the current object is a girl; thus, the probabilities of generating words that cannot describe her, e.g., boy and man, are set to zero.
Formally,

  P_k = softmax(h_k) ⊙ M_o,   (9)

where softmax(h_k) represents the probabilities of generating all words in the vocabulary, M_o is the attribute-enhanced mask for object o generated by the above process, and ⊙ is the element-wise product. Through backpropagation, the attribute-enhanced masks encourage the exploration of informative features among different objects. Given a target ground-truth sequence s* = {s*_1, ..., s*_K}, we train the caption generator by minimizing the negative log-likelihood:

  L_cap = −Σ_{k=1}^{K} log p(s*_k | s*_1, ..., s*_{k−1}, H; θ).   (10)

The attribute explorer and the caption generator are trained jointly; in total, the loss of the generating part is

  L_G = L_cap + λ_att L_att,   (11)

where λ_att balances the attribute explorer loss.
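A small sketch of the attribute-enhanced masking on a toy vocabulary. Renormalizing after the element-wise product is an assumption; the paper only states that out-of-class word probabilities are masked to zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def masked_word_probs(logits, mask):
    """Eq. (9)-style masking: zero out words outside the current object's
    super-class, then renormalize (an assumption) so probabilities sum to 1."""
    p = softmax(logits) * mask
    return p / p.sum()

# Toy vocabulary: ["girl", "boy", "man", "jumps"]. The current object is a
# girl, so "boy" and "man" are masked out.
logits = np.array([1.0, 2.0, 0.5, 1.5])
mask = np.array([1.0, 0.0, 0.0, 1.0])
p = masked_word_probs(logits, mask)
```

In practice the mask is built per object from the attribute explorer's predicted super-class.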

D. ADVERSARIAL LEARNING
The adversarial learning scheme has facilitated numerous generation tasks [32], [33], [41]-[50]. To further promote learning of the vision-language relationships, we build an adversarial discriminator on top of the caption generator. Different from the use of discriminators in other tasks, we note that judging whether generated sentences are real or fake is difficult due to the diversity of language expression. Moreover, in contrast to image pixels in a continuous space, caption generation targets natural language in a discrete space, which is not conducive to training. Therefore, rather than deciding whether the generated sentences are real or fake, we add the corresponding visual contents as conditional labels and design the adversarial discriminator to distinguish whether the generated visual words are accurate for the given visual contents, e.g., color descriptions, clothing descriptions and gender information. The detailed architecture of the discriminator is shown in Fig. 5. We set up four situations for the discriminator to learn the vision-language correspondence:

  L_D = log D(v, s*) + log(1 − D(v, G(v))) + log(1 − D(v_un, s*)) + log(1 − D(v_un, G(v))),   (12)

where v denotes the local feature of the object, s* is the ground-truth sentence, G(v) is the description generated by the caption generator, and v_un is the randomly selected local feature of another object. The first two terms in (12) are the common losses of a GAN. Drawing on negative sampling, we add the latter two terms so that the discriminator further learns the correspondence between visual contents and visual words. By learning from positive and negative samples at the same time, the discriminator is better equipped to distinguish whether a generated sentence describes the visual contents accurately, and the performance of the caption generator is improved as well. Briefly, our objective is

  min_G max_D L_D.   (13)
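The four situations can be sketched as a single discriminator loss in NumPy. The bilinear scorer below is a toy stand-in for the real discriminator network (which processes and concatenates the two modalities through fully connected layers); only the four-term structure is taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(D, v, v_un, s_real, s_gen):
    """Four-situation discriminator objective: only (paired features,
    ground-truth sentence) is labeled real; generated sentences and
    unpaired (negative-sample) features are labeled fake."""
    eps = 1e-12
    real = np.log(D(v, s_real) + eps)              # paired + ground truth -> real
    fake = (np.log(1 - D(v, s_gen) + eps)          # paired + generated    -> fake
            + np.log(1 - D(v_un, s_real) + eps)    # unpaired + ground truth -> fake
            + np.log(1 - D(v_un, s_gen) + eps))    # unpaired + generated  -> fake
    return -(real + fake)                          # minimize the negative

# Toy bilinear scorer standing in for the real discriminator network.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
D = lambda v, s: sigmoid(v @ W @ s)
v, v_un = rng.standard_normal(4), rng.standard_normal(4)
s_real, s_gen = rng.standard_normal(4), rng.standard_normal(4)
loss = discriminator_loss(D, v, v_un, s_real, s_gen)
```

Training alternates: the discriminator minimizes this loss, while the generator is pushed to make D(v, G(v)) large.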

IV. EXPERIMENTS
In this section, we first introduce the dataset used in our object-oriented video captioning, followed by the implementation details. Next, our experimental results are reported accompanied by the comparisons with other methods. Finally, the detailed ablation studies are presented to analyze the impact of each component in our STraNet.

A. OBJECT-ORIENTED CAPTIONS DATASET
The most widely used datasets for video captioning are the MSR-Video to Text (MSR-VTT) dataset [23] and the Microsoft Video Description (MSVD) dataset [55].
The MSR-VTT dataset contains 10K short video clips and 200K video-sentence pairs, and the MSVD dataset provides 1,970 YouTube clips. Moreover, the ActivityNet Captions dataset [18], which contains 20K videos from 200 activity classes (e.g., drinking, dancing, playing games), is the most popular benchmark for dense video captioning. We summarize these datasets in Table 1. However, object-level specific information is not available, for example, which sentence describes which object. Due to this one-to-many nature, training on this kind of data cannot effectively learn the functional mappings and the vision-language connections.
To better learn the functional mappings between vision and natural language, we re-annotate a portion of videos from the ActivityNet dataset with explicit object-sentence pairs to construct a new dataset for object-oriented video captioning, named the Object-Oriented Captions dataset. We choose all videos from the class 'playing games' in the ActivityNet Captions dataset. The videos of this class have more diverse activities and scenes than those of other classes, and each video contains more diverse individuals and interactive activities. In total, we re-annotate 75 videos with 534 object-sentence pairs. Each video is between 10 and 234 seconds long. On average, each video contains about 5 acting objects, and the average length of each object trajectory is about 248 frames. Most importantly, the identities of the objects are provided in our object-oriented video captioning dataset, and one sentence is associated with each object's activities or motion trajectory. We annotate the most salient objects in each video. As shown in Table 1, our data consist of object-sentence pairs, which is different from all the other datasets. According to the detailed statistics of our captioning annotations, our sentences contain almost twice as many words per sentence, including verbs and adjectives. In MSR-VTT, MSVD and ActivityNet Captions, one description has fewer than 1.4 verbs on average, whereas we use more than 2 verbs to describe each object in our re-annotation. The Fine-grained Sports Narrative (FSN) dataset is a dataset for fine-grained captioning of sports narratives [39] with more verbs describing fine-grained actions; even compared with FSN, our sentences provide richer information. Similarly, we analyze the adjectives in the annotated sentences. Each sentence in our dataset has about 2 adjectives, whereas the sentences of all the other datasets have fewer than 0.67 adjectives. Although the ratios of adjectives and verbs are not significantly higher than those of the other datasets, this is because we have far fewer videos, so these adjectives adequately describe the objects. Obviously, our data have more diverse sentences, especially in complex scenarios. All the above statistical differences show that our data are more informative for distinguishing distinct objects and their corresponding activities. In our experiments, we use 55 videos for training and 20 videos for testing. Although our dataset is smaller than the existing ones, it is sufficient for this work, because we currently focus on the particular scene of 'playing games' to test the learning capability and effectiveness of the proposed model; its generalization to other scenarios can be investigated in the future. Table 2 further shows the constituents of words in the Object-Oriented Captions dataset. Our sentences contain more detailed information for distinguishing different individuals, for example, the color of the clothes (common colors, plaid, floral) and the type of clothes (shirt, sweater, pants, shorts, vest, dress, skirt). Fig. 6 shows two examples of our re-annotated data.

B. IMPLEMENTATION DETAILS
1) OBJECT TRAJECTORY PROCESSING
For each trajectory, we sample 40 equally-spaced frames. We adopt VGG19 pre-trained on ImageNet as the backbone and extract semantic visual features from its last pooling layer. While building the object-oriented structured trajectory representation, we feed the cropped objects into the backbone to obtain their object-level local features; meanwhile, the entire frames in which the object appears are fed into the backbone to obtain the corresponding global features. Finally, for each RGB channel, we extract a 16-dimensional color histogram, resulting in 4144-dim local features (the 4096-dim semantic features plus the 48-dim color histogram) and 4096-dim global features.
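The sampling and color-feature steps above can be sketched as follows. The helper names (`sample_frame_indices`, `color_histogram`) are our own, and the 4096-dim VGG19 pooling feature is stood in by a random vector; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=40):
    """Pick 40 equally-spaced frame indices along a trajectory."""
    return np.linspace(0, num_frames - 1, num_samples).astype(int)

def color_histogram(patch, bins=16):
    """16-bin histogram per RGB channel -> normalized 48-dim vector."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / max(hist.sum(), 1.0)

# Stand-in for the 4096-dim VGG19 feature of one cropped object.
vgg_local = np.random.rand(4096)
patch = np.random.randint(0, 256, size=(64, 64, 3))  # cropped object pixels
# Local feature = semantic feature + color histogram -> 4144 dims.
local_feature = np.concatenate([vgg_local, color_histogram(patch)])
```

Appending the 48-dim histogram restores the color information that top-layer semantic features tend to discard, which the later ablation study confirms is useful.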

2) SENTENCE PROCESSING
For the sentences, we remove punctuation, split on whitespace, and convert all words to lower case. We set the maximum length of each sentence to 25 words; longer sentences are truncated. We randomly initialize all word embeddings with a fixed dimension of 512.
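A minimal sketch of this preprocessing (the function name is our own; the paper does not publish code):

```python
import string

def preprocess_sentence(sentence, max_len=25):
    """Remove punctuation, lowercase, whitespace-tokenize, truncate to max_len."""
    table = str.maketrans("", "", string.punctuation)
    tokens = sentence.translate(table).lower().split()
    return tokens[:max_len]

# e.g. preprocess_sentence("The boy, in black, throws a stone!")
# -> ['the', 'boy', 'in', 'black', 'throws', 'a', 'stone']
```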

3) TRAINING DETAILS
The attribute explorer consists of three fully-connected layers; Table 3 shows our pre-defined super-classes. For the caption generator, the GRU is initialized with 2 layers of 1024-dimensional hidden units. For the adversarial discriminator, we first process the visual features and the embeddings of the descriptions, yielded by the caption generator or taken from the ground-truth, separately; we then concatenate the two modalities and decide whether they are paired via fully-connected layers. We empirically set the hyper-parameter λ_att in (11) to 0.1, and adopt adaptive moment estimation (Adam) for optimization, with an initial learning rate of 0.0001 and a mini-batch size of 50 object-sentence pairs.
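The pairing decision can be sketched with a toy fully-connected discriminator in numpy. The layer sizes, parameter names, and random inputs here are assumptions for illustration, not the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def discriminator(visual, sentence_emb, p):
    """Score whether a (visual, sentence) pair is matched (→1) or not (→0)."""
    v = relu(visual @ p["wv"] + p["bv"])        # project visual features
    s = relu(sentence_emb @ p["ws"] + p["bs"])  # project sentence embedding
    joint = np.concatenate([v, s], axis=-1)     # fuse the two modalities
    logit = joint @ p["wo"] + p["bo"]           # final fully-connected layer
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid probability

# Illustrative sizes: 4144-dim visual input, 512-dim sentence embedding.
p = {"wv": rng.standard_normal((4144, 256)) * 0.01, "bv": np.zeros(256),
     "ws": rng.standard_normal((512, 256)) * 0.01, "bs": np.zeros(256),
     "wo": rng.standard_normal(512) * 0.01, "bo": 0.0}
score = discriminator(rng.standard_normal(4144), rng.standard_normal(512), p)
```

During adversarial training, the caption generator would be rewarded for descriptions that push this score toward that of ground-truth pairs.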

1) COMPARISONS WITH STATE-OF-THE-ART METHODS
We adopt the metrics BLEU [24], METEOR [27], ROUGE-L [26], and CIDEr-D [25], which are widely used in text generation tasks, to quantitatively evaluate the proposed approach; higher scores indicate better performance. We compare our method with the state-of-the-art video captioning methods MP-LSTM [16], SA-LSTM [15], S2VT [7], RecNet [14], FSTA [6], and OA-BTG [19]. All these traditional methods utilize video-sentence pairs during training to generate descriptions for videos. To handle the proposed object-oriented video captioning task, these methods retain their general frameworks but directly utilize object-sentence pairs to generate descriptions for objects.
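For reference, sentence-level BLEU (modified n-gram precision with a brevity penalty) can be sketched as below. This is an illustrative, unsmoothed version; evaluation toolkits use corpus-level pooling and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU@max_n with uniform weights and brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())      # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                      # no smoothing in this sketch
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0; scores drop quickly when only a single reference is available, which is relevant to the discussion of our one-reference setting below.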
MP-LSTM is a baseline method that relies on mean pooling to process the frame features. SA-LSTM adopts a temporal attention mechanism to select the key frames. S2VT adopts LSTMs in both the encoder and the decoder. RecNet, FSTA, and OA-BTG achieve state-of-the-art performance in video captioning; however, as shown in Table 4, these data-driven methods cannot perform well with small training sets owing to their inability to understand the temporal evolution.
FSTA uses masks generated by Mask R-CNN to separate effective object features from the background. However, features with an explicit boundary are less conducive to training, so its performance is worse than ours. In addition, it adopts Mask R-CNN as the backbone to extract visual semantic features; the visual features extracted from object-detection models pre-trained on COCO carry less semantic information than those extracted from image-classification models pre-trained on ImageNet, so FSTA takes more time to converge. For OA-BTG, owing to errors in the tracking operations and its large number of network parameters, it cannot obtain satisfactory performance on small datasets either. MP-LSTM and SA-LSTM perform better than the other data-driven methods because they have fewer network parameters and can be trained well with small samples.
As shown in Table 4, by building the object-oriented structured trajectory representation alone, we already obtain better performance than the other methods, and with the full STraNet we further achieve the highest BLEU@4 score of 21.2. By representing the moving objects via the structured trajectory representation, the proposed STraNet can analyze the activities and interactions among objects from the temporal evolution of the visual features and the dynamic movement of the spatial locations, without relying on large amounts of training data. Therefore, our STraNet is better able to understand the activities and attributes under limited data and to effectively learn the cross-modal relationships. The visualization examples in Fig. 7 show that the proposed STraNet generates more accurate descriptions of activities, such as 'draw' and 'throw a stone'; meanwhile, it describes the interactions between objects and background, as well as the detailed attributes of the objects, more accurately.

2) ABLATION STUDY
To validate the importance of each component of the proposed method, we perform a detailed ablation study. From Table 4, the BLEU@4 score with only the high-level structured trajectory representation is 19.3, which is already much higher than the other methods. This significant improvement demonstrates that our object-oriented structured trajectory representation expresses the objects well, including their activities, attributes, object-object interactions, and object-background interactions. As shown in Table 5, after adding the attribute explorer, the performance improves further to 20.2, which proves that with the attribute explorer the network can learn more discriminative features for different super-classes and describe the objects' details accurately. Comparing the third and fourth rows of Table 5, we find that the attribute-enhanced caption generator brings more improvement than the adversarial discriminator. Finally, the full scheme achieves the best performance of 21.2 in terms of BLEU@4, the most widely used metric for evaluating text generation.
Specifically, for each object in the object-oriented video captioning dataset, we only have one ground-truth sentence as the reference. Given the richness of language, it is difficult to generate sentences exactly the same as the ground-truths, since two different sentences can easily express the same meaning. Therefore, it is harder to improve the BLEU score against a single reference than against multiple references. In our work, we obtain a significant improvement from 18.3 to 21.2, which shows that our structure understands the objects well and bridges the vision-language connections.
To explore the impact of the different components of the object-oriented structured trajectory representation, we conduct a more detailed ablation study in Table 6. The BLEU@4 performance with only global features is 16.1, worse than all the others, while the performance with only local features is 16.6, which indicates that the explicit object features benefit learning the relationships between vision and language. After adding the color information, the performance improves to 18.7, verifying that our color vectors effectively compensate for the color information that the semantic features from the top layers of a neural network usually miss; combining the color information thus facilitates generating more accurate descriptions of the attributes. Subsequently, we further combine the spatial locations of the objects to represent the moving trajectories more clearly. The experimental results verify that the temporal evolution of spatial locations is highly effective for understanding activities over time.
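Assembling one time step of the representation can be sketched as follows, assuming the four components studied in Table 6 (local feature, global feature, color histogram, spatial location) are simply concatenated; the function name and normalization scheme are illustrative assumptions, not the paper's exact fusion:

```python
import numpy as np

def structured_step(local_feat, global_feat, color_hist, bbox, frame_size):
    """One time step of the structured trajectory representation:
    [local feature | global feature | color histogram | normalized location]."""
    x, y, w, h = bbox
    fw, fh = frame_size
    # Normalize the box by frame size so location is resolution-independent.
    location = np.array([x / fw, y / fh, w / fw, h / fh])
    return np.concatenate([local_feat, global_feat, color_hist, location])

# Toy inputs: 4096-dim local/global features, 48-dim color histogram.
step = structured_step(np.zeros(4096), np.zeros(4096), np.zeros(48),
                       bbox=(100, 50, 80, 160), frame_size=(1280, 720))
```

Feeding the per-step location alongside the appearance features is what lets the model read off movement patterns such as approaching or crossing from the trajectory itself.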
Furthermore, we show some visualization results of the ablation study in Fig. 8. It can be seen that, after adding the spatial locations, the model learns more detailed spatial information about the objects, such as generating 'in front of' and 'on the right'. In the second example, the model can further reason that the boy is 'then back to the camera' from the evolution of the spatial locations. In addition, after combining the other modules of our framework, the detailed attributes of the objects become more accurate.

FIGURE 9. The first is an example of describing an object which is not human: the proposed method accurately describes the object class 'car', the status of the object 'parking', and the spatial location of the object 'across the road' in detail. The second example shows that our method can still generate an accurate description from an inaccurate trajectory caused by an ID switch.
We show an example of describing a non-human object in Fig. 9: the proposed method accurately describes the object 'car', its status 'parking', and its location 'across the road'. In the second example, while tracking 'the boy in black', the ID switches to 'the girl in pink' because of object occlusion and camera motion; nevertheless, the proposed method handles the inaccurate tracking results and generates the correct description for 'the boy in black'. However, if the tracking results are extremely poor, for example when objects are heavily occluded or ID switches occur too frequently, the model cannot capture accurate visual features, and the overall performance degrades. In Fig. 10, we further show some failure cases. How to improve the detection and tracking algorithms to adapt to natural videos is an important future direction.
Overall, the experimental results indicate that our method can reason about the activities precisely even without temporal cues from action recognition. This demonstrates that our object-oriented structured trajectory representation is effective for expressing the activities of the objects, the object-object interactions, and the object-background interactions. Meanwhile, the attribute explorer, the modified attribute-enhanced caption generator, and the adversarial discriminator significantly improve the descriptions of detailed attributes and activities. Moreover, the object-sentence pairs benefit the cross-modal learning. So far, we have focused on the scene of 'playing games'; however, we believe our framework can be extended to other scenarios. All in all, the proposed method can accurately describe the concurrent objects and their corresponding dynamic attributes in more detail.

V. CONCLUSION
In this paper, we propose object-oriented video captioning, a novel task that transforms traditional frame-level video captioning into object-level analysis. Unlike most previous image-based methods, we propose an object-oriented video captioning framework for more realistic video understanding, which can analyze concurrent activities in detail. We design the object-oriented structured trajectory network via adversarial learning, which consists of four components: the structured trajectory representation, the attribute explorer, the attribute-enhanced caption generator, and the adversarial discriminator. The proposed approach can reason about the activities, object-object interactions, and object-background interactions along the time domain based on the spatio-temporal evolution of visual features. In addition, we modify the discriminator to make the adversarial learning better adapted to the caption generation task and to reinforce the vision-language relationship learning. To the best of our knowledge, this is the first work to shift video captioning from generating a holistic description to generating detailed descriptions for each object. The proposed framework is the first object-oriented video-based attempt that understands the activities along time without temporal cues from other tasks, moving the approach toward real video understanding. The Object-Oriented Captions dataset is the first to provide explicit object-sentence pairs; with this kind of data, further works can better explore the vision-language relationships. The experimental results demonstrate that our method achieves significant improvements over the state-of-the-art methods for traditional video captioning, and that it can effectively understand the activities and provide richer information about the whole scene and the concurrent objects.
JENQ-NENG HWANG (Fellow, IEEE) received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1981 and 1983, respectively, and the Ph.D. degree from the University of Southern California. In Summer 1989, he joined the Department of Electrical and Computer Engineering (ECE), University of Washington, Seattle, where he has been a Full Professor since 1999. He is currently the Associate Chair for Global Affairs and International Development of the ECE Department, and a Founder and Co-Director of the Information Processing Laboratory. He has written more than 350 journal articles, conference papers, and book chapters in machine learning, multimedia signal processing, computer vision, and multimedia system integration and networking, and has authored a textbook. His research interests include pattern recognition theory and application, information retrieval, content-based information security, and bioinformatics.
VOLUME 8, 2020