OSVidCap: A Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario

Automatically understanding and describing the visual content of videos in natural language is a challenging task in computer vision. Existing approaches are often designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video. This work presents OSVidCap, a novel open-set video captioning framework that recognizes and describes, in natural language, concurrent known actions and deals with unknown ones. OSVidCap is based on the encoder-decoder framework and uses a detection-and-tracking-object-based mechanism followed by a background blurring method to focus on specific targets in a video. Additionally, we employ the TI3D Network with the Extreme Value Machine (EVM) to learn representations and recognize unknown actions. We evaluate the proposed approach on the benchmark ActivityNet Captions dataset. We also propose an enhanced version of the LIRIS human activity dataset that provides descriptions for each action, together with spatial, temporal, and caption annotations for previously unlabeled actions in the dataset, which are treated as unknown actions in our experiments. Experimental results show the effectiveness of our method in recognizing and describing concurrent actions in natural language and its strong ability to deal with detected unknown activities. Based on these results, we believe that the proposed approach can be helpful for many real-world applications, including human behavior analysis, safety monitoring, and surveillance.


I. INTRODUCTION
Video understanding is a challenging issue in computer vision. It requires sophisticated techniques to process the diversity of human and object appearances in different environments and their relationships over time.
The ability to detect and identify specific events is also a critical step towards video understanding. Video events are high-level semantic concepts perceived by humans in a video sequence [1]. Each event is composed of one or more meaningful objective actions, such as walking or jumping, and interactions with objects, such as typing on a computer or handshaking [2]. Each perceived concept consists of an entity (human, object, action, or scene attributes) that occupies a specific position in a frame and may vary in size, color, shape, or other specific attributes.
The associate editor coordinating the review of this manuscript and approving it for publication was Khoa Luu.
Video description (also called video captioning) is one of the many problems under video understanding. It has become a hot topic in computer vision and deep learning [3] and requires solving many different tasks simultaneously, including object detection and classification, action detection and recognition, and the modeling of visual relationships among humans and objects. A video description approach may be employed in various applications such as human-robot interaction, video indexing, assistance to the visually impaired, understanding sign language, and video surveillance.
VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Current deep learning techniques are effective at learning discriminative spatio-temporal features from raw data. They are used to solve several complex tasks, such as object detection and classification [4], human action recognition [5], [6], video summarization [7], semantic image segmentation [8], and video understanding [9]. However, a step beyond the simple categorical classification of actions in scenes is to describe events in a human-comprehensible language. To accomplish this, it is crucial to understand the semantics of a given video scene.
Despite the efforts and progress that have been made in the video description task, it is still an open problem and has attracted much attention [3]. Existing approaches are limited to the fixed list of activities in the training corpus and have focused on generating a holistic description of short-length videos with only one main action happening in the video. However, in practical applications, such as safety monitoring and surveillance, videos may have concurrent activities, and humans can perform many different actions and even create new movements and hand gestures at will.
A more realistic approach is to assume an open-set scenario for describing actions. Open-set classifiers allow performing classification by enclosing each class in the feature space and reserving space for new classes to emerge, unlike closed-set classifiers, which assign infinite spaces to training classes. This strategy allows rejecting data from previously unknown classes instead of wrongly assigning the class label with the highest probability value [10].
Following this idea, a video captioning approach in an open-set scenario can adequately describe known actions and deal with unknown ones. Thus, it is essential to detect whether the performed action was seen during the training step, in order to correctly describe known actions or activities and avoid generating wrong descriptions of newly detected actions.
Based on that, this work presents a novel open-set video captioning framework that aims to describe, in natural language, not only single but also concurrent events occurring in a video. The proposed approach uses an open-set action recognition model to detect unknown actions, thus avoiding incorrect descriptions and hallucinations. Some recent works have successfully performed video action recognition in an open-set scenario [10], [11]. However, to the best of our knowledge, this is the first time such properties are explored in the video captioning task.
The proposed representation learning approach is based on the encoder-decoder framework and uses a detection-and-tracking-object-based mechanism followed by a background blurring method to define the targets and recognize the concurrent actions to be described. Additionally, we employ the Triplet Inflated 3D Neural Network recently proposed by [11], which uses Deep Metric Learning and the Extreme Value Machine (EVM) [12] as the open-set classifier. The main contributions of this paper can be summarized as follows:
• We propose a novel video captioning framework to recognize and describe concurrent actions/activities performed by humans in an open-set scenario;
• We present a novel open-set mechanism to detect out-of-domain videos of unseen activities;
• We present extensive experiments and analysis, using 2D and 3D feature representations, demonstrating the effectiveness of our approach.
The remainder of this paper is organized as follows. Section II presents a brief description of related works. In Section III, we present the theoretical aspects related to the proposed method for open-set action recognition. In Section IV, we describe in detail the proposed framework. Section V describes the datasets. Next, in Section VI, we present the experimental settings, their results, and a discussion. Finally, in Section VII we present the conclusions and suggestions for future research directions.

II. RELATED WORKS
Early methods for the video description task were template-based: the Subject (S), Verb (V), and Object (O) were detected and then used in a sentence template [3]. Although these methods could generate descriptions based on grammar, they did not take into account the spatial and temporal associations between entities and suffered from a lack of diversity in the generated sentences. Inspired by the rapid development of deep learning techniques in the Computer Vision and Natural Language Processing areas, video description research has recently become a hot topic.
The video description approaches based on deep learning methods are mainly designed in the encoder-decoder architecture [3], [13]. The encoder is usually a combination of 2D and/or 3D CNN and LSTM that converts the input into a feature vector representation of fixed length. The decoder is usually an LSTM or GRU that generates a sequence of words.
Pre-trained deep learning models, such as VGGNet [14] or ResNet [15], are commonly used to extract spatial features from frames. These features are usually combined across the frames by an average pooling or max-pooling operation, resulting in a single fixed-length feature vector representation for a short video clip. Besides, the C3D [16] or I3D [17] models, pre-trained in a large dataset such as the Sports-1M dataset [18] or Kinetics dataset [19], are used to extract temporal features. The use of pre-trained models on large datasets provides a strong visual representation of objects, actions, and scenes depicted in the video [20].
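The pooling step described above can be illustrated with a minimal sketch. It assumes that per-frame feature vectors have already been extracted by some pre-trained CNN; the function name `pool_frames` and the toy two-dimensional vectors are hypothetical and only serve to show how mean or max pooling collapses a variable number of frames into a single fixed-length vector.

```python
def pool_frames(frame_features, mode="mean"):
    """Collapse per-frame feature vectors into one fixed-length vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    # zip(*rows) yields one tuple per feature dimension across all frames
    columns = list(zip(*frame_features))
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Three toy "frames" with two features each (real features have 512 to 2048 dimensions)
frames = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool_frames(frames, "mean"))  # [2.0, 2.0]
print(pool_frames(frames, "max"))   # [3.0, 4.0]
```

Either way, the output length depends only on the feature dimension and not on the number of frames, which is what allows a fixed-size encoder input.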
Reference [20] proposed the first end-to-end learning approach based on deep neural networks for the video captioning task. A variant of AlexNet pre-trained on a subset of the ImageNet [21] dataset was used to extract visual features from frames. Then, mean pooling was employed, resulting in a single vector representing the entire video. Finally, two stacked LSTMs were used to generate the sentence.
Since then, many approaches have been proposed that use attention mechanisms to dynamically select spatial and temporal features, focusing on important frames and regions inside them and providing meaningful visual evidence for caption generation [22]-[26]. The use of attention mechanisms has improved the video captioning task, suggesting that this method can effectively improve the descriptions, especially in discontinuous videos, by focusing on specific parts of the visual input.
Considering that open-domain videos cover a broad range of topics, such as sports, music, food, and so on, some approaches have been proposed to generate sentences guided by latent topics [27] and semantic attributes [28]. The use of multimodal data, such as visual, audio, motion, and textual information, was also explored in some works [29], [30]. The combination of audio, movement, and visual information has been shown to play an important role in the description generation process.
The dense-captioning events task was proposed by [31] and consists of detecting and identifying all events in a given video and describing them in natural language. Their proposed approach uses DAPs [32] to localize temporal event proposals and a caption module based on LSTM to generate a sentence for each event proposal. Reference [33] also proposes a unified end-to-end approach for video dense captioning. However, instead of using an RNN for description generation, the authors used Transformers [34]. Their proposed approach is composed of three components: a video encoder, a proposal decoder, and a captioning decoder. The video encoder is composed of multiple self-attention layers. The Temporal Action Proposal (TAP) is based on ProcNets [35], which was designed to detect actions in long videos. Moreover, the captioning decoder module uses Transformers to generate the sentence for each event proposal.
Despite achieving promising results, these approaches often fail to describe concurrent activities happening in a video. Also, the datasets used to evaluate these approaches are created with videos extracted from movies or YouTube videos. Such videos cover a broad range of topics, such as sports, music, food, and so on, and a wide variety of different individual and collective actions performed by humans, animals, and even moving cartoon objects. These videos also present specific challenges, including the presence of discontinuity points between frames, as reported by [36], which may result in inadequate temporal representation features.
Besides the limitations presented above, the lack of well-labeled data is a crucial problem in the deep learning area. The zero-shot learning task has been studied to classify actions with no or few examples during the training step [28], [37]. Some approaches have been proposed for the visual description task [38], [39] to describe novel objects not present in paired image-sentence datasets. The zero-shot video captioning task [40] focuses on describing out-of-domain videos of a novel activity without paired captions, but with knowledge of the activity.
The approaches presented so far assume that all possible classes are already known during the training or test phase. However, new classes emerge as time passes in real-world dynamic environments. An open-set Human Action Recognition approach requires the classifier to accurately classify known classes seen during the training stage and to deal with unknown classes, which are unseen and have no semantic information provided during the training stage [10]. In this work, we also exploit the nature of the open-set recognition problem to propose a framework to describe videos in an open-set scenario. As previously stated, to the best of our knowledge, there is a lack of related works on this approach in the literature, which makes it the main original contribution of the present work.

III. THEORETICAL ASPECTS
This Section presents the fundamentals of the methods used in our open-set recognition module: the Extreme Value Machine and the Triplet Inflated 3D Neural Network.

A. THE EXTREME VALUE MACHINE
The Extreme Value Machine (EVM) was initially proposed by [12] to perform open-set classification. In the EVM, the modeling of each class in the training set is based on a set of extreme vectors, each of which is associated with a Probability of Sample Inclusion ($\Psi$).
The key concept of the EVM is the use of margin distributions, that is, the distribution of the half-margin distances of the training data. In the original formulation, $x_i$ is a training sample and $y_i$ is the corresponding label. Considering $x_i$ and $x_j$, where $\forall j,\ y_j \neq y_i$, let $x_j$ be the nearest such point to $x_i$; the margin estimate for the pair $(x_i, x_j)$ is then $m_{ij} = \|x_i - x_j\| / 2$. The $m_{ij}$ value can be computed for the $\tau$ nearest points, and the distribution of the margins is estimated from those points using the Extreme Value Theorem (EVT). Under the EVT, the distribution of the minimal margin distances to $x_i$ is given by a Weibull distribution [12]. The probability of inclusion for a point $x'$ is given by

$$\Psi(x_i, x'; \kappa_i, \lambda_i) = \exp\left(-\left(\frac{\|x_i - x'\|}{\lambda_i}\right)^{\kappa_i}\right), \tag{1}$$

in which $\|x_i - x'\|$ is the distance between $x'$ and $x_i$, and $\kappa_i$ and $\lambda_i$ are the Weibull shape and scale parameters, respectively. Each $\Psi$ is considered an EVT rejection model, and $\Psi(x_i, x'; \kappa_i, \lambda_i)$ corresponds to the probability that a sample does not lie beyond the negative margin. Even though a sample has zero probability around the margin, the model can also be extended to support soft margins. The probability that a point $x'$ belongs to class $C_l$, where $l$ is the class index, is given by Equation 2:

$$\hat{P}(C_l \mid x') = \max_{\{i \,:\, y_i = C_l\}} \Psi(x_i, x'; \kappa_i, \lambda_i). \tag{2}$$

Finally, the classification function is:

$$y^* = \begin{cases} \arg\max_{l} \hat{P}(C_l \mid x') & \text{if } \max_{l} \hat{P}(C_l \mid x') \geq \delta, \\ \text{unknown} & \text{otherwise,} \end{cases} \tag{3}$$

in which $\delta$ is a threshold responsible for defining the boundary between known and open space. In order to reduce the size of the model, many redundant pairs can be discarded with minimal impact on performance. Details of this procedure can be found in [12].
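The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Weibull shape and scale parameters of each extreme vector are taken as given (fitting them from the margin distribution and the model reduction step are omitted), the Euclidean distance is assumed, and the names `psi`, `classify`, and `demo_vectors` as well as all numeric values are hypothetical.

```python
import math


def psi(distance, shape, scale):
    """Weibull-based probability of sample inclusion for one extreme vector."""
    return math.exp(-((distance / scale) ** shape))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(x, extreme_vectors, delta=0.5):
    """extreme_vectors: list of (vector, label, shape, scale) tuples.

    Returns (label, probability); the label is "unknown" when the best
    class probability falls below the threshold delta.
    """
    best = {}
    for vec, label, shape, scale in extreme_vectors:
        p = psi(euclidean(x, vec), shape, scale)
        best[label] = max(best.get(label, 0.0), p)  # max over the class's vectors
    label, prob = max(best.items(), key=lambda kv: kv[1])
    return (label, prob) if prob >= delta else ("unknown", prob)


# Hypothetical model: one extreme vector per class, hand-picked parameters
demo_vectors = [
    ((0.0, 0.0), "walking", 2.0, 1.0),
    ((5.0, 5.0), "typing", 2.0, 1.0),
]
print(classify((0.1, 0.0), demo_vectors)[0])  # walking
print(classify((2.5, 2.5), demo_vectors)[0])  # unknown
```

The second query lies far from every extreme vector, so all inclusion probabilities are close to zero and the sample is rejected rather than being assigned to the nearest known class.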

B. TRIPLET INFLATED 3D NEURAL NETWORK (TI3D)
The TI3D is a Deep Metric Learning Neural Network introduced in [11]. It uses the I3D as the base model to build a cosine triplet loss network. The TI3D learns a feature mapping such that intra-class distances are small and inter-class distances are large.
The TI3D takes three inputs: Anchor, Positive, and Negative. For the human action recognition task, the Anchor ($a$) represents a video of any given action, the Positive ($p$) represents a video of the same action, and the Negative ($n$) represents a video of a different action, both w.r.t. the Anchor. Given $N$ $(a, p, n)$ triplets, the Triplet loss function $L$ is defined by:

$$L = \sum_{i=1}^{N} \left[ D\left(f(x_a^i), f(x_p^i)\right) - D\left(f(x_a^i), f(x_n^i)\right) + \alpha \right]_+, \tag{4}$$

in which $f(x_a)$, $f(x_p)$, and $f(x_n)$ are the Anchor, Positive, and Negative embeddings, respectively, $\alpha$ is the margin parameter, and $D$ denotes the cosine distance between two vectors $x_i$ and $x_j$:

$$D(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}.$$

Additionally, the symbol $+$ indicates the operator $[z]_+ = \max(0, z)$. This loss function attempts to make the cosine distance between Anchor and Positive samples smaller than the distance between the Anchor and Negative instances by at least a margin of $\alpha$. In other words, it forces examples of the same class to be mapped closer together than examples of different classes (or even previously unknown examples).
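As an illustration of the loss described above, the following sketch computes a cosine triplet loss over a list of embedding triplets. It assumes non-zero embedding vectors and an unnormalized sum over triplets; the function names and toy vectors are hypothetical, and the margin default of 0.2 mirrors the value reported later for training.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(triplets, margin=0.2):
    """Sum of hinge terms over (anchor, positive, negative) embeddings."""
    total = 0.0
    for a, p, n in triplets:
        gap = cosine_distance(a, p) - cosine_distance(a, n) + margin
        total += max(0.0, gap)  # the [.]_+ operator
    return total


# Positive identical to the anchor, negative orthogonal: margin satisfied
good = ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Positive orthogonal, negative identical: worst case for this triplet
bad = ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(triplet_loss([good]))  # 0.0
print(round(triplet_loss([bad]), 6))  # 1.2
```

A triplet contributes nothing once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets produce gradients.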
We employ the TI3D with its default parameters and use hard and semi-hard triplet mining, as shown by [11]. Semi-hard triplets are defined as triplets in which the cosine distance $D$ between the Anchor and Positive is smaller than the distance between the Anchor and Negative videos, but the difference between the two is smaller than the margin parameter, i.e., $D(f(x_a), f(x_p)) < D(f(x_a), f(x_n)) < D(f(x_a), f(x_p)) + \alpha$. Hard triplets are defined as triplets in which the distance between the Anchor and Positive is larger than the distance between the Anchor and Negative, i.e., $D(f(x_a), f(x_p)) > D(f(x_a), f(x_n))$. This triplet mining strategy ensures that only triplets with a positive loss w.r.t. Eq. 4 are used during training.
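The mining criteria above can be expressed as a small classification function over pre-computed distances. This is a sketch rather than the implementation of [11]: the names `triplet_kind` and `mine` are hypothetical, and the "easy" and "tie" labels are introduced here only to name the cases that the two definitions leave out.

```python
def triplet_kind(d_ap, d_an, margin=0.2):
    """Categorize a triplet from its anchor-positive and anchor-negative distances."""
    if d_ap > d_an:
        return "hard"
    if d_ap < d_an < d_ap + margin:
        return "semi-hard"
    if d_an >= d_ap + margin:
        return "easy"  # margin already satisfied, zero loss
    return "tie"  # d_ap == d_an: not covered by either definition in the text


def mine(distance_pairs, margin=0.2):
    """Keep only the hard and semi-hard (d_ap, d_an) pairs."""
    return [
        pair for pair in distance_pairs
        if triplet_kind(pair[0], pair[1], margin) in ("hard", "semi-hard")
    ]


print(triplet_kind(0.5, 0.3))  # hard
print(triplet_kind(0.3, 0.4))  # semi-hard
print(triplet_kind(0.1, 0.9))  # easy
print(mine([(0.5, 0.3), (0.3, 0.4), (0.1, 0.9)]))  # [(0.5, 0.3), (0.3, 0.4)]
```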

IV. METHODS
In this section, we present the OSVidCap framework for video captioning. It consists of five main modules: Target Detection and Localization (TDL), Features extraction, Open set module, Encoder, and Caption Generation. The overall architecture of OSVidCap is presented in Figure 1 and detailed as follows.

A. TDL MODULE
Detecting multiple concurrent events in a given video is essential to describe them in natural language adequately. The Target Detection and Localization (TDL) module consists of a mechanism designed to detect and track significant moving objects in a given video, which are considered the main concepts of the event. The output of this module consists of video segments for each moving object detected with a blurred background.
More specifically, the TDL module detects and tracks humans but is easily adaptable to other moving objects (such as animals and vehicles). We employ YOLOv4 [4] to detect humans and track them using the Deep SORT method [41]. Human-human or human-object interaction is captured when the entities overlap in consecutive frames. In such cases, the entities are considered a single region of interest in the final video segment.
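The overlap rule described above could be realized along the following lines. The sketch assumes that each tracked entity is available as a list of per-frame bounding boxes `(x1, y1, x2, y2)`; the helper names and the `min_consecutive` parameter (how many consecutive overlapping frames count as an interaction) are hypothetical and are not taken from the original implementation.

```python
def boxes_overlap(a, b):
    """Boxes are (x1, y1, x2, y2) in pixels, with x1 < x2 and y1 < y2."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union_box(a, b):
    """Smallest box that encloses both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_if_interacting(track_a, track_b, min_consecutive=2):
    """Merge two equal-length tracks into one region of interest per frame.

    Returns the per-frame union boxes if the tracks overlap in at least
    min_consecutive consecutive frames, otherwise None.
    """
    run = 0
    for a, b in zip(track_a, track_b):
        run = run + 1 if boxes_overlap(a, b) else 0
        if run >= min_consecutive:
            return [union_box(x, y) for x, y in zip(track_a, track_b)]
    return None


person = [(0, 0, 10, 10), (2, 0, 12, 10), (4, 0, 14, 10)]
other = [(20, 0, 30, 10), (11, 0, 21, 10), (12, 0, 22, 10)]
print(merge_if_interacting(person, other))
# [(0, 0, 30, 10), (2, 0, 21, 10), (4, 0, 22, 10)]
```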
Finally, inspired by [42], we use a background blur method to guide the sentence generator module to focus on each region of interest in each video segment during the generation of the sentences.
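To make the idea of background blurring concrete, the sketch below blurs every pixel outside a region of interest while leaving the region itself untouched. It is a deliberately simple stand-in: a box (mean) filter on a grayscale image stored as nested lists is assumed, whereas the actual blur used by the framework is not specified here; the function name and the `radius` parameter are hypothetical.

```python
def blur_outside_roi(image, roi, radius=1):
    """Mean-blur all pixels outside roi.

    image: 2D list of grayscale values.
    roi: (x1, y1, x2, y2) with exclusive upper bounds.
    """
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = roi
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if x1 <= x < x2 and y1 <= y < y2:
                continue  # keep the region of interest sharp
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


image = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]
blurred = blur_outside_roi(image, roi=(1, 1, 2, 2))
print(blurred[1][1])  # 9 (inside the region of interest, unchanged)
print(blurred[0][0])  # 2.25 (mean of the 2x2 neighbourhood)
```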

B. FEATURES EXTRACTION
When human actions are described, it is important to consider details of the person, place, and action [43]. Thus, the Encoder module comprises four main classes of features extracted from a given input video, as shown in Figure 1. All these features were extracted using off-the-shelf models pre-trained on large datasets, which has proved beneficial for video captioning tasks [20]. They are detailed as follows:
• Scene type features: A sample of 16 evenly-spaced frames per video was used to extract the max-pooling features from the last convolutional layer using the VGG model pre-trained on the Places365 dataset.¹ The final representation is a 512-dimensional feature vector.
• Spatial Features: For extracting spatial features, we used the ResNet-101 model [15], pre-trained on the Imagenet dataset. From a sample of 16 equally spaced frames, we extracted a 2048-dimensional semantic feature vector of each frame from the last pooling layer. Then, an average pooling operation was performed, resulting in the final feature vector of dimension 2048.
• Temporal Features: The ResNeXt-101 with 3D convolutions [44], pre-trained on the Kinetics dataset [19], was used to extract a 2048-dimensional semantic feature vector for every 16 frames (with 50% overlap). An average pooling operation was then applied to obtain a final vector with 2048 features.
• Human body skeleton features: We used the ST-GCN model [45], pre-trained on the Kinetics dataset, to extract significant complementary information for the spatial and temporal features. This is a graph-based model for modeling dynamic skeletons extracted with the OpenPose toolbox [46]. It aims to capture motion information in dynamic skeleton sequences. We performed a global max-pooling operation over all skeleton sequences to obtain a single 256-dimensional feature vector for a given video. The combination of skeleton features with spatial and temporal features was intended to improve the performance in action recognition and, consequently, in the descriptions of the videos [47].
Except for the scene type features, which were extracted from the original video frames, all other features were computed from the video segment processed by the TDL module. All these features are used in the encoder model to compute the final feature vector representation.
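The frame sampling schemes mentioned in the list above (16 evenly spaced frames, and 16-frame windows with 50% overlap) can be sketched as index computations. The rounding rule and the handling of videos shorter than one window are assumptions, and both function names are hypothetical.

```python
def evenly_spaced_indices(num_frames, num_samples=16):
    """Indices of num_samples frames spread evenly over the video."""
    if num_frames < 1:
        raise ValueError("video has no frames")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    # For very short videos some indices repeat, which keeps the count fixed
    return [round(i * step) for i in range(num_samples)]


def sliding_windows(num_frames, size=16, overlap=0.5):
    """(start, end) frame ranges, end-exclusive, for overlapping clips."""
    stride = max(1, int(size * (1 - overlap)))
    return [(s, s + size) for s in range(0, num_frames - size + 1, stride)]


print(evenly_spaced_indices(31))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
print(sliding_windows(40))
# [(0, 16), (8, 24), (16, 32), (24, 40)]
```

Each window would then be passed to the 3D network, and the resulting clip-level vectors averaged into one video-level vector.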

C. OPEN SET MODULE
The TI3D was initialized using the weights of the I3D and trained according to Section III-B. Then, it was used to extract features from both training and test videos. The features are used to train the EVM classifier, which predicts each action in the test set as known or unknown. The output of the module supports the caption generation by signalling whether the action belongs to a known or unknown class.
The TI3D was trained for 20 epochs, updating the triplets every epoch using the hard and semi-hard triplet mining strategy proposed by [11]. The learning rate was set to 0.02, the margin parameter to 0.2, and the batch size to 256. For the EVM, we set the tail size τ to 10% of the number of samples in the train set, the cover threshold for model reduction was set to 0.5, and the probability of inclusion (δ) to 0.5. These parameters were empirically set, based on previous experiments on the LIRIS dataset [48] used in this work.

D. ENCODER
This block aims to derive a feature vector representing the essential concepts needed to predict the next word for describing the ongoing action in the video. All the previous features extracted from the video were mapped into a common high-level abstract space by a feedforward network (FCN) with ReLU activations, as depicted in Figure 1.
Before the Features Fusion (FF) step, we fuse the output of the Open Action Recognition Module with the processed Temporal Features ($F_{tp}$) to take the unknown action information into account. Notice that the processed Place-type features ($F_p$), Spatial features ($F_{sp}$), and Human body skeleton features ($F_{sk}$) were retained to preserve essential information for caption generation, such as information about the place type and the number of people detected in the scene.
The output calculation of the encoder module provided by the FF can be formulated as follows: in which $W_1$, $W_2$, $W_3$, and $W_4$ are weight matrices; $U_p$, $U_{sp}$, $U_{sk}$, and $U_{tp}$ are features from the input modules: scene type, spatial, human body skeleton, and temporal, respectively; $b_1$, $b_2$, $b_3$, and $b_4$ are the bias vectors; denotes the ReLU activation function; $\otimes$ denotes the elementwise multiplication operator; $*$ is the convolution; is the concatenation operator; and $O_{uk}$ denotes the feature vector provided by the TDL module.

E. CAPTION GENERATION
This module performs sentence generation and uses two Long Short-Term Memory (LSTM) networks, a variant of the Recurrent Neural Network (RNN) that works better with long-term dependencies. The first LSTM encodes the preceding sequence of words $S = s_0, s_1, \dots, s_{t-1}$. The second LSTM predicts the next word based on the output of the first LSTM combined with the visual features computed by the Encoder module. The LSTM calculation used in this work is given by the following equations:

$$g_t = \tanh(U_g x_t + W_g h_{t-1}),$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes g_t,$$
$$h_t = o_t \otimes \tanh(c_t),$$

in which $U_g$ and $W_g$ are weight matrices; $x_t$ is the input at time $t$; $h_{t-1}$ is the previous state; $c_t$ is the cell state; and $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, respectively. The calculations of the unit gates are:

$$f_t = \sigma(U_f x_t + W_f h_{t-1} + b_f),$$
$$i_t = \sigma(U_i x_t + W_i h_{t-1} + b_i),$$
$$o_t = \sigma(U_o x_t + W_o h_{t-1} + b_o),$$

in which $U_f$, $U_i$, $U_o$, $W_f$, $W_i$, and $W_o$ are weight matrices, $b_f$, $b_i$, and $b_o$ are bias vectors, and $\sigma$ denotes the sigmoid activation function.

V. DATASET
There are a few datasets publicly available for video captioning task [3]. The most used datasets in the literature are MSVD [49] and MSR-VTT [50], containing a wide variety of open domain short videos. Each video has only a single main activity and multiple sentences with different details describing the video. Despite the availability of annotated datasets for the video captioning task, none of them contain specific information about the action performed in each video, such as an action categorization. This information is essential in detecting and recognizing known and unknown events in an open-set scenario. Also, they do not contain concurrent events happening in the same video.
To overcome the above-mentioned limitations, we extended the LIRIS human activities dataset with captions and temporal annotations of new actions. Furthermore, we evaluate the generalization of our method on the large-scale ActivityNet Captions dataset. Both datasets are detailed as follows and are made available for further studies.²

A. LIRIS CAPTIONS DATASET
The LIRIS human activities dataset [48] was designed for recognizing complex and realistic actions in videos and was made available for the ICPR-HARL'2012 competition. The full dataset contains 828 actions (including discussing, telephone calls, giving an item, etc.) performed by 21 different people in 10 different classes. Each action performed in a video contains spatial annotations in the form of a bounding box and temporal information (the beginning and end of the action). It was organized into two independent subsets: the D1 subset, with depth and grayscale images, and the D2 subset, with color images. The dataset also has unannotated actions, such as walking, running, whiteboard writing, book leafing, etc.
² http://labic.utfpr.edu.br/datasets/UTFPR-OSVidCAP.html
In this work, we used the D2 subset, which contains 367 annotated actions from 167 videos. Each action consists of one or more people performing one or more different activities. In addition, we extracted 116 video segments covering 15 different unannotated actions from the original videos to be used as unknown classes. Each new video segment was also annotated with spatial, temporal, and description information.
Reference [51] suggested that the number of reference sentences directly affects the accuracy of automated metrics. Those authors also report that using five reference sentences yields a substantial boost in performance compared with using only one. Following this work, we extended the LIRIS human activity dataset with five different descriptions for each action, as shown in Figure 2.

B. ACTIVITYNET CAPTIONS DATASET
The ActivityNet Captions dataset [31] is a large dataset proposed for dense-captioning events, which involves both detecting and describing events in a video.
It contains 20,000 videos split into around 50%, 25%, 25% for training, validation, and testing set, respectively. All videos were taken from the ActivityNet Dataset [52], a benchmark for video classification and detection, which covers 200 classes of activities. The dataset also has an overlap of 10% of the temporal descriptions, thus indicating the presence of concurrent events. Each video is annotated with a series of temporally localized descriptions.
Although the ActivityNet Captions dataset is available for download as a collection of YouTube video links, many of these videos are no longer available, as reported in previous works [53], and the pre-computed C3D features provided by the authors are, on their own, not sufficient for our experiments. Thus, we used the 12,714 videos that were still available for download. Videos shorter than 3 seconds were discarded due to the small number of extracted frames. As our approach focuses on describing entire videos rather than detecting a series of events, we used the ground-truth event proposals to extract 34,934 video clips, one for each temporally localized description provided in the annotations.
While ActivityNet Captions was originally designed for dense video captioning, we adapt it to our task by including action annotations to evaluate the generality of the proposed method on a large-scale dataset. Due to the considerable effort required to annotate each video clip manually, these annotations were collected from the ActivityNet dataset based on the video name, which is the same in both datasets. Each resulting action class contains, on average, 114 videos for training and 55 videos for testing. The action annotations were used to split videos into known and unknown classes for the detection of known and unknown actions.

VI. EXPERIMENTS
A. IMPLEMENTATION DETAILS
The proposed OSVidCap framework uses an encoder-decoder architecture. Therefore, both the encoder and caption generation modules (decoder) were trained in an end-to-end way. Before training, all captions were tokenized and converted to lowercase. Sparse words occurring less than three times in the training set were replaced with the unknown token. The fasttext [54] word embedding pre-trained on the Common Crawl Corpus was used to embed features into a 300-dimensional feature vector. It provides much more powerful and effective low-dimensional word representations for video captioning than other techniques such as sparse one-hot encoding vectors [55].
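The caption preprocessing described above (lowercasing, tokenization, and replacement of words that occur fewer than three times) can be sketched as follows. Whitespace tokenization and the `<unk>` spelling of the unknown token are assumptions, and all function names are hypothetical.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the unknown token


def tokenize(caption):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return caption.lower().split()


def build_vocab(captions, min_count=3):
    """Keep words that occur at least min_count times in the training captions."""
    counts = Counter(tok for caption in captions for tok in tokenize(caption))
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_rare(caption, vocab):
    """Map out-of-vocabulary words to the unknown token."""
    return [tok if tok in vocab else UNK for tok in tokenize(caption)]


train = ["A man walks", "a man sits", "A man talks"]
vocab = build_vocab(train)
print(sorted(vocab))                      # ['a', 'man']
print(replace_rare("A man runs", vocab))  # ['a', 'man', '<unk>']
```

Each remaining token would then be looked up in the pre-trained embedding table to obtain its 300-dimensional vector.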
During the training step, begin-of-sentence and end-of-sentence tokens were added to each sentence to deal with varying lengths. Also, an unknown tag was used to replace sparse words. During the test step, we input the begin-of-sentence token into our Caption Generation Module to start the description generation process. Then, previously generated words are used as input to produce the following words until the max sentence length or the end-of-sentence token is reached. In our experiments, the max sentence length was set to 19 and 25 for the LIRIS dataset and the ActivityNet Captions dataset, respectively. Zero padding is applied if the sentence is shorter than the max number of words. The Beam Search method was employed to select the best sentence and avoid local optima. In our experiments, the beam size k was set to 3.
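The decoding procedure described above can be illustrated with a generic beam search. This sketch is not the framework's decoder: the trained caption generator is replaced by a hypothetical `toy_model` lookup table, scores are plain cumulative log-probabilities without length normalization, and the token spellings `<bos>` and `<eos>` are assumptions.

```python
import math


def beam_search(step_fn, bos="<bos>", eos="<eos>", beam_size=3, max_len=19):
    """step_fn(prefix) returns a dict mapping next tokens to probabilities."""
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            # Hypotheses that emitted the end token leave the beam
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


def toy_model(prefix):
    """Hypothetical next-word distribution that depends on the last token only."""
    table = {
        "<bos>": {"a": 0.6, "the": 0.4},
        "a": {"man": 0.5, "woman": 0.5},
        "the": {"man": 0.9, "dog": 0.1},
        "man": {"walks": 0.7, "<eos>": 0.3},
        "woman": {"walks": 1.0},
        "dog": {"<eos>": 1.0},
        "walks": {"<eos>": 1.0},
    }
    return table[prefix[-1]]


print(beam_search(toy_model))
# ['<bos>', 'a', 'woman', 'walks', '<eos>']
```

In this toy example, a greedy decoder would follow "a man walks" (probability 0.21), whereas keeping three hypotheses recovers "a woman walks" (probability 0.30), which is the kind of local optimum that beam search is meant to avoid.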
We empirically set the LSTM hidden state to 512 units and applied dropout with a rate of 0.5 on the input and output of the LSTM. The Adam algorithm, with a learning rate of $5 \times 10^{-5}$, was used for optimization. The cross-entropy loss was used to train our model. All experiments were implemented using the TensorFlow and Keras libraries.
To demonstrate the effectiveness of the proposed method, we have conducted two experiments to analyze the influence of the open set module and compare the video caption performance with related works.

1) EXPERIMENTS ON THE LIRIS DATASET
Due to the small number of videos and known actions in the Liris dataset, we performed a 5-fold cross-validation procedure to assess the OSVidCap performance. The same training and testing set of each cross-validation fold was used to train the open set module. In addition, to evaluate the effectiveness of the proposed approach in detecting unknown events, we include in the testing set 116 videos with unknown actions as described in Section V-A.

2) EXPERIMENTS ON THE ActivityNet CAPTIONS DATASET
The performance of OSVidCap in generating captions of known events was evaluated using the standard data split.³ Since this dataset was made available as a challenge, the test set was not provided with the ground truth. Thus, we follow previous works [33], [53] and report the results on the validation set. The effectiveness of the proposed approach in detecting unknown events was evaluated using a 5-fold cross-validation procedure. Each fold contains known videos of 40 actions for the training and testing sets, as explained in Section V-B. We also included in the testing set $v_r$ random videos from other classes as unknown actions. The value of $v_r$ was set to the number of videos in the training set to avoid imbalanced data.

B. EVALUATION METRICS
The captions generated by the proposed framework were evaluated according to metrics frequently used in the area: BLEU [56], METEOR [57], ROUGE-L [58], and CIDEr [51]. All metrics were computed using the COCO caption API [59].
BLEU is a metric based on n-grams precision modified and measures the predicted sentence proximity with one or more reference descriptions. Following most previous works for video captioning [3], we used four-grams with the BLEU metric, which is referred as BLEU-4. METEOR is based on the precision, recall, and harmonic mean and consists of creating an alignment between uni-grams from candidate and reference sentences. The word matching supports morphological variants including stemming and synonyms. CIDEr is a consensus-based metric and measures the similarity of a generated sentence against a majority of a set of ground-truth sentences. It employs morphological variations by changing each word in their stem (or root form) to resolve word-level correspondences. ROUGE-L computes the recall and precision scores using the longest common subsequences (LCS) technique and tends to reward long sentences with high recall. In our experiments, BLEU, METEOR, and ROUGE metrics were normalized to range from 0 to 100, with 100 as identical to the reference sentence. CIDEr ranges from 0 to 1000, with 1000 as identical to the reference.
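To make two of these building blocks concrete, the sketch below computes the modified (clipped) n-gram precision underlying BLEU and an LCS-based ROUGE-L score for tokenized sentences. It is an illustration only and does not reproduce the COCO caption API: the function names are hypothetical, BLEU's brevity penalty and the geometric mean over n-gram orders are omitted, and the weighting factor beta = 1.2 is an assumed default.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped by
    its maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta > 1 weights recall more than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


candidate = "a man is shaking hands with a woman".split()
reference = "a man shakes hands with a woman".split()
print(clipped_precision(candidate, [reference], 1), rouge_l(candidate, reference))
```

Because the example relies on exact token matching, "shaking" and "shakes" do not match; METEOR and CIDEr address this case through stemming.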

C. QUANTITATIVE RESULTS
In this section, the performance evaluation of the proposed method is presented and compared with two recent existing approaches.
SGN [60] understands a video through semantic groups based on meanings such as people, objects, or actions, rather than frame by frame. It comprises four main components: (i) a Visual Encoder that extracts visual features from video frames; (ii) a Phrase Encoder that produces phrase representations from words by using the self-attention mechanism; (iii) a Semantic Grouping component that employs a semantic aligner to align the video frames with phrases; and (iv) a Decoder based on LSTM with temporal attention.
The Non-Autoregressive Coarse-to-Fine (NACF) model [61] uses a coarse-to-fine captioning procedure with a bi-directional self-attention-based network as the caption generator. To improve caption quality, decoding is decomposed into two stages. First, a coarse-grained "template" is generated. Then, dedicated decoding algorithms generate fine-grained descriptions by filling in the generated "template" with suitable words and modifying inappropriate phrasing via iterative refinement.
For a fair comparison, all the methods use ResNet-101 and ResNeXt-101 features as input, and the reported results were obtained using the Microsoft COCO caption evaluation tool [59]. Furthermore, all approaches were configured with the same maximum sentence length and minimum word frequency during training.
Table 1 compares the performance of OSVidCap with existing approaches on the LIRIS dataset. Our model OSVidCap (S+T) achieved better performance in terms of ROUGE-L and CIDEr and competitive performance in terms of BLEU and METEOR. Also, our model OSVidCap (S+T+SK+P) surpasses the compared approaches by 4.9% in BLEU-4, 5.1% in METEOR, 4.3% in ROUGE-L, and 9.3% in CIDEr. This suggests that our approach can better describe concurrent events in videos. In addition to spatial (S) and temporal (T) features, this model considers human body skeleton (SK) features extracted from human movements and Place-type (P) features extracted from places. This indicates that specialized features can be essential to better describe similar actions or actions that depend on the context (place). Such feature enrichment provides essential information to distinguish some actions, such as shaking hands and giving a small item to a second person. Also, the place type gives meaningful semantic information, as some actions tend to happen in specific places.
Table 2 presents the video captioning comparison on the ActivityNet Captions dataset. The proposed approach also achieved better or competitive results across all metrics, showing robust generalization to other contexts and scenarios. It is also noteworthy that the values of the metrics presented in Table 2 are significantly lower than those presented in Table 1, owing to the different complexity of the datasets, as discussed in Section V. The performance reported on this dataset is similar to that reported in recent literature [53], [62]. Note that, although those works report results on the same dataset, they are not directly comparable with the presented approach, as the videos and features used for training, validation, and testing are different.
In both datasets, the use of Place-type features did not show significant improvements. This may indicate that the other features already capture this visual information, or that Place-type features are of little relevance to the video description task.
Table 3 shows the performance of the open-set module in detecting known and unknown actions on the LIRIS dataset. Results are presented for a 5-fold cross-validation procedure. The proposed method achieved satisfactory results in detecting known and unknown classes, with an average F1-Score of 86.2%. Table 4 shows the performance of the open-set module in detecting known and unknown actions on the ActivityNet Captions dataset. Five experiments with different numbers of known classes were performed in a cross-validation procedure. The proposed method achieved satisfactory results in detecting known and unknown classes, with an average F1-Score of 79.80% when ten classes were considered as known actions.
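For reference, the per-class precision, recall, and F1-Score of a known/unknown detector can be computed as in the sketch below. The function name and the label strings are hypothetical, and the toy predictions are invented for illustration; they are not taken from Tables 3 or 4.

```python
def per_class_scores(y_true, y_pred, labels=("known", "unknown")):
    """Return precision, recall, and F1 for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example: two unknown videos are mistaken for known actions.
y_true = ["known"] * 4 + ["unknown"] * 4
y_pred = ["known"] * 4 + ["unknown", "unknown", "known", "known"]
print(per_class_scores(y_true, y_pred))
```

In this toy case, every video predicted as unknown is truly unknown (high unknown-class precision), while all known videos are retrieved (high known-class recall).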
In Table 4, it can also be seen that the average precision of the unknown class is about 9% higher than that of the known class, and the average recall of the known class is 13% higher than that of the unknown class. In other words, predictions of the unknown class are more often correct, while a larger share of known-class samples is retrieved.
The automatic annotation process of video actions on the ActivityNet Captions dataset, as described in Section V-B, also introduced some annotation noise into the training and testing data. This noise can be an action annotated with a different label or even a video without human actions. Figure 3 depicts an example of a video from the dataset. The video has different events with different start and end times. The automatic annotation process assigned the action class "Removing ice from car" to all video clips. However, in this example, only two video clips are related to the annotated action. Therefore, the degradation in the average precision of the known class may have been caused by this annotation noise. When considering new actions as known classes, the average F1-Score decreased due to the cumulative annotation errors introduced by the automatic annotation process, as reported below.
Table 5 reports the impact of the open-set component on the video descriptions generated by the proposed approach. The results reported on the LIRIS dataset used the same cross-validation data as in Table 3. For the results on the ActivityNet Captions dataset, we used the 5-fold cross-validation applied in Table 4.
These results are significantly higher when compared with those reported in Tables 1 and 2 because, in this experiment, we considered videos in the test set with unknown activities. For these videos, the model is supposed to generate descriptions such as ''a person is performing an unknown action''.
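The behavior expected for unknown activities can be sketched as a simple gate placed before caption generation. This is an illustrative sketch only: is_known and generate_caption are hypothetical callables standing in for the open-set module and the caption decoder, and the feature representation is left abstract.

```python
UNKNOWN_CAPTION = "a person is performing an unknown action"


def describe_segment(features, is_known, generate_caption):
    """Return a caption for one tracked video segment.

    is_known and generate_caption are hypothetical stand-ins for the
    open-set module and the caption decoder, respectively.
    """
    if is_known(features):
        return generate_caption(features)
    # Unknown actions receive a fixed sentence instead of a decoded caption.
    return UNKNOWN_CAPTION


# Toy usage with stub components (assumed, for illustration only).
segments = [{"score": 0.9}, {"score": 0.1}]
captions = [
    describe_segment(
        seg,
        is_known=lambda f: f["score"] >= 0.5,
        generate_caption=lambda f: "a man is leaving baggage unattended",
    )
    for seg in segments
]
print(captions)
```

With such a gate, test videos of unknown classes contribute sentences that match their reference description whenever the open-set decision is correct.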
The experiments with unknown actions in the testing set suggest that Place-type features did not lead to a significant improvement. However, these features are important for understanding scenes in which the place type is relevant, for example, to describe whether a person is entering or leaving an office or writing on a whiteboard in a classroom. In the testing set used for the experiments in Table 5, several videos from unknown classes were included to evaluate the proposed open set module. Therefore, the overall influence of the Place-type features decreased quantitatively due to the small number of sentences that require such features. To the best of our knowledge, this is the first work to address the video captioning task in an open-set world by generating captions of known events present in the training set and dealing with unknown events not previously seen.

D. QUALITATIVE RESULTS
In Figure 4, we illustrate three examples of video descriptions generated by the baseline methods SGN and NACF and by the proposed OSVidCap. Figure 4a depicts a scene with two sequential actions. First, a man in a striped t-shirt talks to a woman in front of a whiteboard. Then, another man in a black t-shirt enters the room and gives an item to the man in the striped t-shirt. Figure 4b shows two concurrent events. While a man and a woman are shaking hands, another man is leaving baggage unattended. Finally, in Figure 4c, three events take place in the video. While one man is performing an unknown action, another man leaves an item in the letterbox cabinet and then enters the room.
For the examples in Figure 4, our approach described concurrent actions better than the baselines. In Figure 4a, OSVidCap correctly described the ongoing action but wrongly described the color of the t-shirt, suggesting that the model did not learn this information from the input features. More specific features could possibly address this issue.
In Figure 4b, we can observe that the compared approaches could not detect the handshaking action, which suggests the importance of human body features for describing human action videos. They also failed to detect and describe concurrent actions in the video.
The importance of the open set module is evident in the situation shown in Figure 4c. While OSVidCap detected an unknown action performed by a man and correctly described it as such, the compared approaches generated a wrong description. It is worth highlighting that this action was previously labeled as unknown and did not appear in the training set.

VII. CONCLUSION AND FUTURE WORKS
The majority of artificial intelligence methods rely on the closed-set world assumption. The same holds for the specific case of automatic video captioning systems. Existing methods based on a closed-set world can adequately describe only the temporal events previously seen during the training step. Unless they are trained with all existing events and actions of interest, they will not be able to recognize unknown events found in videos in the wild. Furthermore, most current approaches for video description focus only on single actions occurring at a time, while in the real world, concurrent events may take place. To address the above-mentioned issues, in this paper we proposed the OSVidCap framework, which can detect and describe concurrent events in an open-set world scenario. From a given input video, the TDL module detects and tracks humans and outputs a set of video segments to be described. Then, spatial and temporal features are extracted from each video segment. Also, the open set module, built upon the TI3D metric learning approach coupled with an extreme value machine (EVM), classifies each detected action as a known or unknown class. Then, the Encoder module computes the features and generates a fixed-length vector that represents the whole video content. Finally, the caption generation module, based on the LSTM, generates the descriptions in a human-comprehensible form.
Experimental results demonstrate the effectiveness of the framework in describing concurrent events in a given video. Also, the open-set module allows the framework to describe unknown events. Our experiments also show that different features such as the Human body skeleton and Place-type features are quite relevant to understand fine-grained actions, frequently performed in specific environments. Such feature enrichment provides a better video representation for generating a more detailed description. Furthermore, due to the lack of specific datasets for evaluating concurrent events in an open-set scenario, we have contributed new annotations of unknown actions in the LIRIS human activity dataset that can be used as a benchmark for the proposed task.
Despite the results achieved by OSVidCap, we observed that its descriptions of people could be more detailed, for instance, by including the type and color of their clothes. Such detail can play an important role in surveillance applications. The TDL module is capable of capturing individual humans or objects of interest and simple interactions between them by capturing the overlapping region among objects. However, the proposed module may fail to capture more complex human interactions.
Therefore, in future work, the proposed framework will be extended by enriching the description of people in the scene and by improving the detection of events involving persons who interact at a distance, such as watching TV or throwing an object to another person. Another direction for future work is a human evaluation over a subset of the testing data, as existing metrics used for the automatic evaluation of video captioning may not properly correlate with human judgment.