A Multimodal Framework for Video Caption Generation

Video captioning is a highly challenging computer vision task in which video clips are automatically described using natural language sentences that reflect a clear understanding of the embedded semantics. In this work, a video caption generation framework consisting of a discrete wavelet convolutional neural architecture along with multimodal feature attention is proposed. Global, contextual and temporal features of the video frames are taken into account, and separate attention networks are integrated into the visual attention predictor network to capture multiple attentions from these features. These attended features, together with textual attention, are employed in the visual-to-text translator for caption generation. Experiments are conducted on two benchmark video captioning datasets, MSVD and MSR-VTT. The results demonstrate an improved performance of the method, with CIDEr scores of 91.7 and 52.2 on the respective datasets.


I. INTRODUCTION
Video caption generation aims to automatically generate meaningful natural language descriptions of a video. For this, a clear understanding of the semantic details as well as the contextual visual relationships between the different objects present in the video is needed. Many algorithms have been proposed for this purpose [7].

Most of the deep learning frameworks employed for caption generation use an encoder-decoder structure. The encoder utilizes a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) for extracting the visual and semantic details in the video. It generates a feature vector representation corresponding to the visual content in the video, which is then given to a decoder having sequential models such as an RNN, Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) that performs the visual-to-natural-language translation [8], [9], [10].
In an early attempt made by Venugopalan et al., a novel video-to-text generation methodology is presented, which extends image captioning methods by incorporating a 2D-CNN network along with mean pooling and an RNN decoder structure [11]. However, this method fails to use the temporal details present in the video for caption generation. Subsequently, the S2VT model is proposed in [12], which uses a stacked LSTM network to learn the temporal information in a sequence of frames and then produce a sequence of words. Later, attention mechanisms are included in the spatial as well as temporal domains to achieve better performance [13], [14]. Video descriptions can also be generated by employing attention in the decoder section as well as by using multimodal fusion of visual, text and audio features [8], [15].

The proposed framework is evaluated against state-of-the-art methods using the evaluation metrics BLEU, METEOR and CIDEr.

The paper is organized as follows: Section II gives a brief review of the research works existing in this area. The details regarding the proposed architecture and the experimental results are described in Sections III and IV, respectively, and Section V concludes the paper.

II. LITERATURE SURVEY
In early approaches, caption generation of videos is done using classical template-based techniques that employ SVO triplets, namely Subject (S), Verb (V) and Object (O) [17]. These triplets are found individually and then combined to form a sentence. Many encoder-decoder architectures have been proposed that use 2D-CNN/3D-CNN structures as the encoder for generating feature representations and sequential models like RNN, LSTM and GRU as the decoder for language translation [18], [19]. A two-step captioning approach that learns the correspondence between semantic representation labels and verbalization before translating it to natural language is introduced in [20].

III. PROPOSED ARCHITECTURE
The proposed framework employs a 2D-WCNN network having a modified ResNet-50 structure [35] with two-level DWT decomposition to provide a better time-frequency representation of the frames. In DWT decomposition, each frame in the video is decomposed into four subbands, which highlight the frequency details in the image. Hence, the utilization of the DWT pre-processing stage together with the convolutional neural network helps to extract some of the distinctive spectral features that are more predominant at the subband levels of the frames, in addition to the spatial, semantic and channel details. The detailed structure of the 2D-WCNN network is shown in Fig. 2. Each of the input frames, resized to 224 × 224, is subjected to two-level multi-resolution decomposition.
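As an illustration of this pre-processing step, the following is a minimal sketch of a two-level 2D DWT decomposition of a single frame using PyWavelets. The choice of the Haar wavelet and the single-channel input are assumptions made for illustration, since the exact wavelet basis is not restated here.

```python
import numpy as np
import pywt  # PyWavelets


def two_level_dwt(frame: np.ndarray, wavelet: str = "haar"):
    """Decompose a 224x224 single-channel frame into two levels of DWT subbands.

    Level 1 yields an approximation band and three detail bands; the
    approximation band is decomposed again to obtain the second level.
    """
    # Level-1 decomposition: cA1 is the low-frequency approximation,
    # (cH1, cV1, cD1) are the horizontal, vertical and diagonal detail bands.
    cA1, (cH1, cV1, cD1) = pywt.dwt2(frame, wavelet)

    # Level-2 decomposition applied to the level-1 approximation band.
    cA2, (cH2, cV2, cD2) = pywt.dwt2(cA1, wavelet)

    return {
        "level1": {"approx": cA1, "horizontal": cH1, "vertical": cV1, "diagonal": cD1},
        "level2": {"approx": cA2, "horizontal": cH2, "vertical": cV2, "diagonal": cD2},
    }


# Example: a random array standing in for a resized 224x224 video frame.
frame = np.random.rand(224, 224).astype(np.float32)
subbands = two_level_dwt(frame)
print(subbands["level1"]["approx"].shape, subbands["level2"]["approx"].shape)  # (112, 112) (56, 56)
```

In the 2D-WCNN described above, subband representations of this kind are what the modified ResNet-50 convolutional stages operate on, rather than being returned directly as here.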
The structure of the CORE module is shown in Fig. 3. The object regions are identified using the 2D-WCNN network and a Region Proposal Network (RPN), similar to that of Faster R-CNN, along with classifier and regression layers for creating bounding boxes. The identified objects are then paired, and different subimages are created, each containing one of the identified object pairs. For uniformity, these subimages are resized to 32 × 32 and are given to CNN layers having two sets of 64 filters, each with a receptive field of 3 × 3, as shown in Fig. 3. The obtained feature maps highlight the spatial relationship between the object pairs. The spatial relation feature maps of each object pair are then stacked together and given to a 1 × 1 × 64 convolution layer to form the contextual spatial relation feature map. These CORE features are passed through a fully connected layer to produce a 2048-dimensional feature vector, V_c^i.
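To make the pairwise relation branch concrete, a minimal PyTorch-style sketch is given below. The class and argument names, the fixed number of object pairs, the 3-channel crops and the channel-wise stacking are illustrative assumptions; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class PairRelationBranch(nn.Module):
    """Sketch of the contextual relation branch: each object-pair subimage passes
    through two 3x3 conv layers of 64 filters, the per-pair maps are stacked along
    the channel axis, fused by a 1x1 convolution into a contextual spatial relation
    map, and projected to a 2048-d feature vector."""

    def __init__(self, num_pairs: int = 6, feature_dim: int = 2048):
        super().__init__()
        self.num_pairs = num_pairs
        self.pair_cnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 convolution producing the 64-channel contextual relation map.
        self.fuse = nn.Conv2d(num_pairs * 64, 64, kernel_size=1)
        self.project = nn.Linear(64 * 32 * 32, feature_dim)

    def forward(self, pair_subimages: torch.Tensor) -> torch.Tensor:
        # pair_subimages: (num_pairs, 3, 32, 32) resized object-pair crops.
        maps = self.pair_cnn(pair_subimages)     # (num_pairs, 64, 32, 32)
        stacked = maps.reshape(1, -1, 32, 32)    # stack the pair maps channel-wise
        fused = self.fuse(stacked)               # (1, 64, 32, 32)
        return self.project(fused.flatten(1))    # (1, 2048) contextual feature


# Example with six hypothetical object-pair crops from one frame.
crops = torch.randn(6, 3, 32, 32)
v_c = PairRelationBranch(num_pairs=6)(crops)
print(v_c.shape)  # torch.Size([1, 2048])
```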

where ⊕ denotes concatenation and G_att represents the global scaled dot-product attention with independent head projection matrices. The attended output features so obtained from the global, contextual and temporal modalities are then combined in the visual attention predictor. The VT_att features are fed to a linear network and, finally, the prediction of words is performed by the softmax layer. During the training phase, the cross-entropy loss L_CE accumulated over all time steps is used, which is expressed as

$L_{CE}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(W_t \mid W_{1:t-1})$,

where W_{1:t-1} represents the ground truth sequence up to time step t and θ denotes the model parameters.
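For reference, a minimal sketch of multi-head scaled dot-product attention with independent head projection matrices is shown below. The module name, dimensions and tensor shapes are illustrative assumptions rather than the exact G_att configuration used in the framework.

```python
import math
import torch
import torch.nn as nn


class ScaledDotProductAttention(nn.Module):
    """Generic multi-head scaled dot-product attention:
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with one projection per head."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        # Independent projection matrices for queries, keys and values.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        b = query.size(0)
        # Project and split into heads: (batch, heads, seq_len, d_k).
        q = self.w_q(query).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores followed by a softmax over the key positions.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        attended = torch.matmul(scores.softmax(dim=-1), v)

        # Merge the heads back and apply the output projection.
        attended = attended.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_k)
        return self.w_o(attended)


# Example: attending over 20 frame-level features of dimension 512.
feats = torch.randn(2, 20, 512)
out = ScaledDotProductAttention()(feats, feats, feats)
print(out.shape)  # torch.Size([2, 20, 512])
```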

The model so designed has to undergo exhaustive evaluation to reveal its effectiveness in video captioning, as discussed below.

IV. EXPERIMENTAL RESULTS
Both qualitative and quantitative analyses of the proposed framework have been carried out with different datasets and performance evaluation metrics. The results of this analysis and a comparative study with state-of-the-art video captioning techniques are presented in this section.

Positional encoding is combined with the word embeddings [16], which provides information regarding the position of the tokens in the sequence.
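As an illustration, a sketch of the standard sinusoidal positional encoding added to token embeddings is shown below. The sinusoidal form and the dimensions are assumptions for illustration, since the exact encoding is only indicated through the citation.

```python
import math
import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding table of shape (max_len, d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


# Example: add position information to a batch of word embeddings.
word_embeddings = torch.randn(2, 15, 512)             # (batch, seq_len, d_model)
pe = sinusoidal_positional_encoding(15, 512)          # (seq_len, d_model)
tokens_with_position = word_embeddings + pe.unsqueeze(0)
print(tokens_with_position.shape)  # torch.Size([2, 15, 512])
```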

The weight initialization of the model is done using the Xavier scheme. To avoid overfitting, dropout and early stopping are used in the method. A self-critical training strategy is employed in the implementation, where the model is first trained for 50 epochs with the cross-entropy loss and then fine-tuned for 25 epochs using the self-critical loss to achieve the best CIDEr (CD) score on the validation set. This helps to tackle the exposure bias problem that arises when optimizing with the cross-entropy loss alone. During the testing phase, a beam search strategy is adopted to select the best caption from a few selected candidates, with the beam size chosen as 5.

This proves the ability of our method to highlight the finer details in the input video clip. Table 5 reports a comparison with existing methods [33] on the MSR-VTT dataset. For this dataset also, the method achieves impressive B@4, MT and CD scores of about 44.9%, 29.8% and 52.2%, respectively, which indicates its better performance compared to the existing methods. The inclusion of DWT in the architecture helps to extract the fine visual details present in the video clips more efficiently than the other methods. The method extracts three different features from the video for multimodal video representation. Attention is then captured from these three modalities simultaneously and combined to acquire all the attentive regions in the video that highlight the underlying video semantics. The textual attention interleaved with the aforementioned attention helps to generate captions that are on par with human-generated ones.

Experimental studies were conducted to validate the enhanced performance of the method with the inclusion of global, contextual and temporal features in the video, with and without discrete wavelet decomposition. An ablation study is also conducted to analyze the effectiveness of including multiple attention stages in the VAP and VTT networks. Table 6 shows the resulting B@4 and CD scores. An ablation using the network without wavelet decomposition, together with contextual information and the C3D model for obtaining temporal details in the video, is also carried out; this baseline achieved B@4 and CD values of about 50.6% and 89.2%, respectively. The enhancement of the proposed model over this baseline is due to the extraction of spectral information along with the spatial, temporal and semantic details in the input video.
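As a sketch of the self-critical fine-tuning stage described above, the snippet below illustrates the usual REINFORCE-with-baseline formulation, in which the reward of a sampled caption is compared against the reward of the greedy-decoded caption. In practice the rewards would come from a CIDEr scorer; here they are dummy tensors, and the function name and shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch


def self_critical_loss(log_probs: torch.Tensor,
                       sampled_reward: torch.Tensor,
                       greedy_reward: torch.Tensor) -> torch.Tensor:
    """Self-critical sequence training loss.

    log_probs:      (batch, seq_len) log-probabilities of the sampled caption tokens.
    sampled_reward: (batch,) CIDEr reward of the sampled captions.
    greedy_reward:  (batch,) CIDEr reward of the greedy (baseline) captions.
    """
    # Advantage: how much better the sampled caption is than the greedy baseline.
    advantage = (sampled_reward - greedy_reward).unsqueeze(1)   # (batch, 1)
    # REINFORCE objective: increase the log-probability of captions that beat
    # the baseline, decrease it otherwise.
    return -(advantage.detach() * log_probs).sum(dim=1).mean()


# Example with dummy values standing in for model outputs and CIDEr rewards.
log_probs = -torch.rand(4, 12)                         # dummy per-token log-probabilities
sampled_reward = torch.tensor([0.9, 0.4, 0.7, 0.6])    # reward of sampled captions
greedy_reward = torch.tensor([0.8, 0.5, 0.7, 0.4])     # reward of greedy captions
loss = self_critical_loss(log_probs, sampled_reward, greedy_reward)
print(loss.item())
```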

An experimental study has also been conducted to find the optimal number of attention blocks with the introduction of the self-critical loss. The results are highlighted in Fig. 5. The system performance improves with the number of attention blocks in the VAP-caption decoder stages of the transformer network, but it saturates after a particular number of attention blocks. Thus, for the proposed model, the optimal number of attention blocks is found to be 4, with B@4 and CD values of about 53.6% and 91.7% for the MSVD dataset and 44.9% and 52.2% for the MSR-VTT dataset, respectively.

The quality of the captions generated by the method is illustrated with a few sample video clips from both datasets, as shown in Fig. 6. Here, the baseline method is the network without WCNN, as mentioned above.

Even though the proposed method gives better performance in the reported evaluation metrics, it still has some limitations. One such case is illustrated in Fig. 7, where a motorcyclist meets with an accident after losing control of the bike.