Tencent AVS: A Holistic Ads Video Dataset for Multi-Modal Scene Segmentation

Temporal video segmentation and classification have been advanced greatly by public benchmarks in recent years. However, such research still mainly focuses on human actions and fails to describe videos in a holistic view. In addition, previous research tends to pay much attention to visual information while ignoring the multi-modal nature of videos. To fill this gap, we construct the Tencent 'Ads Video Segmentation' (TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level. TAVS describes videos from three independent perspectives, 'presentation form', 'place', and 'style', and contains rich multi-modal information such as video, audio, and text. TAVS is organized hierarchically in its semantic aspects for comprehensive temporal video segmentation, with three levels of categories for multi-label classification, e.g., 'place' - 'working place' - 'office'. Therefore, TAVS is distinguished from previous temporal segmentation datasets by its multi-modal information, holistic view of categories, and hierarchical granularities. It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels. Together with TAVS, we also present a strong multi-modal video segmentation baseline coupled with multi-label class prediction. Extensive experiments are conducted to evaluate our proposed method as well as existing representative methods, revealing the key challenges of TAVS.


Introduction
With the remarkable progress in action recognition approaches [4,7,17,18,52,61,63,64,68] driven by large-scale benchmarks [9,22,49] in recent years, the de facto trend of video understanding is to understand human actions by capturing appearance and motion cues. However, few researchers question whether a single perspective such as human action is enough to understand videos, given their diverse semantic aspects and multi-modal nature. An important reason limiting the comprehensive understanding of videos is the lack of proper benchmarks that incorporate a representative domain of videos with rich semantics beyond actions and build a comprehensive hierarchy of categories.
Recently, some initial attempts, such as Holistic Video Understanding (HVU) [15] and MovieNet [26], have been made to overcome the above limitations. HVU collects more annotations from six semantic aspects, e.g., 'attribute' and 'concept', for existing action recognition datasets [9,21,31,53,71]. It tries to enrich the diversity of video recognition by assigning multiple labels to each video, but it lacks the important temporal analysis process and multi-modal information because it inherits the characteristics of previous datasets. MovieNet further overcomes more limitations in temporal analysis, multi-modal information, and categories from multiple aspects by collecting long-form movies. Yet, it still has coarsely defined segments and insufficient aspects of categories: 1) Its temporal analysis is based on shots detected by traditional methods [10], and no accurate temporal annotation is provided, reducing the temporal segmentation problem to a sequential binary classification problem. This ambiguity of temporal analysis makes it less practical for real-world applications. 2) MovieNet only adds 'place' classes to its categories, which are even less important than actions for video understanding.
Ideally, we can say that we 'understand' a video only if we correctly decompose it into semantics-consistent parts, recognize the semantics of each part, and recover the content of the whole story in another explainable form, e.g., a set of labels. To find a good representative domain for such high-level intelligence, we believe that ads videos, whose rich plot developments and multi-modal nature make them akin to 'short movies', are a proper arena for video understanding algorithms. In this work, we present the Tencent 'Ads Video Segmentation' (TAVS) dataset. Beyond previous datasets that focus on recognition tasks, e.g., actions or places, we try to incorporate more high-level and abstract classes, e.g., classes from the perspective of video making, to make a step towards cognition tasks and examine algorithms' ability to learn abstract concepts. For example, we have 7 classes of relationships, e.g., lovers or friends, 6 classes of themes, e.g., family or teaching, and 4 classes of production forms, e.g., animation or cross-cutting. They will be correctly classified only if the algorithms really learn the plot developments across shots over long-term content in videos. To accommodate the multi-modal nature of videos, e.g., video, audio, and text (Fig. 2(a)), we also include some classes for modalities other than vision, e.g., dubbing (Fig. 2(b)). As a result, a three-level class hierarchy from three perspectives, 'presentation form', 'style', and 'place', is built to facilitate research on video understanding.
Based on the granularity of our categories, we also define the concept of a scene as a 'super-shot' at the semantic level, with a corresponding granularity of temporal segments. A scene commonly contains one or more consecutive shots and conveys high-level semantics; e.g., the first scene in Fig. 1 consists of 4 shots, shifting between the couple in the street and the customer service. Yet, all 4 shots are about making a phone call, so they are grouped into a single scene. By detecting scene boundaries, an effective temporal structure for understanding ads videos is obtained. To make the temporal boundaries unambiguous, we annotate accurate scene boundaries, which also makes our dataset more useful for downstream applications such as video summarization or editing. Based on these well-defined temporal segments and categories, we believe that our proposed setting of the scene segmentation task, which requires algorithms to temporally segment videos into scenes accurately and to predict scene-level multi-label classification results, is a good representative task for multi-modal video understanding.
Motivated by the need for a scene segmentation task with multi-modal information, we propose a novel end-to-end 'Multi-modal Scene Segmentation Network' (MM-SSN) for simultaneous temporal segmentation and multi-label classification. MM-SSN incorporates multiple prediction heads for dense scene boundary classification, scene boundary refinement, and multi-label classification. It overcomes limitations of previous methods in temporal action segmentation/detection and scene segmentation. The multi-modal feature fusion, temporal modeling, and prediction heads in MM-SSN are integrated into an effective end-to-end framework for the scene segmentation task.
In summary, our main contributions are three-fold: 1) We propose a new setting of the scene segmentation task for understanding realistically difficult ads videos with multi-modal information. 2) We build a new benchmark dubbed TAVS. It provides high-quality, temporally accurate, multi-label annotations with a comprehensive class hierarchy from three perspectives. 3) To reveal the key challenges of TAVS, we propose an end-to-end multi-modal video segmentation model and choose proper metrics to evaluate the performance of our proposed baseline and prior representative methods. We conduct extensive ablation studies and hope to facilitate future research on multi-modal video understanding.

Video Representation Learning
Single-modal Video Classification. Video classification is usually investigated in the form of action recognition with a single modality. Early efforts, including KTH [48], Weizmann [8], UCF-101 [53], and HMDB [31], commonly have a limited number of categories covering human actions. To perform large-scale video representation learning and also serve as pre-training for downstream tasks, datasets with automatically generated noisy YouTube tags [1,29] or accurate human annotations [9] have been constructed for action recognition. Recently, fine-grained action categories with a hierarchy have been proposed to reduce the scene bias from backgrounds in action recognition [20,49]. However, their categories (even those with a class hierarchy) are still limited to human actions and lack a holistic view of videos, such as their style and presentation form. They also ignore multi-modal information and have no temporal annotations, so our dataset differs from them. Multi-label Video Classification. Previous datasets mainly result from untrimmed videos with concurrent or temporally independent actions, such as Multi-MiT [44], Charades [51], ActivityNet [22], MPII-Cooking [46], and MultiTHUMOS [69]. The noisy annotations generated from YouTube tags [1,29] also yield some videos with multiple classes. Our dataset has accurate multi-label annotations from three perspectives to achieve a holistic view of ads videos, and is therefore different from the previous ones. The multi-label dataset most similar to ours in the video classification area is HVU [15], with six different classification tasks. However, it provides neither multi-modal contents nor temporal annotations for video segments, so we address a different need. Multi-modal Video Representation Learning. Early efforts in supervised multi-modal video classification mainly explore the usage of rich web data with image, video, audio, and text and construct their own datasets [28,38,39,59], yet there are few public benchmarks for evaluation. Some other attempts exploit audio or text information from public video classification benchmarks to enrich video representations [66,67]. Recently, large-scale unsupervised multi-modal video datasets without categories have been proposed to learn better representations for downstream tasks [33,43,56], showing superior performance over supervised counterparts on many downstream tasks, e.g., action recognition [3,42], action segmentation [42,73], and video retrieval [3,19,42,73]. Our dataset provides human-annotated class labels and temporal boundaries for the video segmentation task, so it differs from the previous datasets.

Temporal Video Segmentation
Temporal Action Segmentation Datasets. These datasets commonly contain instructional videos [30,55,57] with humans performing fine-grained actions and require algorithms to predict an action label for each frame using only visual information. Our dataset differs from them by providing rich multi-modal information and requiring multi-label predictions for each video segment. Scene Segmentation Datasets. The major difference from action segmentation is the shot-based boundary detection setting: videos are split into shots during dataset construction, and the shot-based boundary detection task is transformed into a sequential binary classification problem, i.e., whether each shot boundary is a scene boundary. Typical datasets cover short videos [47] or documentaries [5], and have recently been scaled up to long-form movies [45]. They commonly contain multi-modal information, such as place, actor, action, and audio, but they lack the category hierarchy needed to classify each segment for comprehensive video understanding, as well as accurate annotations of segment boundaries. Therefore, this task is different from ours and its methods are not directly applicable to our task. Temporal Modeling in Video Segmentation. Scene segmentation methods [45] often use an LSTM [24] with shot-level multi-modal inputs to predict a binary classification result for each shot boundary as a scene boundary. This type of temporal modeling cannot be applied to our dataset with frame-level temporal annotations. Action segmentation methods typically use temporal modeling networks such as encoder-decoders [32,34] or dilated convolutions [16] upon off-the-shelf feature extractors [9] to predict a single action for each frame. Recent approaches [27,65] also incorporate action boundary detection to handle over-segmentation errors. Our proposed end-to-end baseline is built upon MS-TCN++ [35] from action segmentation by adding a multi-modal fusion module to handle multiple modalities as input and extra prediction heads to account for multi-label classification, scene boundary classification, and scene boundary refinement, respectively.

The TAVS Benchmark
In this section, we introduce our Tencent Ads Video Segmentation (TAVS) benchmark. To make it solid, we collect large-scale ads videos from real industry data and annotate scene boundaries and class labels with careful quality control. We also examine existing metrics and choose proper ones to evaluate performance on our benchmark.

Ads Video Segmentation Dataset
Based on the 'scene' defined in Sec. 1, we build our large-scale TAVS dataset with accurate annotations for scene boundaries and scene-level multi-label categories. Data Collection. Our ads videos are collected from real industry data, and their diversity is manually controlled according to the domains, such as education, e-business, video games, finance, and Internet services. Different from the cooking/instructional actions in the action segmentation task or the movies in the scene segmentation task, we believe that ads videos from various industrial domains provide more diverse scenarios to test algorithms, e.g., people moving or talking in the education domain and artificial scenes in video games. Taxonomy Generation. By carefully inspecting the patterns of the collected ads videos, we regard ads videos as 'short movies', since they are also filmed from scripts and edited by editors. Thus, their key characteristics can be summarized from advertiser-uploaded scripts. Exploiting the statistics of word frequencies in scripts, we summarize hundreds of keywords by deleting trivial words from the high-frequency words. Then, we use K-means [41] to cluster them and obtain several centers, such as 'place', 'presentation', 'character', and 'style', to form class hierarchy v1.0. Based on domain knowledge of making ads videos, we further add more classes from the production process, e.g., shooting forms such as 'interview' or 'multi-person sitcom' for filming original videos, and generation forms such as 'cross-cutting' or 'slide show' for editing ads videos. For the abstract concepts in 'style', we also add classes of emotions to make them complete. As a result, we build a three-layer hierarchy from three perspectives: 'presentation form', 'style', and 'place'. After removing the classes with a very small number of instances (i.e., ≤ 100) at the end of the annotation process, our TAVS dataset has 82 high-frequency classes, as shown in Fig. 3. The numbers of categories for presentation, style, and place are 25, 34, and 23, respectively. Fig. 4 shows the label distribution of presentation, style, and place in different colors, which is a typical long-tailed distribution. Please refer to Appendix B.2 for the complete list of our categories. Annotation Process and Quality Control. Unlike existing action localization or segmentation tasks, whose main annotation challenge is the ambiguity of action boundaries, human-edited ads videos have many shot changes serving as clear demarcations. Therefore, the main challenge of our dataset is the annotation of its multi-label fine-grained categories. Specifically, a scene has about 6 classes on average, so annotators tend to miss some classes rather than annotate a wrong label. Based on these concerns, we wrote a detailed handbook for annotators, which contains an elaborate description and a sample video for each class, and asked them to pay attention to the easily-missed classes. Our handbook also describes when a scene changes and in which situations a scene crosses multiple shots. In our annotation process, we first let 7 annotation companies annotate 100 trial videos and selected the top 3 companies, according to their annotation accuracy, to annotate our entire dataset. During annotation, we randomly sampled about 1/10 of the videos in each batch and checked the quality of the annotations, especially for missing classes. We approved a batch if the annotation accuracy was satisfactory (e.g., ≥ 90%), and otherwise let the annotators revise the annotations of the whole batch. Each video's annotations were revised about 3 times on average. We also adopted heuristic rules, e.g., mutually exclusive classes should not appear in the same video, to examine the quality. As a result, we believe that the quality of our dataset is carefully controlled. Finally, as scene boundaries annotated by humans can hardly reach per-frame precision, we use a shot detection method, TransNet v2 [54], to refine scene boundaries as post-processing, i.e., we substitute the nearest shot boundary for the original scene boundary annotation if their distance is small (i.e., ≤ 0.1s). Please refer to Appendix B.1 for a visualization of our annotation tool and our guidebook for annotators. Access to the TAVS Dataset. Our dataset will be made publicly available once our paper is accepted. We will release annotations of the train/val sets and all 12,000 videos, and maintain an online server for evaluating performance on the test set.
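The boundary-refinement post-processing described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' code; the function name and list-based interface are ours, and the detected shot boundaries would in practice come from TransNet v2:

```python
def snap_to_shot_boundaries(scene_boundaries, shot_boundaries, max_dist=0.1):
    """Replace each annotated scene boundary (in seconds) with the nearest
    detected shot boundary when they are at most `max_dist` seconds apart;
    otherwise keep the original human annotation."""
    snapped = []
    for t in scene_boundaries:
        nearest = min(shot_boundaries, key=lambda s: abs(s - t), default=None)
        if nearest is not None and abs(nearest - t) <= max_dist:
            snapped.append(nearest)       # snap to the detected shot change
        else:
            snapped.append(t)             # keep the human annotation as-is
    return snapped
```

For example, an annotation at 3.02s next to a detected shot change at 3.00s would be snapped, while a boundary far from any shot change is left untouched.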

Dataset Statistics
Ads videos in the TAVS dataset come from real industry data. The dataset contains 12,000 ads videos (142.1 hours in total), divided into 5,000 for the training set, 2,000 for the validation set, and 5,000 for the test set. Most video durations in TAVS are between 25 and 60 seconds, with an average length of 45.5 seconds. In Fig. 4, the per-class data size shows that TAVS has a typical long-tailed distribution, indicating the realistic difficulty of our dataset. As shown in Table 1, previous multi-modal scene segmentation datasets commonly have no class annotations (e.g., MovieScenes [45]), while TAVS does. Previous single-modal action segmentation datasets commonly have no shot changes in videos and only require single-label predictions, and most of them are smaller than TAVS (e.g., 50Salads [55] and Breakfast [30]). Furthermore, our dataset covers a wider range of real-world scenarios (e.g., city streets or offices) instead of the very narrow domains of action segmentation (e.g., cooking). Thus, our class hierarchy is more diverse and richer than previous ones. In Table 2, we compare TAVS with other multi-label video recognition datasets. The scale of most previous datasets is much smaller than ours, while HVU [15] has no temporal annotations and no multi-modal information.
MovieNet [26] has more categories than ours, but it merely contains 'action' and 'place' classes while our TAVS has many high-level abstract classes.The scale of MovieNet is also smaller than ours.

Metrics
Due to the properties of our scene segmentation task, e.g., multi-label prediction in a temporal segmentation setting, many metrics adopted by existing tasks cannot be directly used for our task. Here we first briefly review some related metrics and then introduce the metrics we adopt. Metrics of previous tasks. The most related task, action segmentation [16,32,65], adopts per-frame accuracy, an edit-distance based score, and F1-scores at different tIoUs. However, the multi-label annotations of our task make these metrics hard to compute and unsuitable to use. Previous scene segmentation methods [45] without scene-level class annotations use the mAP of scene boundaries as their metric. However, they commonly first apply a shot detection method [10] and regard scene segmentation as a binary classification task deciding whether each shot boundary is a scene boundary. Therefore, this metric is also unsuitable for our accurate scene boundary annotations with minimum units of 0.04s (per-frame annotation in 25fps videos). Metrics of our TAVS task. To measure performance on our TAVS task, we first adopt the mAP@tIoU metric, similar to the temporal action detection task (e.g., ActivityNet [22]), to evaluate multi-label classification and IoU-based localization. We also adopt the F1@distance metric, similar to the GEBD challenge [50], to evaluate the accuracy of scene boundary localization alone. Specifically, for mAP@tIoU, we require the submitted scene segments to have no overlaps to adapt the detection metrics [22] to our scene segmentation task. We take the average of mAPs from tIoU 0.5 to 0.95 with stride 0.05 as our final mAP evaluation. For F1@time, we use absolute distances instead of the relative ones in [50] to avoid the influence of long segments on their neighboring short segments. Formally, the average mAP we use is

Avg mAP = (1/10) Σ_{tIoU ∈ {0.5, 0.55, ..., 0.95}} (1/N) Σ_{i=1}^{N} AP_i@tIoU,

where N is the total number of categories, and AP is the Average Precision commonly used in detection tasks. To measure the accuracy of scene boundary localization, we use F1-scores at distance thresholds t from 0.1s to 0.5s, averaged as Avg F1. To compute the F1-score at each temporal distance t, we match each predicted boundary to the nearest ground-truth scene boundary and regard it as a True Positive (TP) if their distance is at most t; otherwise it is a False Positive (FP). A ground-truth scene boundary is deleted once it is positively matched (i.e., ≤ t), and the ground-truths without any positive match are counted as False Negatives (FN). Therefore, precision is TP/(TP+FP) and recall is TP/(TP+FN). The final Avg F1 is computed by

Avg F1 = (1/|T|) Σ_{t ∈ T} 2 · Precision_t · Recall_t / (Precision_t + Recall_t),

where T is the set of distance thresholds from 0.1s to 0.5s.
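The boundary-matching procedure above can be sketched as follows. This is a minimal reimplementation of the described metric, not the official evaluation code; matching each prediction greedily to its nearest still-unmatched ground-truth is one reasonable reading of the matching rule, and the function names are ours:

```python
def f1_at_distance(pred, gt, t):
    """F1 at distance threshold t (seconds): a prediction matched to the
    nearest unmatched ground-truth boundary within t is a True Positive;
    each ground-truth boundary may be matched at most once."""
    gt_left = list(gt)
    tp = 0
    for p in pred:
        if not gt_left:
            break
        nearest = min(gt_left, key=lambda g: abs(g - p))
        if abs(nearest - p) <= t:
            tp += 1
            gt_left.remove(nearest)   # delete the matched ground truth
    fp = len(pred) - tp               # unmatched predictions
    fn = len(gt) - tp                 # unmatched ground truths
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def avg_f1(pred, gt, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Average of F1 scores over the distance thresholds (here 0.1s-0.5s)."""
    return sum(f1_at_distance(pred, gt, t) for t in thresholds) / len(thresholds)
```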

Dataset Characteristics
As discussed above, our proposed new setting of the scene segmentation task and the TAVS benchmark have several distinguishing characteristics compared with existing tasks and datasets. Difficulty. TAVS is difficult in several aspects compared to previous related datasets: 1) multi-label classification from a holistic hierarchy, while previous datasets are limited to multiple overlapping actions [69] or have no temporal annotations [15]; 2) rich multi-modal information in the form of video frames, audio, and text from ASR and OCR, while previous datasets commonly contain a single visual modality. The realistic difficulty of TAVS makes it a good arena for algorithms to understand multi-modal videos as well as our real world. High Quality. Our videos are collected from real industry data and have high resolutions. With our careful quality control, the class labels and temporal segments are accurately annotated. In addition, compared to MovieNet [26], which only releases three frames per shot, the richer content of the original videos will be released in our TAVS dataset. Real-world Applications. Due to the realistically difficult ads videos it contains, the TAVS dataset has great potential for applications. For example, 1) creative video editing: by understanding the detailed content of ads videos from different perspectives as in our TAVS dataset, a single ads video can be automatically split into segments and re-organized into versions of various lengths, saving the efforts of video editors; 2) recommendation: the fine-grained categories of each video segment, rather than coarse video-level categories, can better serve ads recommendation. Please refer to Appendix C.1 for a discussion of our future work.

Multi-modal Scene Segmentation Network
Based on the discussion of existing scene segmentation and action segmentation methods (Sec. 2.2), we find that no existing methods are suitable for our proposed multi-modal ads video segmentation task. Thus, we propose our baseline model, the 'Multi-modal Scene Segmentation Network' (MM-SSN), for multi-modal and multi-label video segmentation, as shown in Fig. 5. Our model exploits the three modalities in our TAVS dataset and performs multi-modal fusion operations to integrate all the information. Then, a temporal modeling network with an architecture similar to MS-TCN++ [35] is used to capture long-term dependencies. At last, each location with integrated long-term information is used to perform three predictions: multi-label classification, scene segmentation, and scene boundary regression. Feature Extractors. We use three off-the-shelf feature extractors for the modalities in our MM-SSN: images with Swin Transformer [40], audio with VGGish [23], and text with BERT [14]. Because both our categories and the contents of our videos have little connection with motion, the video backbone S3D [68] pretrained on the HowTo100M dataset [43] performs even worse than the strong image backbone Swin Transformer trained on the large-scale ImageNet dataset [13]. For better computational efficiency, we choose image features at 2fps instead of video features in our MM-SSN. Texts in ads videos are extracted by ASR [70] with noisy timestamps. We align these texts to the 2fps image frames according to their timestamps and use zero-padding for moments without speech. Multi-modal Fusion. As the importance of each modality is not equal, we adopt a two-stage fusion strategy for the three modalities: 1) for the well-aligned frame and audio features with accurate timestamps, we first concatenate frame features (T × D1) and audio features (T × D2) into T × (D1 + D2), and then perform a channel attention operation similar to SE [25] to automatically select salient features between the two modalities; 2) the salient features with positional embeddings (PE) are fused with text features (also with PE) by cross-attention [62]. Then, the fused feature is concatenated with the salient feature from Stage 1 to form the final multi-modal feature. Temporal Modeling. To capture the long-term dependencies of videos, we adopt MS-TCN++ [35], which greatly enlarges the receptive field of our model via dilated convolutions. Specifically, MS-TCN++ has N stages, and each stage has L dilated convolutional layers with dilation r = 2^l, where l is the index of the current layer, and N and L are hyperparameters. For the first stage, MS-TCN++ has L extra layers with dilation r = 2^(L−l) to simultaneously capture neighborhood and remote information.
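The core temporal operation can be illustrated with a minimal pure-Python sketch of a single dilated 1-D convolution (kernel size 3, zero padding, as in MS-TCN++); stacking L such layers with dilation 2^l makes the receptive field grow exponentially with depth. This is an illustrative sketch under our own simplifications (single channel, no ReLU or residual connections), not the authors' implementation:

```python
def dilated_conv1d(x, w, dilation):
    """Single-channel 1-D dilated convolution with zero padding.

    `x` is a feature sequence of length T, `w` a kernel (length 3 in
    MS-TCN++), and `dilation` the gap between tapped positions; the
    output has the same length as the input."""
    k = len(w)
    pad = dilation * (k // 2)                  # keep output length == T
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j * dilation] for j in range(k))
            for i in range(len(x))]
```

With dilation 1 this is an ordinary convolution; with dilation 2^l at layer l, a stack of L layers covers a receptive field on the order of 2^(L+1) frames, which is how MS-TCN++ captures long-range context without pooling.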
Prediction Heads and Loss Functions. Previous temporal action segmentation methods such as MS-TCN++ [35] commonly utilize a cross-entropy loss for frame-wise classification and a smoothing loss for regularization. Thus, the boundary of each predicted segment is decided by the transitions of classification results, and no explicit boundary localization is performed. To overcome the challenge of multi-label classification in our dataset, we depart from this common practice in action segmentation and develop three heads for predicting segment boundaries and their multi-label categories. Before performing segment boundary localization and regression, we apply a temporal difference module over adjacent frames to make features more salient for boundary localization. A sequential binary cross-entropy loss is used to classify whether each sampled location is a scene boundary. The accurate ground-truth is approximated as a 0-1 sequence according to the sampling rate (e.g., 2 fps). We also use a boundary regression head to tackle the remaining small temporal errors by predicting the offset from the nearest sampled location to the ground-truth timestamp with a smooth-L1 loss. For the classification head, we adopt the ASL loss [6] on the output of the temporal modeling module for multi-label classification; it has promising performance on long-tailed distributions and is thus suitable for our dataset. We use it at multiple temporal scales: we first apply it to all sampled locations, and then we temporally max-pool the predicted classification confidence scores inside each scene (determined by the ground-truth) or inside the whole video and apply ASL for scene-level or video-level classification to further boost performance. Please refer to Appendix A.1 for the implementation details of our MM-SSN.
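The construction of the boundary targets described above can be sketched as follows, assuming frame-accurate ground-truth timestamps and a low sampling rate. The function and its details are our illustrative assumptions (e.g., rounding to the nearest sampled location), not the paper's code:

```python
def boundary_targets(gt_boundaries, duration, fps=2.0):
    """Build per-location training targets at the given sampling rate:
    a 0/1 boundary-classification sequence, plus the regression offset
    (seconds) from each positive location to the exact timestamp."""
    n = int(duration * fps)          # number of sampled locations
    cls = [0] * n                    # 0-1 scene-boundary sequence
    reg = [0.0] * n                  # offsets for the regression head
    for t in gt_boundaries:
        i = min(n - 1, max(0, round(t * fps)))   # nearest sampled location
        cls[i] = 1
        reg[i] = t - i / fps         # residual the regression head recovers
    return cls, reg
```

At 2 fps a boundary can be up to 0.25s from its nearest sampled location, which is exactly the error the smooth-L1 regression head is meant to remove.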

Experiments
We comprehensively investigate representative methods from related tasks and present ablation studies of our proposed MM-SSN on the TAVS benchmark. Results are reported on the test set by selecting the best model on the validation set.

Scene Segmentation Results
In this section, we compare our proposed MM-SSN with representative single-modal temporal action detection/segmentation methods and multi-modal scene segmentation methods on our TAVS dataset. Please refer to Appendix A.2 for implementation details. Temporal Action Detection: BMN [37] + NeXtVLAD [36]. BMN is a representative and effective temporal action detection method for generating action proposals, so we choose it to predict segment boundaries by replacing the 'start' and 'end' classes with a single 'boundary' class. The segments are then classified into multi-label categories by a representative video classification method, NeXtVLAD. Temporal Action Segmentation: MS-TCN++ [35]. MS-TCN++ is a representative temporal action segmentation method performing frame-wise single-label action classification. We simply replace the cross-entropy loss with a multi-label classification loss, i.e., the Binary Cross-Entropy (BCE) loss, and use a threshold of 0.5 to generate segments. We use a simple post-processing strategy to merge labels within the same segment.
Scene Segmentation: LGSS [45]. Since there are shot changes in our TAVS dataset, we can also adopt a traditional scene segmentation method to predict scenes and then perform multi-label classification for each scene. LGSS is a representative scene segmentation method that uses a multi-modal LSTM to aggregate shots (obtained from shot detection [10]) with a sequential binary classification loss.

Comparison of Results:
We adopt the same feature encoders for fair comparisons with our MM-SSN: 1) single modality: Swin Transformer for BMN, NeXtVLAD, and MS-TCN++; 2) multiple modalities: Swin Transformer, VGGish, and BERT for LGSS. In Tab. 3, we compare the performance of MM-SSN with these methods. Even with an independent multi-label classification network, BMN still generates much worse proposals than our MM-SSN, as shown by the F1 scores. We believe the reason is that its 2D proposal map limits the number of its fixed-length samples, so localization errors tend to be larger than 0.5s. MS-TCN++ also achieves poor performance on our dataset because its segment boundaries are localized by transitions of classification results, which is not suitable for the studied multi-label segmentation problem. Due to its inaccurate shot detector, LGSS performs much worse than MM-SSN and also has a more complex multi-stage architecture.

Ablation Study
How important is each modality? For a multi-modal dataset, the contribution of each modality to the final performance matters. As shown in Tab. 4, the video frame is the most important modality, as expected, while using all three modalities further improves the overall mAP by 1%. We believe this is a considerable improvement given the strong performance of our baseline architecture with optimized prediction heads and loss functions. Which category is more challenging? As shown in Fig. 4, our dataset has a long-tailed distribution, which makes tail classes even more challenging. In Fig. 6, we perform a per-class scene-level AP analysis (tIoU 0.5-0.95) by comparing MM-SSN with its single-modal variant (video frames only). We observe that the per-class performance is not strictly aligned with class sizes. Tail classes like 'studio' and classes closely related to other modalities like 'dubbing' tend to benefit more from multi-modal inputs.
Multi-modal Fusion Strategies. In Tab. 5, we compare our two-stage fusion strategy with several variants: 1) simple concatenation; 2) SE for all modalities; 3) SE for video and audio features, then self-attention for fusing the text feature by concatenating it along the temporal dimension; 4) SE and dual-direction cross-attention; 5) SE and single-direction cross-attention. Although their results are close to each other, our fusion strategy still yields a small performance gain, especially in mAP. Temporal Modeling. In Tab. 6, we compare representative temporal modeling techniques: the transformer encoder achieves the best results in F1@0.5, but it makes features less discriminative, so its Avg mAP is much lower. Bi-LSTM achieves the worst results due to its poor ability to handle long video sequences. Scene Boundary Localization Strategies. We design two components to enhance accurate scene boundary localization. As shown in Tab. 7, the boundary regression head greatly contributes to a higher F1-score, while the feature difference module also leads to a notable performance gain in F1@0.1s. In Tab. 8, we also investigate the influence of the sampling rate: 5 fps leads to a considerable improvement in F1 yet a small decrease in mAP. For both high mAP and computational efficiency, we choose 2 fps as the default.
Classification Losses. Due to the long-tailed distribution of our dataset, directly applying a standard multi-label classification loss such as BCE leads to sub-optimal performance. We compare the recent ASL loss [6] with BCE: ASL obtains a notable mAP gain of about 1.5% while keeping the F1 score almost unchanged.
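For reference, below is a minimal numpy sketch of the asymmetric loss (ASL) [6] for multi-label classification. The hyperparameter values (`gamma_neg=4`, `clip=0.05`) follow the defaults reported in the ASL paper and are assumptions here, not necessarily the values used in our experiments.

```python
import numpy as np

def asl_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
    """Asymmetric loss (ASL) for multi-label classification.

    Compared with BCE, negatives are down-weighted by a larger focusing
    exponent (gamma_neg), and easy negatives whose probability falls below
    the margin `clip` are discarded entirely, which helps under a
    long-tailed label distribution.
    """
    p = 1.0 / (1.0 + np.exp(-logits))      # sigmoid probabilities
    p_neg = np.maximum(p - clip, 0.0)      # probability shifting for negatives
    loss_pos = targets * (1.0 - p) ** gamma_pos * np.log(p + eps)
    loss_neg = (1.0 - targets) * p_neg ** gamma_neg * np.log(1.0 - p_neg + eps)
    return -(loss_pos + loss_neg).mean()

# Toy batch: 2 samples, 3 labels.
logits = np.array([[2.0, -1.0, 0.5], [-2.0, 3.0, -0.5]])
targets = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
print(asl_loss(logits, targets))
```

Setting `gamma_pos = gamma_neg = 0` and `clip = 0` recovers plain BCE, so ASL can be viewed as BCE with asymmetric focusing between positive and negative labels.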

Conclusion
The goal of this work is to drive video understanding research in new directions by introducing a new scene segmentation task to the computer vision community, accompanied by the purpose-built Tencent Ads Video Segmentation (TAVS) benchmark. We believe the proposed task on TAVS can spur both expected and unexpected innovations. We reveal the key challenges in TAVS and aim to provide useful guidance for future research. We also hope that the proposed task and dataset will catalyze deeper research in video-related areas, leading to exciting new breakthroughs.

Figure 2.
Figure 1. Visualization of a typical ads video in the TAVS dataset. Our new setting of scene segmentation requires precisely segmenting videos into scenes and predicting scene-level multi-label categories.

Figure 3. The taxonomy tree of the TAVS dataset.

Figure 4. Statistics of per-class data size. The three perspectives of our multi-label classification are shown in different colors.

Figure 5. The architecture of our proposed multi-modal baseline.


Table 1. Comparison with video segmentation datasets. Action segmentation datasets annotate per-frame action classes but commonly lack multi-modal information; scene segmentation datasets annotate only scene boundaries without scene-level classes. TAVS has rich multi-modal inputs and holistic scene-level categories. † means ASR is used during annotation, but no multi-modal inputs are provided.

Table 2. Comparison with multi-label video recognition datasets.

Figure 6. Per-class scene-level AP analysis comparing our multi-modal baseline with its single-modal variants on the TAVS benchmark.
Table 3. Comparison with previous representative methods on related tasks. † means not originally designed for our task.

Table 4. Ablation on modalities used in our baseline.

Table 5. Ablation on fusion strategies.

Table 6. Ablation on temporal modeling methods.

Table 8. Ablation on the sampling rate of video frames.