1. Introduction
Surveillance videos are crucial and indispensable for public security. In recent years, various surveillance-video-oriented tasks have been widely studied, e.g., anomaly detection and anomalous/human action recognition. However, existing surveillance video datasets [20], [25], [26], [38] provide only the category labels and timing of anomalous events, and require all categories to be predefined. Consequently, the related methods are limited to merely detecting and classifying predefined events, and lack the capacity for semantic understanding of video content. Yet automatic understanding of surveillance video content is crucial for enhancing existing investigative measures. Some surveillance applications need to search for specific event queries rather than broad categories, i.e., to use queries to retrieve events in surveillance videos. Meanwhile, intelligent surveillance is trending toward multimodal interaction, especially between video and text.
[Figure: Annotation examples in our UCA dataset, including fine-grained sentence queries and the corresponding timing.]