This work examines whether semantic information derived from analysis of the visual modality can be exploited for segmenting video into scenes. In contrast to the low-level visual features typically used in previous approaches, this information is obtained by applying trained visual concept detectors, such as those developed and evaluated as part of the TRECVID High-Level Feature Extraction Task. A large number of non-binary detectors are used to define a high-dimensional semantic space. In this space, each shot is represented by the vector of detector confidence scores, and the similarity of two shots is evaluated using an appropriately defined shot semantic similarity measure. The proposed approach is evaluated on two test datasets, using baseline concept detectors trained on a dataset completely different from the test ones. The results show that the use of such semantic information, which we term "visual soft semantics", contributes to improved decomposition of video into scenes.
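
To make the shot representation concrete, the following minimal sketch treats each shot as a vector of detector confidence scores and compares two shots. The abstract does not specify the shot semantic similarity measure, so cosine similarity is used here purely as an illustrative stand-in; the function name `shot_similarity`, the three-concept space, and all score values are hypothetical.

```python
import math


def shot_similarity(scores_a, scores_b):
    """Cosine similarity between two detector-score vectors.

    NOTE: cosine similarity is an assumption made for illustration;
    it is not the similarity measure defined in the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both shots must be scored by the same detectors")
    dot = sum(a * b for a, b in zip(scores_a, scores_b))
    norm_a = math.sqrt(sum(a * a for a in scores_a))
    norm_b = math.sqrt(sum(b * b for b in scores_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # No detector fired for at least one shot: treat as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical confidence scores from three concept detectors,
# e.g. (outdoor, face, studio). Values are made up for the example.
shot_1 = [0.9, 0.1, 0.2]
shot_2 = [0.8, 0.2, 0.1]
shot_3 = [0.1, 0.9, 0.7]

sim_12 = shot_similarity(shot_1, shot_2)  # semantically close shots
sim_13 = shot_similarity(shot_1, shot_3)  # semantically distant shots
```

Under this assumed measure, `sim_12` comes out higher than `sim_13`, which is the kind of signal a scene segmentation method could use when deciding whether neighbouring shots belong to the same scene.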