In this paper, we propose a system for indexing and retrieving player event scenes in golf videos using multimodal cues. Play scenes and audio classes are detected independently from the video and audio tracks at indexing time. The audio track is semantically segmented into basic audio categories (studio speech, field speech, music, applause, swing sound, and others) by means of audio classification and semantic occupation ratios. Visual play-start scenes and excited audience reactions are then combined to extract event scenes, and the player name associated with each event is indexed from the spoken descriptions. At retrieval time, the user selects a player name as a text query on screen. The posting lists for each query term are retrieved through a description matcher to identify full and partial phrase hits related to event scenes. Experimental results show that the implemented system achieves an average accuracy of 82.5%.
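The abstract mentions segmenting the audio track via "semantic occupation ratios", i.e. labeling a window of audio by the category that dominates its per-frame classifications. The paper does not give the algorithm here, so the following is only a minimal sketch of that idea under assumed parameters; the function name, window size, and threshold are hypothetical, not taken from the paper.

```python
from collections import Counter

# Hypothetical category set mirroring the classes named in the abstract.
AUDIO_CLASSES = ["studio_speech", "field_speech", "music",
                 "applause", "swing", "other"]

def segment_by_occupation_ratio(frame_labels, window=50, threshold=0.6):
    """Merge per-frame audio class labels into coarse semantic segments.

    Each fixed-size window is assigned the class whose occupation ratio
    (fraction of frames in the window carrying that label) meets
    `threshold`; otherwise the window is labeled "other".
    """
    segments = []
    for start in range(0, len(frame_labels), window):
        chunk = frame_labels[start:start + window]
        label, count = Counter(chunk).most_common(1)[0]
        ratio = count / len(chunk)
        segments.append(label if ratio >= threshold else "other")
    return segments

# 40 applause frames mixed with noise, then 50 frames of studio speech.
frames = ["applause"] * 40 + ["other"] * 10 + ["studio_speech"] * 50
print(segment_by_occupation_ratio(frames))  # ['applause', 'studio_speech']
```

In a full pipeline, the per-frame labels would come from an audio classifier (e.g. over short-time spectral features); the occupation-ratio step simply smooths those noisy frame decisions into segments stable enough for event detection.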