An audio-visual content analysis method is presented that analyzes both auditory and visual information sources and accounts for their inter-relations and coincidence to extract high-level semantic information. Visual information is accessed at both the shot level and the object level. Because video is inherently temporal, the analysis produces time-constrained video labelling functions. Audio source parsing yields a speaker identity mapping function over time, and visual source parsing yields a talking-face shot mapping function over time. Integrating the audio and visual mappings under a set of interaction rules leads to more detailed video content descriptions and even partial detection of the video's context.
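The integration step can be illustrated with a minimal sketch. It assumes each mapping function is represented as a list of (start, end, value) time segments, and that the interaction rules simply combine the speaker label with the talking-face flag on each overlapping interval. The segment representation, the names `label_at` and `integrate`, the sample data, and the four output labels are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a mapping function over time is a list of
# (start_seconds, end_seconds, value) segments.
AudioSegment = Tuple[float, float, Optional[str]]   # value: speaker id or None
VisualSegment = Tuple[float, float, bool]           # value: talking face present?
Labelled = Tuple[float, float, str]


def label_at(segments, t, default):
    """Return the value of the segment covering time t, or default if none does."""
    for start, end, value in segments:
        if start <= t < end:
            return value
    return default


def integrate(audio: List[AudioSegment], visual: List[VisualSegment]) -> List[Labelled]:
    """Combine the two mappings on every interval between segment boundaries."""
    cuts = sorted({t for s, e, _ in audio + visual for t in (s, e)})
    timeline: List[Labelled] = []
    for start, end in zip(cuts, cuts[1:]):
        mid = (start + end) / 2.0
        speaker = label_at(audio, mid, None)
        face = label_at(visual, mid, False)
        # Assumed interaction rules (illustrative only).
        if speaker is not None and face:
            label = f"{speaker} on screen"
        elif speaker is not None:
            label = f"{speaker} off screen"
        elif face:
            label = "silent face"
        else:
            label = "other"
        # Merge with the previous interval when the label does not change.
        if timeline and timeline[-1][2] == label and timeline[-1][1] == start:
            timeline[-1] = (timeline[-1][0], end, label)
        else:
            timeline.append((start, end, label))
    return timeline


# Made-up sample mappings for two speakers, "A" and "B".
audio = [(0.0, 4.0, "A"), (4.0, 6.0, None), (6.0, 10.0, "B")]
visual = [(0.0, 2.0, True), (2.0, 8.0, False), (8.0, 10.0, True)]
timeline = integrate(audio, visual)
```

Splitting the timeline at every boundary of either mapping keeps each output interval homogeneous in both sources, so any rule that depends only on the pair (speaker, talking face) can be applied interval by interval.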
Published in:
Multimedia Computing and Systems, 1999. IEEE International Conference on
(Volume: 1)
Date of Conference: Jul 1999