1. INTRODUCTION
Rapid advances in hardware technology have led to a tremendous increase in the amount of video content generated and distributed every day. As a consequence, efficient and advanced methodologies for video manipulation have become a challenging and pressing need. To this end, several approaches have been proposed in the literature for tasks such as search and organization of video content. More recently, the principle of processing audio-visual information from a semantic-oriented perspective has been widely adopted, in an attempt to bridge the so-called semantic gap [1] and to efficiently capture the underlying semantics of the content.