This paper describes a basic architecture for retrieving images previously extracted from video files. Our approach comprises two main subsystems: a speech-based retrieval module and an image-based retrieval module. The experiments presented in this work aim to establish a baseline for the automatic video image retrieval task, making use of the speech transcripts and the key frames extracted from the video files. The main conclusion is that fusion strategies that merge the textual and visual data of the queries outperform approaches that use only the textual or only the visual part of the queries. Furthermore, the results obtained confirm that in a content-based IR system it is preferable to give more weight to the documents retrieved by the speech-based IR subsystem than to those retrieved by the image-based IR subsystem.
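The weighted fusion idea above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, score dictionaries, and the weight value are illustrative assumptions. It performs a late fusion of two retrieval runs by a weighted sum of per-document scores, with the weight favoring the speech-based run:

```python
def fuse_scores(speech_scores, image_scores, alpha=0.7):
    """Late fusion of two retrieval runs by weighted score sum.

    `alpha` is the weight of the speech-based run; a value above 0.5
    reflects the finding that speech-transcript evidence should count
    more than visual evidence. A document missing from one run simply
    contributes a score of 0.0 for that run.
    """
    docs = set(speech_scores) | set(image_scores)
    return {
        d: alpha * speech_scores.get(d, 0.0)
           + (1.0 - alpha) * image_scores.get(d, 0.0)
        for d in docs
    }

# Hypothetical scores from the two subsystems for three documents.
speech = {"doc1": 0.9, "doc2": 0.4}
image = {"doc2": 0.8, "doc3": 0.6}

fused = fuse_scores(speech, image, alpha=0.7)
ranked = sorted(fused, key=fused.get, reverse=True)
```

With these toy scores, a document found only by the image subsystem ("doc3") is ranked below documents supported by the speech transcripts, which is the behavior the weighting is meant to produce.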