Skip to Main Content
We propose a method for describing human activities from video images by tracking human skin regions: facial and hand regions. To detect skin regions robustly, three kinds of probabilistic information are extracted and integrated using Dempster-Shafer theory. The main difficulty in transforming video images into textual descriptions is bridging the semantic gap between them. By associating visual features of head and hand motion with natural language concepts, appropriate syntactic components such as verbs, objects, etc. are determined and translated into natural language.