I. Introduction
Humans can understand the world at a single glance, and this recognition ability stems from perceiving several different aspects of what is seen. To recognize an object, people first notice its shape and contour, then attend to its appearance and texture [1]. Both shape and appearance help people form a concept of the object in the brain. Scene recognition draws not only on shape and appearance but also on other cues, such as the relative locations of objects within the whole picture [2]. Video event analysis requires still more cues, whose combined effect determines the event concept. How to combine different cues has therefore become a major problem in video and image processing, and it is equally important for data fusion in multi-sensor networks.