Summary form only given as follows. In this paper the focus is on the second main channel of information, human pose and gesture. We present an architecture for a real-time, high-precision stereo vision system that finds humans, tracks the movement of their heads and hands, and offers this data to other devices connected to the local area network. The system uses industry-standard hardware and is capable of tracking several humans at more than 20 frames per second. It integrates into TCP/IP-based PC networks and is scalable to more than two cameras. The concept is to create an area that is able to perceive what is happening in it, build a model of itself, and reason about the actions of its inhabitants. Adaptive background separation is used to separate the shape of the human from the scene. Color histograms are used to identify skin areas. The detected pixels are clustered, and each cluster is described by its mean and covariance matrix. Experimental results show that the system can track human hands and heads in 3D with a precision better than 10 cm. It has successfully been used to generate maps for robots from the movements of humans.
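The description of a pixel cluster by its mean and covariance matrix can be illustrated with a minimal sketch. It assumes 2D image coordinates and the unbiased (n - 1) covariance estimate; the function name `describe_cluster` and the sample pixels are hypothetical and are not taken from the system itself.

```python
def describe_cluster(pixels):
    """Summarize a cluster of (x, y) pixel coordinates.

    Returns the mean position and the 2x2 sample covariance matrix.
    Assumption: unbiased estimate, i.e. division by n - 1.
    """
    n = len(pixels)
    if n < 2:
        raise ValueError("need at least two pixels to estimate a covariance")

    # Mean position of the cluster (its centroid).
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n

    # Second-order moments about the mean.
    var_x = sum((x - mean_x) ** 2 for x, _ in pixels) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in pixels) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / (n - 1)

    return (mean_x, mean_y), ((var_x, cov_xy), (cov_xy, var_y))


# Hypothetical skin-colored pixels belonging to one cluster.
skin_pixels = [(10, 20), (12, 22), (14, 24), (12, 20), (12, 24)]
mean, cov = describe_cluster(skin_pixels)
print(mean)  # (12.0, 22.0)
print(cov)   # ((2.0, 2.0), (2.0, 4.0))
```

The mean gives the position of a candidate head or hand region, while the covariance matrix captures its extent and orientation in the image, so a whole cluster is reduced to a handful of numbers.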