We describe and evaluate a vision-based technique for tracking many people with a network of stereo camera sensors. The technique requires only lightweight local communication and is suited to crowded scenes where targets are frequently occluded and where appearance-based modeling techniques fail. In our approach, each stereo sensor independently estimates the 3D trajectories of salient feature points on moving objects and communicates a subset of its sparse 3D measurements to sensors with overlapping views. Each sensor then fuses the 3D measurements from nearby sensors with a particle filter to robustly track nearby objects. We evaluate the technique with the MOTA and MOTP multi-target tracking metrics on real datasets with up to 6 people and on challenging simulations of crowds of up to 25 people with uniform appearance. Our method achieves a tracking precision of 10-30 cm in all cases and good tracking accuracy even in crowded scenes of 15 people.
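To make the fusion step concrete, the following is a minimal, hypothetical sketch of how a sensor might fuse sparse 3D point measurements from several overlapping sensors with a bootstrap particle filter, as the abstract describes. The function name, motion model (a Gaussian random walk), and all noise parameters are illustrative assumptions, not the paper's actual filter.

```python
import math
import random

def particle_filter_fuse(measurements, n_particles=500, motion_std=0.1,
                         meas_std=0.1, seed=0):
    """Fuse noisy 3D point measurements from multiple sensors.

    `measurements` is a list over time steps; each step holds the
    (x, y, z) readings reported by nearby sensors for one target.
    Returns the estimated 3D trajectory (one mean position per step).
    Illustrative sketch only; not the paper's exact algorithm.
    """
    rng = random.Random(seed)
    # Initialize particles around the first sensor reading.
    x0 = measurements[0][0]
    particles = [tuple(c + rng.gauss(0, meas_std) for c in x0)
                 for _ in range(n_particles)]
    trajectory = []
    for obs in measurements:
        # Predict: diffuse each particle with a random-walk motion model.
        particles = [tuple(c + rng.gauss(0, motion_std) for c in p)
                     for p in particles]
        # Update: log-likelihood of all sensors' readings under an
        # isotropic Gaussian measurement model.
        logws = []
        for p in particles:
            logw = 0.0
            for z in obs:
                d2 = sum((pc - zc) ** 2 for pc, zc in zip(p, z))
                logw += -d2 / (2 * meas_std ** 2)
            logws.append(logw)
        # Normalize weights (subtract the max first for stability).
        m = max(logws)
        weights = [math.exp(lw - m) for lw in logws]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Estimate: weighted mean particle position.
        est = tuple(sum(w * p[i] for w, p in zip(weights, particles))
                    for i in range(3))
        trajectory.append(est)
        # Resample (multinomial) to concentrate on likely states.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return trajectory
```

For example, with two simulated sensors each reporting noisy readings of a target walking along the x-axis, the filter's final estimate stays close to the true position despite per-sensor noise comparable to the 10-30 cm precision reported in the abstract.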