Sensor fusion for object tracking has become an active research direction in recent years, yet how to perform fusion in a robust and principled way remains an open problem. In this paper, we propose a new fusion framework that combines bottom-up and top-down approaches to probabilistically fuse multiple sensing modalities. At the lower level, individual vision and audio trackers are designed to generate effective proposals for the fuser. At the higher level, the fuser performs reliable tracking by verifying hypotheses against multiple likelihood models derived from multiple cues. Unlike traditional fusion algorithms, the proposed framework is a closed-loop system in which the fuser and the trackers coordinate their tracking information. Furthermore, to handle nonstationary situations, the framework evaluates the performance of the individual trackers and dynamically updates their object states. Based on this framework, we present a real-time speaker tracking system that fuses object contour, color, and sound source location, and we report robust tracking results.
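To make the hypothesis-verification step concrete, the sketch below illustrates the general idea of scoring tracker proposals with a product of per-cue likelihoods and keeping the best-scoring hypothesis. All names, the 1-D state, and the Gaussian likelihood models are illustrative assumptions, not the paper's actual models or implementation.

```python
import math

def gaussian_likelihood(x, measurement, sigma):
    # Hypothetical stand-in for a cue-specific likelihood model:
    # p(observation | hypothesized state x), an unnormalized Gaussian
    # centered on that cue's measurement.
    return math.exp(-0.5 * ((x - measurement) / sigma) ** 2)

def fuse(proposals, measurements, sigmas):
    # Score each hypothesized state by the product of the cue
    # likelihoods (here: contour, color, audio), assuming conditional
    # independence of the cues, and return the best hypothesis.
    best, best_score = None, -1.0
    for x in proposals:
        score = 1.0
        for cue, z in measurements.items():
            score *= gaussian_likelihood(x, z, sigmas[cue])
        if score > best_score:
            best, best_score = x, score
    return best, best_score

# Proposals from low-level trackers (hypothetical 1-D positions)
# and per-cue measurements with assumed noise levels.
proposals = [0.0, 0.4, 0.5, 0.9]
measurements = {"contour": 0.45, "color": 0.55, "audio": 0.5}
sigmas = {"contour": 0.1, "color": 0.2, "audio": 0.3}
estimate, score = fuse(proposals, measurements, sigmas)
```

In this toy run the hypothesis at 0.5 wins because it is jointly plausible under all three cues; a hypothesis that matches one cue but not the others is suppressed by the product, which is the intuition behind verifying proposals over multiple likelihood models.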