Tracking human body poses in monocular video has many important applications. The problem is challenging in realistic scenes because of background clutter, variation in human appearance, and self-occlusion. The complexity of pose tracking increases further when multiple people are present and their bodies may occlude one another. We propose a three-stage approach with a multi-level state representation that enables hierarchical estimation of 3D body poses. Our method addresses automatic initialization, data association, self-occlusion, and inter-occlusion. In the first stage, humans are tracked as foreground blobs and their positions and sizes are coarsely estimated. In the second stage, parts such as the face, shoulders, and limbs are detected using various cues, and the detection results are combined by a grid-based belief propagation algorithm to infer 2D joint positions. In the third stage, the resulting belief maps serve as proposal functions for inferring the 3D pose with data-driven Markov chain Monte Carlo. Experimental results on several realistic indoor video sequences show that the method can track multiple persons through complex movements, including sitting and turning, in the presence of self-occlusion and inter-occlusion.
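
To illustrate how a belief map can act as a proposal function in a Markov chain Monte Carlo sampler, the following is a minimal sketch, not the method's actual implementation. It assumes a toy setting: a single joint on a small 2D grid rather than a full 3D pose, a hand-written belief map, and a hypothetical `target` function standing in for the image likelihood. All names (`normalize`, `sample_cell`, `dd_mcmc`) are illustrative. The sampler is a Metropolis-Hastings independence sampler, in which candidates are drawn from the belief map and accepted according to the ratio of target to proposal weights.

```python
import random

def normalize(grid):
    """Scale a grid of non-negative weights so that it sums to 1."""
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def sample_cell(prob, rng):
    """Draw one (row, col) cell with probability given by the grid."""
    cells = [(r, c) for r in range(len(prob)) for c in range(len(prob[0]))]
    weights = [prob[r][c] for r, c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]

def dd_mcmc(belief, target, n_steps, seed=0):
    """Independence Metropolis-Hastings with the belief map as proposal.

    belief : 2D list of non-negative weights (assumed belief map)
    target : function (row, col) -> unnormalized target density (assumed)
    Returns a dict mapping each visited cell to its visit count.
    """
    rng = random.Random(seed)
    q = normalize(belief)
    current = sample_cell(q, rng)
    counts = {}
    for _ in range(n_steps):
        candidate = sample_cell(q, rng)
        # Acceptance ratio for an independence sampler:
        # [p(x') / q(x')] / [p(x) / q(x)]
        num = target(candidate) * q[current[0]][current[1]]
        den = target(current) * q[candidate[0]][candidate[1]]
        if den == 0 or rng.random() < min(1.0, num / den):
            current = candidate
        counts[current] = counts.get(current, 0) + 1
    return counts

# Toy belief map that already favors the right-hand cell of the middle row.
belief = [[1, 1, 1],
          [1, 2, 4],
          [1, 1, 1]]

def target(cell):
    # Hypothetical likelihood: strongly peaked at cell (1, 2).
    return 10.0 if cell == (1, 2) else 1.0

counts = dd_mcmc(belief, target, n_steps=5000)
best = max(counts, key=counts.get)
```

Because the proposal already concentrates mass near the likely joint location, most candidates land in plausible cells, which is the motivation for using bottom-up detections to drive the sampler instead of proposing blindly.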