Visual hull (VH) construction from silhouette images is a popular method of shape estimation. The method, also known as shape-from-silhouette (SFS), is used in many applications such as non-invasive 3D model acquisition, obstacle avoidance, and, more recently, human motion tracking and analysis. One limitation of SFS, however, is that the approximated shape can be very coarse when only a few cameras are available. In this paper, we propose an algorithm that improves the shape approximation by combining multiple silhouette images captured across time. The improvement is achieved by first estimating the rigid motion between the visual hulls formed at different time instants (visual hull alignment) and then combining them (visual hull refinement) to obtain a tighter bound on the object's shape. Our algorithm first constructs a representation of the VHs called the bounding edge representation. Utilizing a fundamental property of visual hulls, namely that each bounding edge must touch the object at one or more points, we use multi-view stereo to extract points on the surface of the object, called colored surface points (CSPs). These CSPs are then used in a 3D image alignment algorithm to find the 6-DOF rigid motion between two visual hulls. Once the rigid motion across time is known, all of the silhouette images are treated as being captured at the same time instant, and the shape of the object is refined. We validate our algorithm on both synthetic and real data and compare it with space carving.
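
To make the refinement step concrete, the following is a minimal sketch of the underlying idea only, not the bounding edge representation or the alignment algorithm proposed in this paper. It assumes the rigid motion (R, t) between two time instants is already known, that the same static cameras are given as 3x4 projection matrices, and that silhouettes are boolean foreground masks. All function names (`project`, `inside_silhouette`, `in_refined_hull`) are hypothetical and chosen for illustration. A candidate 3D point is kept only if it projects inside every silhouette at the first instant and, after applying the rigid motion, inside every silhouette at the second instant.

```python
import numpy as np

def project(P, X):
    """Project 3D points X (N, 3) with a 3x4 camera matrix P; returns (N, 2) pixels."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous coordinates
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

def inside_silhouette(mask, uv):
    """True where pixel coords uv (N, 2), ordered (column, row), hit foreground."""
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    h, w = mask.shape
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros(len(uv), dtype=bool)  # points outside the image count as background
    out[in_image] = mask[v[in_image], u[in_image]]
    return out

def in_refined_hull(X, cams, masks_t1, masks_t2, R, t):
    """Silhouette-consistency test over two time instants.

    X is expressed in the object's pose at time 1; (R, t) is the assumed-known
    rigid motion taking the time-1 pose to the time-2 pose.
    """
    X2 = X @ R.T + t  # the same object points, moved to their time-2 positions
    keep = np.ones(len(X), dtype=bool)
    for P, m1, m2 in zip(cams, masks_t1, masks_t2):
        keep &= inside_silhouette(m1, project(P, X))
        keep &= inside_silhouette(m2, project(P, X2))
    return keep
```

With K cameras and two time instants, each point is tested against 2K silhouettes instead of K, which is why the combined hull can only be as large as, and is typically tighter than, the hull built from a single instant.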