We describe a method for segmenting musical audio into structural sections based on a hierarchical labeling of spectral features. Frames of audio are first labeled as belonging to one of a number of discrete states using a hidden Markov model trained on the features. Histograms of the state labels of neighboring frames are then clustered into segment types, each representing a distinct distribution of states, using a clustering algorithm in which temporal continuity is expressed as a set of constraints modeled by a hidden Markov random field. We give experimental results showing that in many cases the resulting segmentations correspond well to conventional notions of musical form. We further show how the constrained clustering approach can easily be extended to include prior musical knowledge, input from other machine approaches, or semi-supervision.
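As a rough illustration of the second labeling stage, the sketch below builds a histogram of state labels over a window of neighboring frames and then groups those histograms into segment types. It rests on several assumptions that are not taken from the method itself: the per-frame state sequence is taken as already decoded, the window half-width and the toy data are arbitrary, and plain k-means with a deterministic farthest-point initialization stands in for the constrained clustering, so the temporal-continuity constraints of the hidden Markov random field are not modeled here. The names `state_histograms` and `kmeans` are hypothetical helpers.

```python
import numpy as np


def state_histograms(states, n_states, half_width):
    """Normalized histogram of state labels in a window around each frame."""
    states = np.asarray(states)
    n_frames = len(states)
    one_hot = np.eye(n_states)[states]  # shape (n_frames, n_states)
    hists = np.empty((n_frames, n_states))
    for t in range(n_frames):
        lo = max(0, t - half_width)
        hi = min(n_frames, t + half_width + 1)
        hists[t] = one_hot[lo:hi].mean(axis=0)  # windows are truncated at the edges
    return hists


def kmeans(points, k, n_iter=50):
    """Plain k-means (an assumed stand-in; no temporal constraints)."""
    # Deterministic farthest-point initialization, starting from the first frame.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Squared Euclidean distance from every histogram to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


# Toy state sequence (made up): one section alternating between states 0 and 1,
# followed by a section alternating between states 2 and 3.
states = [0, 1] * 20 + [2, 3] * 20
hists = state_histograms(states, n_states=4, half_width=3)
labels = kmeans(hists, k=2)
print(labels)
```

Note that no single state identifies a section in this toy sequence; it is the distribution of states within the window that separates the two segment types. Without continuity constraints, frames near a section boundary have mixed histograms and may be assigned to either type.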