We investigate a solution to the problem of multi-sensor scene understanding by formulating it in the framework of Bayesian model selection and structure inference. Humans robustly associate multimodal data as appropriate, but previous modelling work has focused largely on optimal fusion, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying, Bayesian solution to multi-sensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Such explicit inference of multimodal data association is also of intrinsic interest for higher level understanding of multisensory data. We illustrate this using a probabilistic implementation of data association in a multi-party audio-visual scenario, where unsupervised learning and structure inference is used to automatically segment, associate and track individual subjects in audiovisual sequences. Indeed, the structure inference based framework introduced in this work provides the theoretical foundation needed to satisfactorily explain many confounding results in human psychophysics experiments involving multimodal cue integration and association.