This paper explores the use of a multisensory information fusion technique based on dynamic Bayesian networks (DBNs) to model and understand the temporal behavior of facial expressions in image sequences. Our approach to facial expression understanding is a probabilistic framework that integrates DBNs with the facial action units (AUs) from psychology. The DBNs provide a coherent, unified, hierarchical probabilistic framework for representing the spatial and temporal information related to facial expressions and for actively selecting the most informative visual cues from the available information, thereby minimizing ambiguity in recognition. Facial expressions are recognized by fusing not only the current visual observations but also previous visual evidence. Consequently, modeling the temporal behavior of facial expressions makes recognition more robust and accurate. Experimental results demonstrate that our approach is well suited to facial expression analysis in image sequences.
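As a rough illustration of how current observations can be fused with previous evidence, the following minimal sketch performs recursive Bayesian filtering over a small set of discrete expression states. It is not the model used in this paper: the three states, the transition matrix, the per-frame likelihoods, and the function name `filter_step` are all hypothetical values chosen only to show the mechanics, and it omits the AU layer and the active cue selection.

```python
import numpy as np

def filter_step(prior_belief, transition, likelihood):
    """One recursive Bayesian update over discrete expression states.

    transition[i, j] is assumed to be P(state_t = j | state_{t-1} = i).
    likelihood[j] is assumed to be P(observation_t | state_t = j).
    """
    # Propagate the previous belief through the temporal link.
    predicted = transition.T @ prior_belief
    # Fuse the prediction with the current visual evidence.
    posterior = predicted * likelihood
    # Normalize so the belief is a probability distribution.
    return posterior / posterior.sum()

# Hypothetical 3-state example (e.g. neutral, happy, surprised).
# All numbers are illustrative assumptions, not values from the paper.
transition = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])
belief = np.full(3, 1.0 / 3.0)  # uniform initial belief
observations = [np.array([0.2, 0.6, 0.2]),
                np.array([0.3, 0.5, 0.2])]

for likelihood in observations:
    belief = filter_step(belief, transition, likelihood)

print(belief)
```

In this toy setting, the belief in the second state after two frames is higher than what the second frame's likelihood alone would give, which is the sense in which accumulating previous evidence can reduce ambiguity.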