Skip to Main Content
A novel video representation, the layered dynamic texture (LDT), is proposed. The LDT is a generative model, which represents a video as a collection of stochastic layers of different appearance and dynamics. Each layer is modeled as a temporal texture sampled from a different linear dynamical system. The LDT model includes these systems, a collection of hidden layer assignment variables (which control the assignment of pixels to layers), and a Markov random field prior on these variables (which encourages smooth segmentations). An EM algorithm is derived for maximum-likelihood estimation of the model parameters from a training video. It is shown that exact inference is intractable, a problem which is addressed by the introduction of two approximate inference procedures: a Gibbs sampler and a computationally efficient variational approximation. The trade-off between the quality of the two approximations and their complexity is studied experimentally. The ability of the LDT to segment videos into layers of coherent appearance and dynamics is also evaluated, on both synthetic and natural videos. These experiments show that the model possesses an ability to group regions of globally homogeneous, but locally heterogeneous, stochastic dynamics currently unparalleled in the literature.