Spatio-temporal salient features are widely used for the compact representation of objects and motions in video, especially for event and action recognition. Existing feature extraction methods have two main problems. First, they work in batch mode and mostly use Gaussian (linear) scale-space filtering for multi-scale feature extraction; this linear filtering blurs the edges and salient motions that should be preserved for robust feature extraction. Second, environmental motion and ego disturbances (e.g., camera shake) are usually not differentiated. These problems result in the detection of false features regardless of which saliency criterion is used. To address them, we developed a non-linear (scale-space) filtering approach that prevents both spatial and temporal dislocations. This model can provide a non-linear counterpart of the Laplacian of Gaussian to form the conceptual structure maps from which multi-scale spatio-temporal salient features are extracted. Preliminary evaluation shows promising results, with false detections removed.
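To illustrate the contrast between linear and non-linear scale-space filtering mentioned above, the sketch below implements classic Perona-Malik anisotropic diffusion in NumPy. This is a standard edge-preserving non-linear filter, not the specific model proposed in the paper: the diffusion coefficient shrinks near strong gradients, so edges stay sharp while flat regions are smoothed, unlike Gaussian filtering which blurs everything uniformly. The parameter names (`kappa`, `dt`) and the function itself are illustrative assumptions.

```python
import numpy as np

def perona_malik(img, iterations=20, kappa=0.1, dt=0.2):
    """Edge-preserving non-linear diffusion (Perona-Malik sketch).

    Unlike Gaussian (linear) scale-space filtering, the conduction
    coefficient is small where the local gradient is large, so strong
    edges are preserved while noise in flat regions is smoothed away.
    """
    u = img.astype(float).copy()
    for _ in range(iterations):
        # Nearest-neighbour differences in the four grid directions.
        dn = np.roll(u, 1, axis=0) - u
        ds = np.roll(u, -1, axis=0) - u
        de = np.roll(u, 1, axis=1) - u
        dw = np.roll(u, -1, axis=1) - u
        # Conduction coefficients: near 1 in flat areas, near 0 at edges.
        cn = np.exp(-(dn / kappa) ** 2)
        cs = np.exp(-(ds / kappa) ** 2)
        ce = np.exp(-(de / kappa) ** 2)
        cw = np.exp(-(dw / kappa) ** 2)
        # Explicit diffusion update.
        u += dt * (cn * dn + cs * ds + ce * de + cw * dw)
    return u
```

Running this on a noisy step-edge image reduces the noise in the flat regions while keeping the step itself sharp, which is the behaviour the abstract argues is needed for robust multi-scale feature extraction; a spatio-temporal variant would additionally diffuse along the time axis.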