An Adaptive Motion Model for Person Tracking with Instantaneous Head-Pose Features

This letter presents novel behaviour-based tracking of people in low-resolution video using instantaneous priors mediated by head-pose. We extend the Kalman Filter to adaptively combine motion information with an instantaneous prior belief about where the person will go based on where they are currently looking. We apply this new method to pedestrian surveillance, using automatically-derived head pose estimates, although the theory is not limited to head-pose priors. We perform a statistical analysis of pedestrian gazing behaviour and demonstrate tracking performance on a set of simulated and real pedestrian observations. We show that by using instantaneous 'intentional' priors our algorithm significantly outperforms a standard Kalman Filter on comprehensive test data.


Rolf H. Baxter, Michael J. V. Leach, Sankha S. Mukherjee, and Neil M. Robertson

Index Terms: Computer vision, context awareness, deep belief networks, head pose estimation, tracking, video signal processing, video surveillance.

I. INTRODUCTION
Tracking error in the Kalman Filter (KF) increases when rapid changes in target motion occur. In part, this is caused by lag in adjusting the error covariance matrix. In this letter we reduce pedestrian tracking error by combining target velocity with an intentional prior, defined as a prior that predicts rapid changes in target motion. Specifically, we use the control input of the KF to steer the state estimate more forcefully using pedestrian gazing behaviour.
As motivation, consider a scene in which pedestrians exhibit ad-hoc obstacle avoidance (e.g. a goods-vehicle parked on the sidewalk). To model motion, two approaches are available: learning every eventuality (high model complexity), or learning a new (informative) feature. In pedestrian tracking, typical motion can be learnt by using flow vectors and clustering, but this often requires a strong assumption that motion patterns are stable [2]-[5]. Persistent changes can be incorporated over time but ad-hoc trajectories are still typically seen as outliers [6], [7]. The resulting models cannot accurately reflect pedestrian response to spatio-temporal context, which could cause tracking failure and data-association errors, particularly if occlusions occur (Fig. 1). In such cases an intentional prior (feature) that can predict an ad-hoc change in trajectory is appealing. This theory also generalises to other intentional features: consider a car approaching a crossroads whose indicator light signals an intention to turn; contextual knowledge enables better predictions. Several authors have incorporated the concept of 'personal space' and collision avoidance into pedestrian tracking [8], [9]. Others have incorporated the idea that socially grouped pedestrians will attempt to stay in close proximity [10], [11]. Both concepts represent different intentional priors. In our work we show a generic way of integrating intentional priors into a Kalman Filter and demonstrate performance with a novel head-pose prior.

Fig. 1. (Top) A real person trajectory/head pose behaviour and predicted trajectory using a Kalman Filter (KF) and our intentional tracker (IT). Tracking failures can lead to target data association errors. (Bottom) Frames from the Benfold dataset [1] showing pedestrian head-pose.
Our pedestrian tracker takes as input the results of object detection and head-pose estimation. These areas are themselves challenging, especially in the presence of occlusions, camera motion and illumination changes [12]. Head-pose estimation is a thriving research topic producing ever-increasing accuracy levels: [1] reports an error of 24 degrees on real surveillance video, while [13]-[15] report similar accuracy and model anatomical constraints using joint body and head-pose estimation. None of this prior work estimates pedestrian position conditioned on head-pose, as we do in this letter.
Robertson and Reid [16] have already shown that head pose can facilitate behaviour explanation in low/medium resolution images. Sankaranarayanan et al. proposed to use head-pose for pedestrian tracking in [17], and presented an algorithm for obtaining high-resolution face images of pedestrians on-the-fly using Pan-Tilt-Zoom cameras. Separate works by Tung et al. [12] and Dee and Hogg [18], [19] consider a target's goal location when making motion predictions, but in all cases rely on learnt goal and trajectory change locations. In contrast, we propose that target motion can be predicted from head-pose.
No prior work has used head-pose to predict pedestrian position and applied it to real video data. In this letter we present for the first time a full derivation and evaluation, following encouraging early work [20].

II. USING HEAD POSE TO PREDICT BEHAVIOUR
We consider the application of pedestrian surveillance and tracking to demonstrate the efficacy of our method. The assumption is that people tend to look where they are going, which makes head pose an informative intentional prior for pedestrian targets. Within any tracking paradigm, knowing a target's destination is essential for dealing with occlusions and missing detections. We return to head-pose extraction in more detail in Section III-C.
We performed a statistical analysis of pedestrian trajectory and head pose behaviour to validate our hypothesis on three benchmark video datasets: Benfold [1], Caviar [21] and PETS 2007 [22]. We used manual annotations of person location (bounding box), head location (bounding box) and head pose direction (angle). We calculated the difference in angle between head-pose and the travel bearing for each pedestrian. For the Caviar and PETS datasets, travel bearing was calculated using the bounding boxes for each pedestrian to approximate the location of their feet. These locations were projected to the ground plane using Direct Linear Transformation with point correspondences [23], from which trajectories could be derived for each person. For each point in a trajectory the velocity was calculated and then smoothed by taking the mean of a 24-frame sliding window.
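The velocity smoothing and bearing computation described above can be sketched as follows. This is a minimal illustration, assuming ground-plane positions are already available as (x, y) tuples per frame; the function names, the trailing (rather than centred) window and the angle convention are assumptions made for the example.

```python
import math

def smoothed_velocities(positions, window=24):
    """Frame-to-frame velocities smoothed by the mean of a sliding window.

    positions: list of (x, y) ground-plane points, one per frame.
    A trailing window is assumed; it is shorter at the start of the track.
    """
    raw = [(x1 - x0, y1 - y0)
           for (x0, y0), (x1, y1) in zip(positions, positions[1:])]
    smoothed = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - window + 1): i + 1]
        vx = sum(v[0] for v in chunk) / len(chunk)
        vy = sum(v[1] for v in chunk) / len(chunk)
        smoothed.append((vx, vy))
    return smoothed

def travel_bearing(velocity):
    """Direction of travel in degrees in [0, 360), measured from the x axis."""
    vx, vy = velocity
    return math.degrees(math.atan2(vy, vx)) % 360.0

# Toy track: two steps along x, then one step along y.
track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
bearings = [travel_bearing(v) for v in smoothed_velocities(track, window=2)]
```

With a two-frame window, the final bearing averages the last step along x with the step along y, so the turn appears gradually rather than instantly.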
Formally, the head pose/direction deviation at each frame is calculated as the error between a person's head pose direction and their velocity direction at that frame. The extracted deviations were then analysed to expose their statistical properties, which were analytically compared. Mean and variance were extracted for 37 pedestrians from the Caviar dataset, 34 pedestrians from the PETS dataset, and 154 pedestrians from the Benfold dataset.
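A small sketch of the deviation statistics may help make this concrete. It assumes angles in degrees and wraps the signed difference into [-180, 180) so that, for example, 350 degrees against a bearing of 0 degrees counts as a deviation of -10 degrees; the wrapping convention and function names are assumptions made for illustration.

```python
def angular_deviation(head_pose_deg, bearing_deg):
    """Head pose minus travel bearing, wrapped to [-180, 180) degrees."""
    return (head_pose_deg - bearing_deg + 180.0) % 360.0 - 180.0

def mean_and_variance(values):
    """Mean and population variance of a list of deviations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Toy data: a pedestrian walking along 0 degrees while glancing left and right.
head = [10.0, 350.0, 5.0, 355.0]
bearing = [0.0, 0.0, 0.0, 0.0]
deviations = [angular_deviation(h, b) for h, b in zip(head, bearing)]
mean, var = mean_and_variance(deviations)
```

Without the wrap-around, the two glances at 350 and 355 degrees would be treated as very large deviations and would distort both statistics.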
Fig. 2 shows the probability density functions (PDFs) generated using the extracted statistics (underlying histograms were approximately Gaussian). The PDFs show clear support for the intuition that people look where they are going, showing high probability of head pose deviations close to zero. However, there are clear variations in behaviour between the datasets, which suggests that any head pose based intentional tracker would need to be optimised for the scene to balance the reliability of the feature. Given these results, we use the remainder of this letter to present our approach for integrating intentional priors into the KF with a head-pose based implementation.

III. KALMAN FILTER ADAPTATION
We now show how to integrate head pose information into a tracker. Note that although the algorithm is applied to pedestrian tracking, our approach remains generic and different intentional priors could be used (e.g. car indicator). As a basis for our tracker we use the KF [24] due to its clear assumptions, widespread use and efficiency.

A. Kalman Filter Preliminaries
For brevity we only highlight pertinent aspects of the KF (for a thorough introduction see [24], [25]). The KF estimates the state $\mathbf{x}_t$ of a discrete-time controlled process governed by the linear equation $\mathbf{x}_t = A\mathbf{x}_{t-1} + B\mathbf{u}_{t-1} + \mathbf{w}_{t-1}$ with measurements $\mathbf{z}_t = H\mathbf{x}_t + \mathbf{v}_t$ (where $t$ indicates time). We represent the position and velocity of a target by the state vector $\mathbf{x}_t = [x, y, \dot{x}, \dot{y}]^T$, where $\dot{x}$ and $\dot{y}$ represent the target's velocity with respect to its position. $\mathbf{w}$ and $\mathbf{v}$ are the process and measurement noise (respectively) and are assumed to be independent and normally distributed with zero mean and covariance $Q$ and $R$ (respectively).
$A$ relates the state of the process at $t-1$ to $t$, $B$ is the process control input model, $\mathbf{u}$ is the control vector (set to 1 in the experiments) and $H$ is the observation matrix:

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \qquad (1)$$

That is, we measure target position but not velocity, where a measurement consists of the tuple $(x, y)$ and is initialised as: .
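As a concrete illustration of the filter summarised above, the following minimal sketch implements one predict/update cycle in plain Python. It assumes a constant-velocity transition matrix with a unit time step and illustrative noise values; these, along with the helper names, are assumptions for the example rather than the settings used in the experiments.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b, sign=1.0):
    """Element-wise a + sign * b."""
    return [[x + sign * y for x, y in zip(r, s)] for r, s in zip(a, b)]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def scaled_identity(n, scale=1.0):
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]

# Assumed constant-velocity transition with unit time step; state is [x, y, vx, vy]^T.
A = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
# Observation matrix: position is measured, velocity is not.
H = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]

def predict(x, P, Q):
    """Time update without control input: x = A x, P = A P A^T + Q."""
    return matmul(A, x), add(matmul(matmul(A, P), transpose(A)), Q)

def update(x, P, z, R):
    """Measurement update with a position measurement z (2x1 column)."""
    innovation = add(z, matmul(H, x), sign=-1.0)
    S = add(matmul(matmul(H, P), transpose(H)), R)        # innovation covariance
    K = matmul(matmul(P, transpose(H)), inverse_2x2(S))   # Kalman gain
    x_new = add(x, matmul(K, innovation))
    P_new = matmul(add(scaled_identity(4), matmul(K, H), sign=-1.0), P)
    return x_new, P_new

x0 = [[0.0], [0.0], [1.0], [0.0]]   # start at the origin, moving along x
P0 = scaled_identity(4)
x_pred, P_pred = predict(x0, P0, scaled_identity(4, 0.01))
x_est, P_est = update(x_pred, P_pred, [[1.2], [0.1]], scaled_identity(2, 1.0))
```

The corrected position lies between the prediction and the measurement, and the position uncertainty shrinks after the update, which is the behaviour the adaptive tracker later modifies through the prediction step.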

B. Integrating Intentional Priors
We fuse intentional priors into the KF, firstly, by calculating the strength of the prior using the absolute magnitude of the deviations for the last 10 time steps (arbitrarily chosen). This allows the strength to combine both the magnitude and persistence of the prior signal. Rather than using the raw angles, we eliminate small fluctuations in deviation and detection inaccuracies by using a binning procedure to partition the velocity and head pose directions into 8 bins (numbered 1 to 8). Each bin represents a 45° sector (see Fig. 3). This procedure allows a smoothed estimate of the head-pose deviation signal (discussed in Section II) to be generated. The signal strength at each time step is then calculated from the binned head pose direction and the binned direction of travel (2).
Next, we weight the influence of the prior. Intuitively, the weight should increase in line with the strength of the prior. A sigmoid function applied to the strength is a simple and effective way to achieve this (3). The sigmoid has two parameters that could be optimised for the scene to reflect the reliability of the prior: the first adjusts the rate at which the function moves from zero to one and the second adjusts the 'base-weight' (weight given for zero strength). Rather than optimising for any particular scene, we use parameter values that were empirically derived in [20] (see Section IV for further details).
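The binning, strength and weighting steps can be sketched as below. This is a hedged illustration only: it assumes sectors aligned to multiples of 45 degrees, a strength formed by summing circular bin differences over the last 10 steps, and a logistic weight with hypothetical `rate` and `offset` parameters. The exact forms of (2) and (3) and the parameter values from [20] may differ.

```python
import math

def to_bin(angle_deg, n_bins=8):
    """Map an angle in degrees to a sector index 1..n_bins (45 degree sectors for 8 bins)."""
    width = 360.0 / n_bins
    return int((angle_deg % 360.0) // width) % n_bins + 1

def bin_difference(a, b, n_bins=8):
    """Absolute difference between two bin indices, wrapping around the circle (assumed)."""
    d = abs(a - b) % n_bins
    return min(d, n_bins - d)

def prior_strength(head_bins, travel_bins, history=10):
    """Sum of absolute binned deviations over the last `history` steps (assumed form)."""
    pairs = list(zip(head_bins, travel_bins))[-history:]
    return sum(bin_difference(h, v) for h, v in pairs)

def prior_weight(strength, rate=1.0, offset=-3.0):
    """Logistic weight in (0, 1); `rate` and `offset` are hypothetical parameters."""
    return 1.0 / (1.0 + math.exp(-(rate * strength + offset)))

# A pedestrian travelling at 0 degrees while looking towards 90 degrees for 10 frames.
head_bins = [to_bin(90.0)] * 10
travel_bins = [to_bin(0.0)] * 10
strength = prior_strength(head_bins, travel_bins)
weight = prior_weight(strength)
```

In this sketch a brief glance contributes little to the strength, whereas a sustained look away from the direction of travel accumulates over the window and pushes the weight towards one.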
Having determined the weight, the transition model ($A$) is adjusted to reduce the influence of the target's previous motion. The motion model at each time step is updated as in (4), which reduces the influence of $\dot{x}$ and $\dot{y}$ during the prediction step of the algorithm according to the weight of the prior. The influence of the intentional prior is asserted using the control matrix $B$, defined in (5) and (6) from the geometric distance travelled by the target between consecutive time steps and the predicted travel direction based on the head pose angle. Two approaches could be used for calculating the distance travelled. It could be estimated from the KF state estimate, which includes an estimate of the target's velocity given the observations. Alternatively, a smoothed velocity could be calculated from the observations. In practice the second approach was found to give better performance using an empirically derived smoothing parameter. Having finally defined all of the components required to generate the prediction, the remainder of the KF algorithm remains the same. Predictions are now based on both a target's previous motion and the intentional prior, each weighted accordingly.
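One way to picture the adapted prediction step is the sketch below. It assumes that the previous velocity is weighted by (1 - alpha) and a head-pose displacement by alpha, with the displacement taken as the recent distance travelled along the head-pose direction; the names `alpha`, `distance` and `adapted_prediction` are hypothetical and the exact forms of (4) to (6) may differ.

```python
import math

def adapted_prediction(state, alpha, head_pose_deg, distance):
    """One prediction step blending previous motion with a head-pose prior.

    state: (x, y, vx, vy); alpha: prior weight in [0, 1];
    distance: recent distance travelled per step (for example a smoothed speed).
    Assumed form: velocity terms are scaled by (1 - alpha) and a control
    term of length `distance` along the head-pose direction is scaled by alpha.
    """
    x, y, vx, vy = state
    theta = math.radians(head_pose_deg)
    px, py = distance * math.cos(theta), distance * math.sin(theta)
    new_vx = (1.0 - alpha) * vx + alpha * px
    new_vy = (1.0 - alpha) * vy + alpha * py
    return (x + new_vx, y + new_vy, new_vx, new_vy)

# With no prior weight the step reduces to constant-velocity prediction.
plain = adapted_prediction((0.0, 0.0, 1.0, 0.0), alpha=0.0, head_pose_deg=90.0, distance=1.0)
# With full prior weight the target is predicted to move where it is looking.
steered = adapted_prediction((0.0, 0.0, 1.0, 0.0), alpha=1.0, head_pose_deg=90.0, distance=1.0)
```

Intermediate weights interpolate between the two extremes, which is how a strong and persistent head-pose signal can anticipate a turn before the velocity estimate has caught up.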

C. Head-Pose Extraction
We validate our approach in a visual surveillance application. Although not the focus of our work, we briefly discuss the novel head-pose extraction procedure used within our validation. We trained a Deep Belief Network [26] using the combined datasets: [1], [21], [27]. Heads were binned into 45° poses and reflected in the y axis to reduce dataset bias. The histogram-equalised raw image data (cropped head bounding boxes) were scaled to pixels each and provided to a Deep Belief Network parameterised as follows: number of units per layer: 1024, 400, and 8 (first, second and third layers respectively); dropout rate: 20% (layer 1 only); unit type: rectified linear (layers 1 and 2), softmax (classifier layer 3). All layers were trained for Epochs each using variable learning rates. For the third layer we used an 80:20 train/test split for the Benfold dataset, and a 50:50 split for the Caviar data. The resulting confusion matrices are shown in Fig. 3.

IV. EXPERIMENTS
We compare performance of our tracker against the standard KF (by which we mean having no head-pose information) using the Benfold [1] and Caviar [21] video datasets. To overcome the limited presence of obstacle avoidance behaviours within these datasets (manifested as trajectories with sharp turns), we also compare performance on a corpus of simulated trajectories and hope to obtain additional video examples in the future. Our focus is the development of an intentional tracker, so we primarily use hand-annotated head-poses under the assumption that object detection and head-pose estimation are provided by the current state-of-the-art. However, as a final validation we also report performance using detections from our deep belief network (Deep BN) head-pose classifier.
Throughout the experiments and , where indicates the n-dimensional identity matrix.We use improvement in mean squared error (MSE) and improvement in cumulative log likelihood (CLL) as our evaluation metrics, where MSE is the sum of squared differences between predicted and real trajectories and improvement is defined as: (IT: intentional tracker).CLL is based on the measurement innovation and is defined as and .Improvement in CLL is: . CLL measures how well the innovation covariance is modelled and is a useful metric when MSE cannot be calculated.We used the optimal sigmoid parameters derived empirically in [20] throughout our experiments ( , ) which gives high weight to the head-pose prior.No further parameter optimisation or tuning was performed for any of the datasets.For both the simulated and real datasets we synthesised pedestrian detection errors at different rates by withholding 0-40% of observations (uniform distribution).
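The evaluation quantities can be illustrated with a short sketch. It assumes that improvement is a relative reduction in percent, that the per-step log likelihood is that of a zero-mean 2-D Gaussian over the innovation with covariance S, and that missed detections are simulated by independently dropping each observation; all function names are hypothetical and the letter's exact definitions may differ.

```python
import math
import random

def percent_improvement(baseline_error, tracker_error):
    """Relative reduction in error, in percent (assumed definition)."""
    return 100.0 * (baseline_error - tracker_error) / baseline_error

def innovation_log_likelihood(y, S):
    """Log density of a zero-mean 2-D Gaussian with covariance S at innovation y."""
    (a, b), (c, d) = S
    det = a * d - b * c
    # Mahalanobis term y^T S^-1 y using the closed-form 2x2 inverse.
    maha = (d * y[0] ** 2 - (b + c) * y[0] * y[1] + a * y[1] ** 2) / det
    return -0.5 * (maha + math.log(det) + 2.0 * math.log(2.0 * math.pi))

def cumulative_log_likelihood(innovations, covariances):
    """Sum of per-step innovation log likelihoods over a track."""
    return sum(innovation_log_likelihood(y, S)
               for y, S in zip(innovations, covariances))

def withhold(observations, rate, seed=0):
    """Replace roughly a fraction `rate` of observations with None (missed detections)."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else z for z in observations]

identity = [[1.0, 0.0], [0.0, 1.0]]
cll = cumulative_log_likelihood([(0.0, 0.0), (1.0, 0.0)], [identity, identity])
thinned = withhold([(float(i), 0.0) for i in range(100)], rate=0.4)
```

A tracker whose innovation covariance matches the size of its actual prediction errors scores a higher cumulative log likelihood, which is why the metric remains usable when ground-truth positions are unavailable.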

A. Obstacle Avoidance
We consider obstacle avoidance trajectories using a simulated corpus containing 3500 trajectories of 200 time steps (typical track length in a surveillance video).Representative trajectories are shown in Fig. 4 to which Gaussian noise was added to the true target positions ( , ) and to observations ( , ).The direction of travel between and ( ) was used to simulate head-pose direction to which Gaussian distributed noise was also added ( , ).Fig. 4 shows that our approach outperforms the baseline for each trajectory.Performance is degraded by the sharpness of trajectory changes, with worst performance obtained for trajectories ii and v.
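A simulated trajectory of this kind could be generated along the following lines. The sketch assumes a unit step length, a single 90 degree turn, and placeholder noise levels; none of these values are taken from the experiments, and the head pose is simulated here from the intended heading of each step.

```python
import math
import random

def simulate_track(headings_deg, speed=1.0, pos_sigma=0.1, obs_sigma=0.5,
                   pose_sigma=10.0, seed=0):
    """Generate true positions, noisy observations and noisy head poses.

    headings_deg: intended direction of travel at each step (degrees).
    All noise levels are hypothetical placeholders.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    truth, observations, head_poses = [], [], []
    for heading in headings_deg:
        theta = math.radians(heading)
        # True motion with Gaussian process noise.
        x += speed * math.cos(theta) + rng.gauss(0.0, pos_sigma)
        y += speed * math.sin(theta) + rng.gauss(0.0, pos_sigma)
        truth.append((x, y))
        # Noisy position observation.
        observations.append((x + rng.gauss(0.0, obs_sigma),
                             y + rng.gauss(0.0, obs_sigma)))
        # Head pose follows the direction of travel, plus Gaussian noise.
        head_poses.append((heading + rng.gauss(0.0, pose_sigma)) % 360.0)
    return truth, observations, head_poses

# A sharp 90 degree turn half way through a 200-step track.
headings = [0.0] * 100 + [90.0] * 100
truth, observations, head_poses = simulate_track(headings)
```

Fixing the seed keeps a corpus reproducible, so both trackers can be compared on identical noise realisations.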

B. Annotated Detections
Fig. 5(a) shows performance on the video datasets when using annotated detections. This consisted of person head-pose for the Intentional Tracker and body bounding box for the standard KF. Our approach outperforms the standard KF under all conditions. At a detection rate of 60% we maintained improvements of (Benfold) and (Caviar). The video datasets contained fewer challenging (e.g. sharp turn) trajectories than the simulated corpora, but head-pose behaviour was occasionally affected by distractions (e.g. shop windows), making all datasets equally challenging.

C. Real Detections
We next evaluate tracker performance with real head-pose classifications. For the Benfold dataset head detections were provided by a re-implementation of [1]. For Caviar the hand-annotated head detections were used. For both datasets detected heads were classified using our novel Deep BN head-pose classifier (Fig. 3). Fig. 5(b) shows that we achieved median improvements of 5.9% on the Benfold data and 15.8% on Caviar. Since there are only 7 examples of sudden trajectory changes in the Benfold dataset (none are occluded), we synthesised occlusions on these trajectories. Specifically, for each change in trajectory we withheld a window of observations from each tracker to occlude the change (see Fig. 1). Table I shows the improvement (i.e. reduction) in mean squared error (MSE) between the predicted and withheld pedestrian observations. A mean reduction of 62.9% was achieved across the 7 trajectories.
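The synthetic occlusion procedure can be sketched as follows, assuming the frame index of each trajectory change is known and that the window is symmetric around it. The window size and function names are hypothetical choices for illustration.

```python
def occlude_window(observations, centre, half_width):
    """Withhold observations in a window around index `centre` (the turn)."""
    lo = max(0, centre - half_width)
    hi = min(len(observations), centre + half_width + 1)
    return [None if lo <= i < hi else z for i, z in enumerate(observations)]

def mse_over_occlusion(predictions, observations, occluded):
    """Mean squared distance between predictions and the withheld observations."""
    errors = [(px - zx) ** 2 + (py - zy) ** 2
              for (px, py), (zx, zy), o in zip(predictions, observations, occluded)
              if o is None]
    return sum(errors) / len(errors)

# Toy example: a straight track, with predictions offset by one unit in y.
observations = [(float(i), 0.0) for i in range(10)]
occluded = occlude_window(observations, centre=5, half_width=2)
predictions = [(float(i), 1.0) for i in range(10)]
error = mse_over_occlusion(predictions, observations, occluded)
```

Each tracker would be run on the occluded sequence and scored only on the withheld frames, so the comparison isolates how well each one predicts through the turn.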

V. CONCLUSION
This work has shown that head-pose and direction of travel are well correlated in some environments and we have proposed head pose as a good intentional prior for pedestrian surveillance. Our experimental validation showed that our intentional tracker could significantly outperform the standard KF on both video and synthetic datasets containing sudden changes in behaviour.
In the future we intend to use contextual information to switch between different intentional priors.

Fig. 2. Probability density functions and associated statistics for head pose deviations extracted from three video datasets.

Fig. 3. Confusion matrices and true positive/false positive rates (TPR/FPR) for our deep belief network head-pose classifier on the Benfold and Caviar datasets. In square brackets: number of head-pose examples.

Fig. 5. Improvement in Cumulative Log Likelihood (CLL) by our intentional tracker over a standard KF. (a) Using the simulated, Benfold, & Caviar datasets under three head/body detection rates & hand-annotated head-pose. (b) Using head-pose classifications from our deep belief network (Deep BN).

TABLE I. PERCENTAGE IMPROVEMENT (REDUCTION) IN MEAN SQUARED ERROR (MSE) DURING OCCLUSION FOR 7 TRAJECTORIES