Learning Methods for Dynamic Topic Modeling in Automated Behavior Analysis

Semisupervised and unsupervised systems provide operators with invaluable support and can tremendously reduce the operators’ load. In the light of the necessity to process large volumes of video data and provide autonomous decisions, this paper proposes new learning algorithms for activity analysis in video. The activities and behaviors are described by a dynamic topic model. Two novel learning algorithms based on the expectation maximization approach and variational Bayes inference are proposed. Theoretical derivations of the posterior estimates of model parameters are given. The designed learning algorithms are compared with the Gibbs sampling inference scheme introduced earlier in the literature. A detailed comparison of the learning algorithms is presented on real video data. We also propose an anomaly localization procedure, elegantly embedded in the topic modeling framework. It is shown that the developed learning algorithms can achieve 95% success rate. The proposed framework can be applied to a number of areas, including transportation systems, security, and surveillance.


I. INTRODUCTION
B EHAVIOUR analysis is an important area in intelligent video surveillance, where abnormal behaviour detection is a difficult problem.One of the challenges in this field is informality of the problem formulation.Due to the broad scope of applications and desired objectives there is no unique way, in which normal or abnormal behaviour can be described.In general, the objective is to detect unusual events and inform in due course a human operator about them.
This paper considers a probabilistic framework for anomaly detection, where less probable events are labelled as abnormal.We propose two learning algorithms and an anomaly localisation procedure for spatial detection of abnormal behaviours.

A. Related work
There is a wealth of methods for abnormal behaviour detection, for example, pattern-based methods [1]- [3].These methods extract explicit patterns from data and use them as behaviour templates for decision making.In [1] the sum of the visual features of a reference frame is treated as a normal behaviour template.Another common approach for representing normal templates is using clusters of visual features [2], [3].Visual features can range from raw intensity values of pixels to complex features that exploit the data nature [4].
In the testing stage new observations are compared with the extracted patterns.The comparison is based on some similarity measure between observations, e.g., the Jensen-Shannon divergence in [5] or the Z-score value in [2], [3].If the distance between the new observation and any of the normal patterns is larger than a threshold, then the observation is classified as abnormal.
Abnormal behaviour detection can be considered as a classification problem.It is difficult in advance to collect and label all kind of abnormalities.Therefore, only one class label can be expected and one-class classifiers are applied to abnormal behaviour detection: e.g., a one-class Support Vector Machine [6], a support vector data description algorithm [7], a neural network approach [8], a level set method [9] for normal data boundary determination [10].
Another class of methods rely on the estimation of probability distributions of the visual data.These estimated distributions are then used in the decision making process.Different kinds of probability estimation algorithms are proposed in the literature, e.g., based on non-parametric sample histograms [11], Gaussian distribution modelling [12].Spatiotemporal motion data dependency is modelled as a coupled Hidden Markov Model in [13].Auto-regressive process modelling based on self-organised maps is proposed in [14].
An efficient approach is to seek for feature sets that tend to appear together.These feature sets form typical activities or behaviours in the scene.Topic modeling [15], [16] is an approach to find such kinds of statistical regularities in a form of probability distributions.The approach can be applied for abnormal behaviour detection, e.g., [17]- [19].A number of variations of the conventional topic models for abnormal behaviour detection have been recently proposed: clustering of activity distributions [20], modelling temporal dependencies among activities [21], a continuous model for an object velocity [22].
Within the probabilistic modelling approach [12], [13], [17], [18], [20], [22] the decision about abnormality is mainly made by computing likelihood of a new observation.The comparison of the different abnormality measures based on the likelihood estimation is provided in [19].
Topic modeling is originally developed for text mining [15], [16].It aims to find latent variables called "topics" given the collection of unlabelled text documents consisted of words.In probabilistic topic modeling documents are represented as a mixture of topics, where each topic is assumed to be a distribution over words.
There are two main types of topic models: Probabilistic Latent Semantic Analysis (PLSA) [15] and Latent Dirichlet Al-location (LDA) [16].The former considers the problem from the frequentist perspective while the later studies it within the Bayesian approach.The main learning techniques proposed for these models include maximum likelihood estimation via the Expectation-Maximisation (EM) algorithm [15], variational Bayes inference [16], Gibbs sampling [23] and Maximum a Posteriori (MAP) estimation [24].

B. Contributions
In this paper inspired by ideas from [21] we propose an unsupervised learning framework based on a Markov Clustering Topic Model for behaviour analysis and anomaly detection.It groups possible topic mixtures of visual documents and forms a Markov chain for the groups.
The key contributions of this work consist in developing new learning algorithms, namely MAP estimation using the EM-algorithm and variational Bayes inference for the Markov Clustering Topic Model (MCTM), and in proposing an anomaly localisation procedure that follows concepts of probabilistic topic modeling.We derive the likelihood expressions as a normality measure of newly observed data.The developed learning algorithms are compared with the Gibbs sampling scheme proposed in [21].A comprehensive analysis of the algorithms is presented over real video sequences.The empirical results show that the proposed methods provide more accurate results than the Gibbs sampling scheme in terms of anomaly detection performance.
Our preliminary results with the EM-algorithm for behaviour analysis are published in [25].In contrast to [25] we now consider a fully Bayesian framework, where we propose the EM-algorithm for MAP estimates rather than the maximum likelihood ones.We also propose here a novel learning algorithm based on variational Bayes inference and a novel anomaly localisation procedure.The experiments are performed on more challenging datasets in comparison to [25].
The rest of the paper is organised as follows.Section II describes the overall structure of visual documents and visual words.Section III introduces the dynamic topic model.The new learning algorithms are presented in Section IV, where the proposed MAP estimation via the EM-algorithm and variational Bayes algorithm are introduced first and then the Gibbs sampling scheme is reviewed.The methods are given with a detailed discussion about their similarities and differences.The anomaly detection procedure is presented in Section V.The learning algorithms are evaluated with real data in Section VI and Section VII concludes the paper.

FRAMEWORK
Video analytics tasks can be formulated within the framework of topics modeling.This requires a definition of visual documents and visual words, e.g., as in [20], [21].The whole video sequence is divided into non-overlapping short clips.These clips are treated as visual documents.Each frame is divided next into grid cells of pixels.Motion detection is applied to each of the cells.The cells where motion is detected are called moving cells.For each of the moving cells the motion direction is determined.This direction is further quantised into four dominant ones -up, left, down, right (see Figure 1).The position of the moving cell and the quantised direction of its motion form a visual word.
Each of the visual documents is then represented as a sequence of visual words' identifiers, where identifiers are obtained by some ordering of a set of unique words.This discrete representation of the input data can be processed by topic modeling methods.

A. Motivation
In topic modeling there are two main kinds of distributions -the distributions over words, which correspond to topics, and the distributions over topics, which characterise the documents.The relationship between documents and words is then represented via latent low-dimensional entities called topics.Having only an unlabelled collection of documents, topic modeling methods restore a hidden structure of data, i.e., the distributions over words and the distributions over topics.
Consider a set of distributions over topics and a topic distribution for each document is chosen from this set.If the cardinality of the set of distributions over topics is less than the number of documents, then documents are clustered into groups such that documents have the same topic distribution within a group.A unique distribution over topics is called a behaviour in this work.Therefore, each document corresponds to one behaviour.In topic modeling a document is fully described by a corresponding distribution over topics, which means in this case a document is fully described by a corresponding behaviour.
There are a number of applications where we can observe documents clustered into groups with the same distribution over topics.Let us consider some examples from video analytics where a visual word corresponds to a motion within a tiny cell.As topics represent words that statistically often appear together, in video analytics applications topics define some motion patterns in local areas.
Let us consider a road junction regulated by traffic lights.A general motion on the junction is the same with the same traffic light regime.Therefore, the documents associated to the same traffic light regimes have the same distributions over topics, i.e., they correspond to the same behaviours.
Another example is a video stream generated by a video surveillance camera from a train station.Here it is also possible to distinguish several types of general motion within the camera scene: getting off and on a train and waiting for it.These types of motion correspond to behaviours, where the different visual documents showing different instances of the same behaviour have very similar motion structures, i.e., the same topic distribution.
Each action in real life lasts for some time, e.g., a traffic light regime stays the same and people get on and off a train for several seconds.Moreover, often these different types of motion or behaviours follow a cycle and their changes occur in some order.These insights motivate to model a sequence of behaviours as a Markov chain, so that the behaviours remain the same during some documents and change in a predefined order.The model that has these described properties is called a Markov Clustering Topic Model (MCTM) in [21].The next section formally formulates the model.

B. Model formulation
This section starts from the introduction of the main notations used through the paper.Denote by X the vocabulary of all visual words, by Y the set of all topics and by Z the set of all behaviours, x, y and z are used for elements from these sets, respectively.When an additional element of a set is required it is denoted with a prime, e.g., z is another element from Z.
Let x t = {x i,t } Nt i=1 be a set of words for the document t, where N t is the length of the document t.Let x 1:Ttr = {x t } Ttr t=1 denote a set of all words for the whole dataset, where T tr is the number of documents in the dataset.Similarly, denote by y t = {y i,t } Nt i=1 and y 1:Ttr = {y t } Ttr t=1 a set of topics for the document t and a set of all topics for the whole dataset, respectively.Let z 1:Ttr = {z t } Ttr t=1 be a set of all behaviours for all documents.
Note that x, y and z without subscript denote possible values for a word, topic and behaviour from X , Y and Z, respectively, while the symbols with subscript denote word, topic and behaviour assignments in particular places in a dataset.
Here, Φ is a matrix corresponding to the distributions over words given the topics, Θ is a matrix corresponding to the distributions over topics given behaviours.For a Markov chain of behaviours a vector π for a behaviour distribution for the first document and a matrix Ξ for transition probability distributions between the behaviours are introduced: where the matrices Φ, Θ and Ξ and the vector π are formed as follows.An element of a matrix on the i-th row and j-th column is a probability of the i-th element given the j-th one, e.g., φ x,y is a probability of the word x in the topic y.The columns of the matrices are then distributions for corresponding elements, e.g., θ z is a distribution over topics for the behaviour z.Elements of the vector π are probabilities of behaviours to be chosen by the first document.All these distributions are categorical.
The introduced distributions form a set of model parameters and they are estimated during a learning procedure.
Prior distributions are imposed to all the parameters.Conjugate Dirichlet distributions are used: where Dir(•) is a Dirichlet distribution and β, α, η and γ are the corresponding hyperparameters.As topics and behaviours are not known a priori and will be specified via the learning procedure, it is impossible to distinguish two topics or two behaviours in advance.This is the reason why all the prior distributions are the same for all topics and all behaviours.
The generative process for the model is as follows.All the parameters are drawn from the corresponding prior Dirichlet distributions.At each time moment t a behaviour z t is chosen first for a visual document.The behaviour is sampled using the matrix Ξ according to the behaviour chosen for the previous document.For the first document the behaviour is sampled using the vector π.
Once the behaviour is selected, the procedure of choosing visual words repeats for the number of times equal to the length of the current document N t .The procedure consists of two steps -sampling a topic y i,t using the matrix Θ according to the chosen behaviour z t followed by sampling a word x i,t using the matrix Φ according to the chosen topic y i,t for each token i ∈ {1, . . ., N t }, where a token is a particular place inside a document where a word is assigned.The generative process is summarised in Algorithm III.1.
Algorithm III.1 The generative process for the MCTM Require: The number of clips -T tr , the length of each clip -N t ∀t = {1, . . ., T tr }, the hyperparametersβ, α, η, γ; Ensure: The dataset x 1:Ttr = {x 1,1 , . . ., x i,t , . . ., x N T tr ,Ttr };  draw a visual word for the token i based on the chosen topic: The graphical model, showing the relationships between the variables, can be found in Figure 2. The full likelihood of the observed variables x 1:Ttr , the hidden variables y 1:Ttr and z 1:Ttr and the set of parameters Ω can be written then as follows: In [21] Gibbs sampling is implemented for parameters learning in the MCTM.We propose two new learning algorithms: based on an EM-algorithm for the MAP estimates of the parameters and based on variational Bayes inference to estimate posterior distributions of the parameters.We introduce the proposed learning algorithms below and briefly review the Gibbs sampling scheme.

A. Learning: EM-algorithm scheme
We propose a learning algorithm for MAP estimates of the parameters based on the Expectation-Maximisation algorithm [26].The algorithm consists of repeating E and M-steps.Conventionally, the EM-algorithm is applied to get maximum likelihood estimates.In that case the M-step is: where Ω old denotes the set of parameters obtained at the previous iteration and Q(Ω, Ω old ) is the expected logarithm of the full likelihood function of the observed and hidden variables: The subscript of the expectation sign means the distribution, with respect to which the expectation is calculated.During the E-step the posterior distribution of the hidden variables is estimated given the current estimates of the parameters.
In this paper the EM-algorithm is applied to get MAP estimates instead of traditional maximum likelihood ones.The M-step is modified in this case as: where p(Ω|β, α, η, γ) is the prior distribution of the parameters.
As the hidden variables are discrete, the expectation converts to a sum of all possible values for the whole set of the hidden variables {y 1:Ttr , z 1:Ttr }.The substitution of the likelihood expression from (2) into (5) allows to marginalise some hidden variables from the sum.The remaining distributions that are required for computing the Q-function are as follows: • p(z 1 = z|x 1:Ttr , Ω old ) -the posterior distribution of a behaviour for the first document; • p(z t = z , z t−1 = z|x 1:Ttr , Ω old ) -the posterior distribution of two behaviours for successive documents; • p(y i,t = y|x 1:Ttr , Ω old ) -the posterior distribution of a topic assignment for a given token; • p(y i,t = y, z t = z|x 1:Ttr , Ω old ) -the joint posterior distribution of a topic and behaviour assignments for a given token.With the fixed current values for these posterior distributions the estimates of the parameters that maximise the required functional of the M-step (5) can be computed as: where (a) + def = max(a, 0) [27]; β x , α y and γ z are the elements of the hyperparameter vectors β, α and γ, respectively; and p(y i,t = y, z t = z|x 1:Ttr , Ω old ) is the expected number of times, when the topic y is associated to the behaviour z; n EM z = p(z 1 = z|x 1:Ttr , Ω old ) is the "expected number of times", when the behaviour z is associated to the first document, in this case the "expected number" is just a probability, the notation is used for the similarity with the rest of the parameters; is the expected number of times, when the behaviour z is followed by the behaviour z .
During the E-step with the fixed current estimates of the parameters Ω old , the updated values for the posterior distributions of the hidden variables should be computed.The derivation of the updated formulae for these distributions is similar to the Baum-Welch forward-backward algorithm [28], where the EM-algorithm is applied to the maximum likelihood estimates for a Hidden Markov Model (HMM).This similarity appears because the generative model can be viewed as extension of a HMM.
For effective computation of the required posterior distributions the additional variables άz (t) and βz (t) are introduced.A dynamic programming technique is applied for computation of these variables.Having the updated values for άz (t) and βz (t) one can update the required posterior distributions of the hidden variables.The E-step is then formulated as follows (for simplification of notation the superscript "old" for the parameters variables is omitted inside the formulae): where K is a normalisation constant for all the posterior distributions of the hidden variables.
Starting with some random initialisation of the parameter estimates, the EM-algorithm iterates the E and M-steps until convergence.The obtained estimates of the parameters are used for further analysis.

B. Learning: Variational Bayes scheme
We also propose a learning algorithm based on the variational Bayes (VB) approach [29] to find approximated posterior distributions for both the hidden variables and the parameters.
During the update of the parameters the approximated distribution q(Ω) is further factorised: Note that this factorisation of approximated parameter distributions is a corollary of our model and not an assumption.The iterative process of updating the approximated distributions of the parameters and the hidden variables can be formulated as an EM-like algorithm, where during the Estep the approximated distributions of the hidden variables are updated and during the M-step the approximated distributions of the parameters are updated.
The M-like step is as follows: where βy , αz , η and γz are updated hyperparameters of the corresponding posterior Dirichlet distributions; and n VB x,y = Ttr t=1 Nt i=1 I(x i,t = x)q(y i,t = y) is the expected number of times, when the word x is associated with the topic y.Here and below the expected number is computed with respect to the approximated posterior distributions of the hidden variables; q(y i,t = y, z t = z) is the expected number of times, when the topic y is associated with the behaviour z; n VB z = q(z 1 = z) is the "expected number" of times, when the behaviour z is associated to the first document; is the expected number of times, when the behaviour z is followed by the behaviour z .
The following additional variables are introduced for the E-like step: φx,y = exp ψ βx,y − ψ θy,z = exp where ψ(•) is the digamma function.Using these additional notations, the E-like step is formulated the same as the E-step of the EM-algorithm, replacing everywhere the estimates of the parameters with the corresponding tilde introduced notation and true posterior distributions of the hidden variables with the corresponding approximated ones in ( 10) -( 16).
The point estimates of the parameters can be obtained by expected values of the posterior approximated distributions.An expected value for a Dirichlet distribution (a posterior distribution for all the parameters) is a normalised vector of hyperparameters.Using the expressions for the hyperparameters from ( 19) - (22), the final parameters estimates can be obtained by: C. Learning: Gibbs sampling algorithm In [21] the collapsed version of Gibbs sampling (GS) is used for parameter learning in the MCTM.The Markov chain is built to sample only the hidden variables y i,t and z t , while the parameters Φ, Θ and Ξ are integrated out (note that the distribution for the initial behaviour choice π is not considered in [21]).
During the burn-in stage the hidden topic and behaviour assignments to each token in the dataset are drawn from the conditional distributions given all the remaining variables.Following the Markov Chain Monte Carlo framework it would draw samples from the posterior distribution p(y 1:Ttr , z 1:Ttr |x 1:Ttr , β, α, η, γ).From the whole sample for {y 1:Ttr , z 1:Ttr } the parameters can be estimated by [23]: where n GS x,y is the count for the number of times when the word x is associated with the topic y, n GS y,z is the count for the topic y and the behaviour z pair, n GS z ,z is the count for the number of times when the behaviour z is followed by the behaviour z .

D. Similarities and differences of the learning algorithms
The point parameter estimates for all three learning algorithms ( 6) -( 9), ( 27) -( 30) and ( 31) -(33) have a similar form.The EM-algorithm estimates differ up to the hyperparameters reassignment -adding one to all the hyperparameters in the VB or GS algorithms ends up with the same final equations for the parameters estimates in the EMalgorithm.We explore this in the experimental part.This "-1" term in the EM-algorithm formulae ( 6) -( 8) occurs because it uses modes of the posterior distributions while the point estimates obtained by the VB and GS algorithms are means of the corresponding posterior distributions.For a Dirichlet distribution, which is a posterior distribution for all the parameters, mode and mean expressions differ by this "-1" term.
The main differences of the methods consist in the ways the counts n x,y , n y,z and n z ,z are estimated.In the GS algorithm they are calculated by a single sample from the posterior distribution of the hidden variables p(y 1:Ttr , z 1:Ttr |x 1:Ttr , β, α, γ).In the EM-algorithm the counts are computed as expected numbers of the corresponding events with respect to the posterior distributions of the hidden variables.In the VB algorithm the counts are computed in the same way as in the EM-algorithm up to replacing the true posterior distributions with the approximated ones.
Our observations for the dynamic topic model confirm the comparison results for the vanilla PLSA and LDA models provided in [30].

V. ANOMALY DETECTION
This paper presents on-line anomaly detection with the MCTM in video streams.The decision making procedure is divided into two stages.At a learning stage the parameters are estimated using T tr visual documents by one of the learning algorithms, presented in Section IV.After that during a testing stage a decision about abnormality of new upcoming testing documents is made comparing a marginal likelihood of each document with a threshold.The likelihood is computed using the parameters obtained during the learning stage.The threshold is a parameter of the method and can be set empirically, for example, to label 2% of the testing data as abnormal.This paper presents a comparison of the algorithms (Section VI) using the measure independent of threshold value selection.
We also propose an anomaly localisation procedure during the testing stage for those visual documents that are labelled as abnormal.This procedure is designed to provide spatial information about anomalies, while documents labelled as abnormal provide temporal detection.The following sections introduce both the anomaly detection procedure on a document level and the anomaly localisation procedure within a video frame.

A. Abnormal documents detection
The marginal likelihood of a new visual document x t+1 given all the previous data x 1:t can be used as a normality measure of the document [21]: If the likelihood value is small it means that the current document cannot be fitted to the learnt behaviours and topics, which represent typical motion patterns.Therefore, this is an indication for an abnormal event in this document.The decision about abnormality of a document is then made by comparing the marginal likelihood of the document with the threshold.
In real world applications it is essential to detect anomalies as soon as possible.Hence an approximation of the integral in (34) is used for efficient computation.The first approximation is based on the assumption that the training dataset is representative for parameter learning, which means that the posterior probability of the parameters would not change if there is more observed data: The marginal likelihood can be then approximated as Depending on the algorithm used for learning the integral in (36) can be further approximated in different ways.We consider two types of approximation.
1) Plug-in approximation: The point estimates of the parameters can be plug-in in the integral (36) for approximation: where δ a (•) is the delta-function with the centre in a; Φ, Θ, Ξ are point estimates of the parameters, which can be computed by any of the considered learning algorithms using ( 6) -( 8), ( 27) -( 29) or ( 31) -(33).
The product and sum rules, the conditional independence equations from the generative model are then applied and the final formula for the plug-in approximation is as follows: where the predictive probability of the behaviour for the current document, given the observed data up to the current document, can be computed via the recursive formula: The point estimates can be computed for all three learning algorithms, therefore a normality measure based on the plugin approximation of the marginal likelihood is applicable for all of them.
2) Monte Carlo approximation: If samples {Φ s , Θ s , Ξ s } from the posterior distribution p(Φ, Θ, Ξ|x 1:Ttr ) of the parameters can be obtained, the integral (36) is further approximated by the Monte Carlo method: where S is the number of samples.These samples can be obtained (i) from the approximated posterior distributions q(Φ), q(Θ), and q(Ξ) of the parameters, computed by the VB learning algorithm, or (ii) from the independent samples of the GS scheme.For the conditional likelihood p(x t+1 |x 1:t , Φ s , Θ s , Ξ s ) the formula (38) is valid.
Note that for the approximated posterior distribution of the parameters, i.e., the output of the VB learning algorithm, the integral (36) can be resolved analytically, but it would be computationally infeasible.This is the reason why the Monte Carlo approximation is used in this case.
Finally, in order to compare documents of different lengths the normalised likelihood is used as a normality measure s:

B. Localisation of anomalies
The topic modeling approach allows to compute a likelihood function not only of the whole document but of an individual word within the document too.Recall that the visual word contains the information about a location in the frame.We propose to use the location information from the least probable words (e.g., 10 words with the least likelihood values) to localise anomalies in the frame.Note, we do not require anything additional to a topic model, e.g., modelling regional information explicitly as in [31] or comparing a test document with training ones as in [32].Instead, the proposed anomaly localisation procedure is general and can be applied in any topic modeling based method, where spatial information is encoded to visual words.
The marginal likelihood of a word can be computed in a similar way to the likelihood of the whole document.For the point estimates of the parameters and plug-in approximation of the integral it is: For the samples from the posterior distributions of the parameters and the Monte Carlo integral approximation it is:

VI. PERFORMANCE VALIDATION
We compare the two proposed learning algorithms, based on EM and VB, with the GS algorithm, proposed in [21], on two real datasets.For the representation purposes the three-dimensional space is used.On the top row the colours correspond to the Dirichlet probability density function values in the area.On the bottom row there are samples generated from the corresponding density functions.The sample size is 500.

A. Setup
The performance of the algorithms is compared on the QMUL street intersection data [21] and Idiap traffic junction data [19].Both datasets are 45-minutes video sequences, captured busy traffic road junctions, where we use a 5-minute video sequence as a training dataset and others as a testing one.The documents that have less than 20 visual words are discarded from consideration.In practice these documents can be classified to be normal by default as there is no enough information to make a decision.The frame size for both datasets is 288×360.Sample frames are presented in Figure 3.
The size of grid cells is set to 8 × 8 pixels for spatial quantisation of the local motion for visual word determination.Non-overlapping clips with a one second length are treated as visual documents.
We also study the influence of the hyperparameters on the learning algorithms.In all the experiments we use the symmetric hyperparameters: α = {α, . . ., α}, β = {β, . . ., β}, γ = {γ, . . ., γ} and η = {η, . . ., η}.The three groups of the hyperparameters settings are compared: {α = 1, β = 1, γ = 1, η = 1} (referred as "prior type 1"), {α = 8, β = 0.05, γ = 1, η = 1} ("prior type H") and {α = 9, β = 1.05, γ = 2, η = 2} ("prior type H+1").Note that the first group corresponds to the case when in the EM-algorithm learning scheme the prior components are cancelled out, i.e., the MAP estimates in this case are equal to the maximum likelihood ones.The equations for the point estimates in the EM learning algorithm with the prior type H+1 of the hyperparameters settings are equal to the equations for the point estimates in the VB and GS learning algorithms with the prior type H of the settings.The corresponding Dirichlet distributions with all used parameters are presented in Figure 4.Note that parameter learning is an ill-posed problem in topic modeling [27].This means there is no unique solution for parameter estimates.We use 20 Monte Carlo runs for all the learning algorithms with different random initialisations resulting with different solutions.The mean results among these runs are presented below for comparison.
All three algorithms are run with three different groups of hyperparameters settings.The number of topics and behaviours is set to 8 and 4, respectively, for the QMUL dataset, 10 and 3 are used for the corresponding values for the Idiap dataset.The EM and VB algorithms are run for 100 iterations.The GS algorithm is run for 500 burn-in iterations and independent samples are taken with a 100 iterations delay after the burn-in period.

B. Performance measure
Anomaly detection performance of the algorithms depends on threshold selection.To make a fair comparison of the different learning algorithms we use a performance measure, which is independent of threshold selection.
In binary classification the following measures [28] are used: TP -true positive, a number of documents, which are correctly detected as positive (abnormal in our case); TNtrue negative, a number of documents, which are correctly detected as negative (normal in our case); FP -false positive, a number of documents, which are incorrectly detected as positive, when they are negative; FN -false negative, a number of documents, which are incorrectly detected as negative, when they are positive; precision = TP TP + FP a fraction of correct detections among all documents labelled as abnormal by an algorithm; recall = TP TP + FN -a fraction of correct detections among all truly abnormal documents.
The area under the precision-recall curve is used as a performance measure in this paper.This measure is more informative for detection of rare events than the popular area under the receiver operating characteristic (ROC) curve [28].

C. Parameter learning
We visualise the learnt behaviours for the qualitative assessment of the proposed framework (Figures 5 and 6).For illustrative purposes we consider one run of the EM learning algorithm with the prior type H+1 of the hyperparameters settings.
The behaviours learnt for the QMUL data are shown in Figure 5 (for visualisation words representing 50% of probability mass of a behaviour are used).One can notice that  5a), when cars move downward and upward on the road; left and right turns (the fourth behaviour in Figure 5d): some cars moving on the "vertical" road turn to the perpendicular road at the end of the vertical traffic flow; a left traffic flow (the second behaviour in Figure 5b), when cars move from right to left on the "horizontal" road; and a right traffic flow (the third behaviour in Figure 5c), when cars move from left to right on the "horizontal" road.Note that the ordering numbers of behaviours correspond to their internal representation in the algorithm.The transition probability matrix Ξ is used to recognise the correct behaviours order in the data.Figure 6 presents the behaviours learnt for the Idiap data.In this case the learnt behaviours have also a clear semantic meaning.The scene motion follows a cycle: a pedestrian flow (the first behaviour in Figure 6a), when cars stop in front of the stop line and pedestrians cross the road; a downward traffic flow (the third behaviour in Figure 6c), when cars move downward along the road; an upward traffic flow (the second behaviour in Figure 6b), when cars from left and right sides move upward on the road.

D. Anomaly detection
In this section the anomaly detection performance achieved by all three learning algorithms is compared.The datasets contain the number of abnormal events, such as jaywalking, car moving on the opposite lane, disruption of the traffic flow (see examples in Figure 7).
For the EM learning algorithm the plug-in approximation of the marginal likelihood is used for anomaly detection.For both the VB and GS learning algorithms both the plug-in and Monte Carlo approximations of the likelihood are used.Note that for the GS algorithm samples are obtained during the learning stage, 5 and 100 independent samples are taken.For the VB learning algorithm samples are obtained after the learning stage from the posterior distributions, parameters of which are learnt.This means that the number of samples that are used for anomaly detection does not influence on the computational cost of learning.We test the Monte Carlo approximation of the marginal likelihood with 5 and 100 samples for the VB learning algorithm.As a result, we have 21 methods to compare: obtained by three learning algorithms, three different groups of hyperparameters settings, one type of marginal likelihood approximation for the EM learning algorithm, two types of marginal likelihood approximation for the VB and GS learning algorithms, where two Monte Carlo approximations are used with 5 and 100 samples.The list of methods references can be found in Table I.
Note that we achieve a very fast decision making performance in our framework.Indeed, anomaly detection is made for approximately 0.0044 sec per visual document by the plug-in approximation of the marginal likelihood, for 0.0177 sec per document by the Monte Carlo approximation with 5 samples and for 0.3331 sec per document by the Monte Carlo approximation with 100 samples2 .
The mean areas under precision-recall curves for anomaly detection for all 21 compared methods can be found in Figure 8. Below we examine the results with respect to hyperparameters sensitivity, an influence of the likelihood approximation on the final performance, we also compare the learning algorithms and discuss anomaly localisation results.
1) Hyperparameters sensitivity: This section presents sensitivity analysis of the anomaly detection methods with respect to changes of the hyperparameters.The analysis of the mean areas under curves (Figure 8) suggests that the hyperparameters almost do not influence on the results of the EM learning algorithm, while there is a significant dependence between hyperparameters changes and results of the VB and GS learning algorithms.These conclusions are confirmed by examination of the individual runs of the algorithms.For example, Figure 9 presents the precisionrecall curves for all 20 runs with different initialisations of 4 methods for the QMUL data: the VB learning algorithm using the plug-in approximation of the marginal likelihood with the prior types 1 and H of the hyperparameters settings and the EM learning algorithm with the same prior groups of the hyperparameters settings.One can notice that the variance of the curves for the VB learning algorithm with the prior type 1 is larger than the corresponding variance with the prior type H, while the similar variances for the EM learning algorithm are very close to each other.
Note that the results of the EM learning algorithm with the prior type 1 do not significantly differ from the results with the other priors, despite of the fact that the prior type 1 actually cancels out the prior influence on the parameters estimates and equates the MAP and maximum likelihood estimates.We can conclude that the choice of the hyperparameters settings is not a problem for the EM learning algorithm and we can even simplify the derivations considering only the maximum likelihood estimates without the prior influence.
The VB and GS learning algorithms require a proper choice of the hyperparameters settings as they can significantly change the anomaly detection performance.This choice can be performed empirically or with the type II maximum likelihood approach [28].
2) Marginal likelihood approximation influence: In this section the influence of the type of the marginal likelihood approximation on the anomaly detection results is studied.
The average results for both datasets (Figure 8) demonstrate that the type of the marginal likelihood approximation does not influence remarkably on anomaly detection performance.As the plug-in approximation requires less computational resources both in terms of time and memory (as there is no need to sample and store posterior samples and average among them) this type of approximation is recommended to be used for anomaly detection in the proposed framework.
3) Learning algorithms comparison: This section compares the anomaly detection performance obtained by three learning algorithms.
The best results in terms of a mean area under a precisionrecall curve are obtained by the EM learning algorithm, the worst results are obtained by the GS learning algorithm (Figure 8 and Table II).In Table II for each learning algorithm the group of hyperparameters settings and the type of marginal likelihood approximation is chosen to have the maximum of the mean area under curves, where a mean is taken over independent runs of the same method and maximum is taken among different settings for the same learning algorithm.
Figure 10 presents the best and the worst precision-recall curves (in terms of the area under them) for the individual runs of the learning algorithms.The figure shows that among the individual runs the EM learning algorithm also demonstrates the most accurate results.Although, the minimum area under the precision-recall curve for the EM learning algorithm is  Figure 10.Precision-recall curves with the maximum and minimum areas under curves for the three learning algorithms (maximum and minimum is among all the runs with different initialisations for all groups of hyperparameters settings and all types of marginal likelihood approximations).(a) presents the "best" curves for the QMUL data, i.e., the curves with the maximum area under a curve.(b) presents the "worst" curves for the QMUL data, i.e., the curves with the minimum area under a curve.(c) presents the "best" curves for the Idiap data, (d) -the "worst" curves for the Idiap data.The variance of the precision-recall curves for both VB and GS learning algorithms is relatively small.However, the VB learning algorithm has the curves higher than the curves obtained by the GS learning algorithm.It can be confirmed by examination of the best and worst precision-recall curves (Figure 10) and the mean values of the area under curves (Figure 8 and Table II).
We also present the results of classification accuracy, i.e., the fraction of the correctly classified documents, for anomaly detection, which can be achieved with some fixed threshold.The best classification accuracy for the EM learning algorithm in both datasets can be found in Table III.4) Anomaly localisation: We apply the proposed method for anomaly localisation, presented in Section V-B, and get promising results.We demonstrate the localisation results for the EM learning algorithm with the prior type H+1 on both datasets in Figure 11.The red rectangle is manually set to locate the abnormal events within the frame, the arrows correspond to the visual words with the smallest marginal likelihood computed by the algorithm.It can be seen that the abnormal events correctly localised by the proposed method.
For quantitative evaluation we analyse 10 abnormal events (5 from each dataset).For each clip for a given number N top of the least probable words, we measure the recall: recall = TP

N an
, where N an is the maximum possible number of abnormal words among N top , i.e., N an = N top if N top ≤ N total an , where N total an is the total number of abnormal words, and N an = N total an if N top > N total an .Figure 12 presents the mean results for all events.One can notice, for example, that when the localisation procedure can possibly detect 45% of the total number of abnormal words, it correctly finds ≈ 90% of them.

VII. CONCLUSIONS
This paper presents two learning algorithms for the dynamic topic model for behaviour analysis in video: the EM-algorithm is developed for the MAP estimates of the model parameters and a variational Bayes inference algorithm is developed for calculating the posterior distributions of them.A detailed comparison of these proposed learning algorithms with the Gibbs sampling based algorithm developed in [21] is presented.The differences and the similarities of the theoretical aspects for all three learning algorithms are well emphasised.The empirical comparison is performed for abnormal behaviour detection using two unlabelled real video datasets.Both proposed learning algorithms demonstrate more accurate results than the algorithm proposed in [21] in terms of anomaly detection performance.
The EM learning algorithm demonstrates the best results in terms of the mean values of the performance measure, obtained by the independent runs of the algorithm with different random initialisations.Although, it is noticed that the variance among the precision-recall curves of the individual runs is relatively high.The variational Bayes learning algorithm shows the smaller variance among the precision-recall curves than the EM-algorithm.The results show that the VB algorithm answers are more robust to different initialisation values.However, it is shown that the results of the algorithm are significantly influenced by the choice of the hyperparameters.The hyperparameters require additional tuning before the algorithm can be applied to data.Note that the results of the EM learning algorithm only slightly depend on the choice of the hyperparameters settings.Moreover, the hyperparameters can be even set in such a way as the EM algorithm is applied to obtain the maximum likelihood estimates instead of the maximum a posteriori ones.Both proposed learning algorithms -EM and VB -provide more accurate results in comparison to the Gibbs sampling based algorithm.
We also demonstrate that consideration of marginal likelihoods of visual words rather than visual documents can provide satisfactory results about locations of anomalies within a frame.In our best knowledge the proposed localisation procedure is the first general approach in probabilistic topic modeling that requires only presence of spatial information encoded in visual words.

APPENDIX A EM-ALGORITHM DERIVATIONS
This Appendix presents the details of the proposed EM learning algorithm derivation.The objective function in the EM-algorithm is: On the M-step the function ( 44) is maximised with respect to the parameters Ω with fixed values for p(z 1 |x 1:Ttr , Ω Old ), p(z t , z t−1 |x 1:Ttr , Ω Old ), p(y i,t |x 1:Ttr , Ω Old ), p(y i,t , z t |x 1:Ttr , Ω Old ).The optimisation problem can be solved separately for each parameter, which leads to the equations ( 6) - (8).
On the E-step for the efficient implementation the forwardbackward steps are developed for the auxiliary variables άz (t) and βz (t): The recursive formula (11) is obtained by interchanging the terms in (46).The required posterior of the hidden variables terms p(z 1 |x 1:Ttr , Ω Old ), p(z t , z t−1 |x 1:Ttr , Ω Old ), p(y i,t |x 1:Ttr , Ω Old ), p(y i,t , z t |x 1:Ttr , Ω Old ) are then expressed via the axillary variables άz (t) and βz (t), which leads to (13) - (16).

APPENDIX B VB ALGORITHM DERIVATIONS
This Appendix presents the details of the proposed variational Bayes inference derivation.We have separated the parameters and the hidden variables.Let us consider the update formula of the variational Bayes inference scheme [28] for the parameters: log q(Ω) = Const+ E q(y 1:T tr ,z 1:T tr ) log p(x 1:Ttr , y 1:Ttr , z 1:Ttr , Ω|η, γ, α, β) = Const + E q(y 1:T tr ,z One can notice that log q(Ω) is further factorised as in (18).Now each factorisation term can be considered independently.Derivations of the equations ( 19) -( 22) are very similar to each other.We provide the derivation only of the term q(Φ): log q(Φ) = Const+ It can be noticed from (48) that the distribution of Φ is a product of the Dirichlet distributions (19).
The update formula in the variational Bayes inference scheme for the hidden variables is as follows: log q(y 1:Ttr , z 1:Ttr ) = Const+ E q(π)q(Ξ)q(Θ)q(Φ) log p( We know from the parameters update ( 19) -( 22) that their distributions are Dirichlet.Therefore, E q(π) log π z = ψ (η z ) − ψ z ∈Z ηz and similarly for all the other expected value expressions.
Using the introduced notations ( 23) -( 26) the update formula (49) for the hidden variables can be then expressed as: log q(y 1:Ttr , z The approximated distribution of the hidden variables is then: q(y 1:Ttr , z 1:Ttr ) = φ xi,t,yi,t θ yi,t,z (52) Therefore, to compute the required expressions of the hidden variables q(z 1 = z), q(z t−1 = z, z t = z ), q(y i,t = y, z t = z) and q(y i,t = y) one can use the same forward-backward procedure and update formula as in the E-step of the EM-algorithm replacing all the parameter variables with the corresponding introduced tilde variables.

Figure 1 .
Figure 1.Structure of the visual feature extraction: from an input frame (on the left) a map of local motions is calculated (in the centre).The motion is quantised into four directions to get the feature representation (on the right).

Figure 2 .
Figure 2. Graphical representation of the Markov Clustering Topic Model.

1 :
for all y ∈ Y do 2: draw a word distribution for the topic y: φ y ∼ Dir(φ y |β); 3: for all z ∈ Z do 4: draw a topic distribution for behaviour z: θ z ∼ Dir(θ z |α); 5: draw a transition distribution for behaviour z: ξ z ∼ Dir(ξ z |γ); is the expected number of times, when the word x is associated to the topic y, where I(•) is the indicator function; n EM y,z = Ttr t=1 Nt i=1

Figure 3 .
Figure 3. Sample frames of the real datasets.The top row presents two sample frames from the QMUL data, the bottom row presents two sample frames from the Idiap data.

Figure 4 .
Figure 4. Dirichlet distributions with different symmetric parameters ξ.For the representation purposes the three-dimensional space is used.On the top row the colours correspond to the Dirichlet probability density function values in the area.On the bottom row there are samples generated from the corresponding density functions.The sample size is 500.

Figure 5 .Figure 6 .
Figure 5. Behaviours learnt by the EM learning algorithm for the QMUL data.The arrows represent the visual words: the location and direction of the motion.The first behaviour (a) corresponds to the vertical traffic flow, the second (b) and the third (c) behaviours correspond to the left and right traffic flow, respectively.The fourth (d) behaviour correspond to turns that follow the vertical traffic flow.

Figure 7 .
Figure 7. Examples of abnormal events

Figure 8 .Figure 9 .
Figure 8. Results of anomaly detection.(a) are the mean areas under precision-recall curves for the QMUL data.(b) are the mean areas under precision-recall curves for the Idiap data.

Figure 11 .RecallFigure 12 .
Figure 11.Examples of anomalies localisation.The red rectangle is the manual localisation.The arrows represent the visual words with the smallest marginal likelihood, the locations of the arrows are the results of the algorithmic anomaly localisation.

1 :I
Ttr ) = Const + z∈Z I (z 1 = z) log πz + Ttr t=2 z∈Z z∈Z I (z t = z) I (z t−1 = z) log ξz,z + (y i,t = y) log φxi,t,y + Ttr t=1 Nt i=1 z∈Z y∈Z I (y i,t = y) I (z t = z) log θy,z ,yi,t θyi,t,zt , (51)where K is a normalisation constant.Note that the expression of the true posterior distribution of the hidden variables is the same up to replacing the true parameters variables with the corresponding tilde variables: p(y 1:Ttr , z 1:Ttr |x 1:Ttr , Ω)

Table II MEAN
AREA UNDER PRECISION-RECALL CURVES