Introduction
Pedestrians in a crowded scene exhibit interesting patterns of motion over time. Analysing these motion patterns across different types of crowded scenes helps to understand complex crowd behaviours, detect anomalous behaviours and predict unforeseen events which could pose a threat to the safety of the crowd.
Motion pattern segmentation is an automated visual surveillance task that divides a scene into regions of consistent and coherent motion. Grouping the patterns of crowd motion simplifies the process of crowd behaviour understanding/recognition and crowd anomaly detection [1]–[4] (the terms motion pattern segmentation and group detection are used interchangeably in this article). In spite of various efforts [5]–[10], precise motion pattern segmentation remains a challenging task due to varying crowd dynamics across different types of scenes and intricate interactions between the pedestrians within a scene. In general, a crowded scene can be either structured or unstructured. In a structured crowded scene, the direction of motion of the crowd remains the same most of the time and is easily predictable, whereas in an unstructured crowded scene the direction of motion changes chaotically and is unpredictable. Hence, a model designed specifically for one type of crowded scene need not perform well on the other. Most existing approaches to motion pattern segmentation perform less effectively when the type of scene changes, which results in wrongly detected segments.
In this article, we propose a scene-independent motion pattern segmentation approach by using the spatial as well as angular features of a trajectory and subsequently applying an improvised density-based clustering algorithm to group trajectories exhibiting similar motion patterns. The input trajectories generated by using the generalized KLT (gKLT) based key-point tracker [7] are averaged for a given period of time. The average spatial location and average angular orientation of each trajectory constitute its spatial and angular features, respectively. The spatio-angular features are then used to generate spatial information in terms of k-nearest neighbours of each trajectory and angular information in terms of angular deviation between trajectories. Finally, the spatio-angular information is used by the density-based clustering algorithm to group similar motion patterns. For simplicity, we denote our approach as SADC (Spatio Angular Density-based Clustering).
The proposed SADC algorithm is evaluated on the publicly available standard CUHK dataset [8] containing different types of scenes of varying densities and perspectives with ground truth available for motion pattern segmentation. For analysis of performance across different kinds of scenes, we divide the CUHK dataset into meaningful scene categories. Comparison of our results with the state-of-the-art motion pattern segmentation techniques shows that the proposed SADC approach is efficient and computationally faster than other techniques.
The four main contributions of this article are: (i) a faster, efficient and scene-independent method for grouping similar motion patterns, (ii) an averaging-based approach to extract meaningful features of crowd motion, (iii) an improvised density-based clustering algorithm where cluster membership is decided based on closeness of spatial and angular features of a trajectory when compared to other trajectories and (iv) a detailed analysis across different scene categories, which proposes a range of parameter values applicable for each scene category.
Related Work
There have been numerous attempts to improve the efficiency of motion pattern segmentation in crowded scenes, starting from the Lagrangian Particle Dynamics based approach proposed by Ali and Shah [5]. A concise review of these approaches can be found in [11], which divides them into three categories: flow field model-based, similarity-based, and probability model-based approaches. Most of these approaches either use homogeneous datasets (in terms of scene dynamics) or create ground truth that is aligned with their proposed method. In this article we use a similarity-based clustering approach which is evaluated over the CUHK dataset, containing diverse real-world scenarios along with ground truth for motion pattern segmentation. Hence, the remainder of this section discusses: (i) the notable and recent efforts towards efficient motion pattern segmentation/group detection using the CUHK dataset, and (ii) the approaches which are related to similarity-based trajectory clustering.
It was Zhou et al. [7] who initially created a Collective motion dataset in order to measure the collective behaviour within the crowd. In their proposed approach, an algorithm called Collective Merging (CM) was used to find coherent groups of trajectories within the crowd. The CM algorithm models (i) the local behaviour of the crowd using a weighted k-Nearest Neighbour (k-NN) graph and (ii) the global behaviour among the non-neighbours by finding similar paths in the k-NN graph. Subsequently Shao et al. [8] created the CUHK dataset from the Collective Motion dataset by adding new scenes and defining ground truth for the detection of similar motion patterns/groups within a scene (more details about CUHK can be found in Section IV-A). In their work, a Collective Transition (CT) algorithm was proposed for group detection. The CT algorithm improved an earlier approach for group detection based on Coherent Filtering (CF) [6], by modelling the trajectory data using Markov Chains. Both the CF and CT based algorithms used the concept of Coherent Neighbour Invariance (initially introduced in [6]) to keep track of the persistent members within a group over the course of time.
In another interesting work, Wang et al. [9] used a thermal diffusion-based model to generate a strong coherent motion field (known as the thermal energy field) from the noisy optical flow field computed from the input crowd video. Triangulation-based boundary detection, a watershed segmentation algorithm and a two-step graph-based clustering strategy were applied over the thermal energy field to cluster coherent motion regions. Trojanova et al. [12] adopted a weighted k-NN graph-based clustering approach, which used a data-driven threshold as compared to a static threshold in [7]. Fan et al. [13] used a Natural Nearest Neighbour algorithm (3N) to adaptively determine the optimal number of nearest neighbours (the k-value), in contrast to k-NN based approaches where the k-value needs to be determined experimentally. In their work, the 3N algorithm generates a crowd motion network from which similar motion patterns are detected using the concept of coherent neighbour invariance. A different approach called the Hybrid Social Influence Model (HSIM) was proposed by Ullah et al. [10], which used a density-independent version of the Social Force Model [14], [15] to model crowd motion and the Communal model [16] to group similar motion patterns. Topic models, which are popular in language processing, have been used by Chen et al. [17] for group detection. In their proposed approach, after dividing the input crowd image into a fixed number of patches (using a Simple Linear Iterative Clustering algorithm), a descriptor was computed for each patch by combining the feature points generated by the gKLT tracker and the orientation distribution of each feature point within the patch. A Latent Dirichlet Allocation-based model is then combined with a Markov Random Field to determine the topics. Based on these topics the features are grouped together. In a recent work, Wang et al. [18] proposed a self-weighted multiview clustering approach that combines an orientation-based graph and a structural context-based graph and applies a tightness-based merging strategy to detect groups within the crowd.
The approaches in [6]–[9] detect groups frame-by-frame, which results in group-switching (or cluster-switching) from frame to frame. Since motion pattern segmentation is a precursor to crowd behaviour monitoring and anomaly detection, regular switching of groups/clusters creates inconsistent results for these processes. Our approach is similar to that of He and Liu [19], who used a density-based clustering algorithm to perform motion pattern analysis in crowded scenes. However, there are two differences between the approaches: (i) to extract and represent the motion information from the input crowded video scene, we compute the averaged trajectory generated from an efficient and accurate KLT tracker [6], [20], whereas He and Liu used a global optical flow field computed from the traditional Lucas-Kanade-based approach [21]; (ii) to determine the spatial proximity, we employ a nearest neighbour-based strategy instead of the Euclidean-distance-based thresholding approach, which is scene-dependent and fails when the scene perspective changes.
The task of motion pattern segmentation can also be considered as a trajectory clustering problem. Trajectory clustering has been extensively researched, and the recent surveys on moving object trajectory clustering methods presented in [22]–[24] point out the challenges faced. The two main challenges are finding a suitable measure to compute the similarity among trajectories with varied properties, and finding a suitable algorithm to cluster the trajectories based on their similarities. Various similarity measures [25]–[28] and clustering algorithms [18], [19], [29]–[32] have been used for trajectory-based motion pattern analysis. In our approach we average the trajectories and utilise the averaged vector information to perform clustering.
Most of the motion pattern segmentation approaches discussed so far are applicable only to specific categories of crowded scene, suffer from excessive cluster switching, or are computationally intensive. This article proposes a spatio-angular density-based clustering approach which is efficient and stable across different scene categories, has minimal cluster switching and is computationally faster.
Spatio-Angular Density-Based Clustering (SADC) for Motion Pattern Segmentation
This article considers scenarios involving sparse to dense crowds in shopping malls, train stations, escalators, streets/sidewalks/markets, crosswalks, road traffic, marathons, military parades, public events and other indoor and outdoor scenes which are under surveillance for crowd management using static cameras. According to prior research [7], [33], the social behaviour of people walking in groups is such that all the members within the group tend to move in the same direction and are spatially close to each other. Based on these scenarios one can infer that a set of spatially close motion patterns belongs to the same group if they move together in the same direction. This work proposes a spatio-angular density-based clustering for detecting such coherent groups. The block diagram shown in Figure 1 illustrates the three phases of the proposed approach, namely, (i) Extraction of motion information, where motion information in the form of trajectories (tracks) is extracted from the input video by tracking key-points using a gKLT tracker, (ii) Computation of Angular & Spatial Information from the motion features, where the motion features in the form of average angular orientation and average spatial location (computed from each of the trajectories) are used to create angular and spatial information matrices, respectively, and (iii) Improvised Density-based clustering, which generates clusters containing similar motion patterns by considering the angular and spatial information. The proposed approach is explained in detail in the following sections.
Overview of the proposed approach.
A. Extraction of Motion Information
Given an input video, the first step is to capture the motion of the crowd. This is usually done by tracking a set of key-point features across the frames in the video which results in the generation of trajectories (or tracks) [34]. The problem of extracting motion information is now a tracking problem. The proposed approach uses the generalized KLT tracker (gKLT tracker [7] which is derived from the traditional KLT tracker [20]) because of its tracking accuracy and computational efficiency. Each of the generated trajectories is a combination of 2-D spatial co-ordinates
B. Computation of Angular and Spatial Information From the Motion Features
Two types of motion features are extracted from the refined trajectories, namely angular orientation and spatial location. The angular orientation feature gives the direction of motion of a crowd. If only the directions of motion patterns P & Q (as shown in Figure 2(a) & 2(b)) are used as the feature vector, the two patterns are clustered as a single set of motion trajectories, although in reality they should form different sets. Hence, this work introduces a second motion feature, the spatial location, which makes it possible to identify trajectories that are spatially distant from each other.
(a) A frame from a crowded video scene, (b) Two motion patterns P & Q, (c) P & Q represented in space with arrow pointing towards the direction of motion, (d) The directions of the motion patterns P & Q represented as angles
Since the trajectories are not all of the same length, computing the pair-wise similarities between them requires normalization of trajectory lengths. Averaging the trajectory data is not only an effective way to normalize the length of the trajectories but also captures the behaviour of the trajectories over time. The refined trajectory data is therefore averaged before computing the two motion features, as suggested in [12]. If the trajectory data for a trajectory $t_{i}$ consists of $m$ points $(x_{k_{t_{i}}}, y_{k_{t_{i}}})$, $k = 1, \ldots, m$, its average spatial location $\bar{s_{t_{i}}}$ is given by:
\begin{equation*} \bar {s_{t_{i}}}= \frac {1}{m} \sum _{k=1}^{m}(x_{k_{t_{i}}},y_{k_{t_{i}}}) \tag{1}\end{equation*}
To obtain the average angular orientation feature, the average displacement ($\overline{V}_{t_{i}}$) of the trajectory is computed first:
\begin{equation*} \overline {V}_{t_{i}}= \frac {1}{(m-1)} \sum _{k=2}^{m} \left ({(x_{k_{t_{i}}} - x_{(k-1)_{t_{i}}}),(y_{k_{t_{i}}} - y_{(k-1)_{t_{i}}}) }\right) \tag{2}\end{equation*}
The average angular orientation $\overline{\theta}_{t_{i}}$ is then obtained from the angle between $\overline{V} = (\overline{u}, \overline{v})$ and a reference vector $\hat{V}$:
\begin{align*} {\overline {\theta }_{t_{i}}} = \begin{cases} \displaystyle \cos^{-1} \left ({\frac {\overline {V} \cdot \hat {V}}{||\overline {V}||*||\hat {V}||} }\right)* \frac {180}{\pi }, & \overline {v}>0 \\ \displaystyle \left [{ 2\pi - \cos^{-1} \left ({\frac {\overline {V} \cdot \hat {V}}{||\overline {V}||*||\hat {V}||} }\right) }\right] * \frac {180}{\pi }, & \overline {u}\neq 0, \overline {v}\leq 0\\ \displaystyle 0, & \overline {u},\overline {v}=0 \end{cases} \tag{3}\end{align*}
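As an illustration of Eqs. (1)–(3), the two motion features of a single trajectory could be computed along the following lines. This is a minimal sketch rather than the authors' implementation: the function name `trajectory_features` is hypothetical, the reference vector $\hat{V}$ is assumed to be the unit vector along the x-axis (the text does not state it), and `atan2` is substituted for the piecewise arccosine form of Eq. (3), which under that assumption gives the same direction in degrees within [0, 360).

```python
import math


def trajectory_features(points):
    """Return (mean location, mean displacement, orientation in degrees).

    `points` is a list of (x, y) key-point positions of one trajectory.
    """
    m = len(points)
    if m < 2:
        raise ValueError("a trajectory needs at least two points")

    # Eq. (1): average spatial location over the m points.
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m

    # Eq. (2): average of the m-1 frame-to-frame displacement vectors.
    u = sum(points[k][0] - points[k - 1][0] for k in range(1, m)) / (m - 1)
    v = sum(points[k][1] - points[k - 1][1] for k in range(1, m)) / (m - 1)

    # Orientation of the mean displacement. ASSUMPTION: the reference
    # direction is the x-axis; atan2 replaces the arccos cases of Eq. (3).
    if u == 0 and v == 0:
        theta = 0.0  # static trajectory, as in the last case of Eq. (3)
    else:
        theta = math.degrees(math.atan2(v, u)) % 360.0

    return (mean_x, mean_y), (u, v), theta
```

For example, `trajectory_features([(0, 0), (1, 1), (2, 2)])` describes a trajectory centred at (1, 1) that moves diagonally away from the origin.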
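The obtained motion features ($\bar{s_{t_{i}}}$ and $\overline{\theta}_{t_{i}}$) are used to build the spatial and angular information matrices. The angular information matrix records the angular deviation between every pair of trajectories $t_{i}$ and $t_{j}$:
\begin{equation*} d_{angular}(\overline {\theta }_{t_{i}},\overline {\theta }_{t_{j}}) = \min(|\overline {\theta }_{t_{i}}-\overline {\theta }_{t_{j}}|, 2\pi - |\overline {\theta }_{t_{i}} - \overline {\theta }_{t_{j}}|) \tag{4}\end{equation*}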
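The wrap-around behaviour of Eq. (4) can be seen in a small sketch. The helper names `angular_deviation` and `angular_matrix` are hypothetical. Eq. (4) writes the full turn as $2\pi$, while Eq. (3) scales the orientation to degrees, so the sketch takes the full turn as a parameter and assumes 360 by default; pass `2 * math.pi` if the angles are kept in radians.

```python
def angular_deviation(theta_i, theta_j, full_turn=360.0):
    """Smallest angle between two directions, as in Eq. (4).

    ASSUMPTION: angles are in degrees unless `full_turn` says otherwise.
    """
    diff = abs(theta_i - theta_j) % full_turn
    return min(diff, full_turn - diff)


def angular_matrix(thetas, full_turn=360.0):
    """Pair-wise angular deviations for a list of orientations."""
    return [[angular_deviation(a, b, full_turn) for b in thetas]
            for a in thetas]
```

With this definition, two trajectories heading at 350 and 10 degrees are 20 degrees apart rather than 340, which is what allows them to fall under the same angular threshold.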
On the other hand, the spatial information matrix conveys whether the trajectories are spatially close to each other and is constructed by finding the k-Nearest Neighbours for each trajectory. Pair-wise Euclidean distance between the average spatial location features is used to find the nearest neighbours. The spatial information-based matrix is defined as follows:
\begin{align*} S(i,j) = \begin{cases} 1, & \text {if } t_{j} \in N(t_{i})\\ 0, & \text {otherwise} \end{cases} \tag{5}\end{align*}
where $N(t_{i})$ denotes the set of k-nearest neighbours of $t_{i}$. The number of neighbours k is controlled by the parameter $\alpha$:
\begin{equation*} k = \alpha * n \tag{6}\end{equation*}
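A minimal sketch of how the matrix of Eq. (5) might be built is given here. Several details are assumptions, not taken from the text: $n$ in Eq. (6) is taken to be the number of trajectories, $k$ is rounded up and clamped to the range [1, n-1], and the function name `spatial_matrix` is hypothetical.

```python
import math


def spatial_matrix(locations, alpha):
    """Binary k-NN matrix S with S[i][j] = 1 if t_j is among the k
    nearest neighbours of t_i (Eq. 5).

    `locations` holds the average spatial location of each trajectory.
    """
    n = len(locations)
    if n < 2:
        return [[0] * n for _ in range(n)]

    # Eq. (6): k = alpha * n. ASSUMPTION: n is the number of trajectories;
    # rounding up and clamping to [1, n-1] are choices made for this sketch.
    k = max(1, min(n - 1, math.ceil(alpha * n)))

    S = [[0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(locations):
        # Rank all other trajectories by Euclidean distance to t_i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(locations[j][0] - xi,
                                     locations[j][1] - yi),
        )
        for j in others[:k]:
            S[i][j] = 1
    return S
```

Note that a k-NN relation is not symmetric in general, so `S[i][j]` and `S[j][i]` may differ; whether the matrix is symmetrised afterwards is not specified in the text.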
C. Improvised Density-Based Clustering
The obtained angular and spatial information depicts how the motion patterns are spread across the scene. Finding similar motion patterns and grouping them (using the obtained information) becomes a clustering problem. Crowded scenes can contain motion patterns of arbitrary shape/size as well as motion patterns which do not belong to any group (noisy data). A density-based clustering algorithm is a natural choice in such cases. In the proposed approach, we use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. As proposed by Ester et al. [35], DBSCAN consists of two parameters: (i) the
Algorithm 1 Spatio-Angular Density-Based Clustering (SADC)
Set of ‘
‘
for k = 1 to n do
Compute
Compute the matrices
Perform DBSCAN to obtain ‘
for each
if
Assign
end if
end for
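The following sketch shows one way a density-based clustering step could consume the two matrices. It is not a reconstruction of Algorithm 1. It is a plain DBSCAN-style expansion in which the neighbourhood test is assumed to be "t_j is a k-nearest neighbour of t_i and their angular deviation does not exceed the angular threshold". The names `sadc_cluster`, `angle_thresh` and `min_pts` are hypothetical, and the default `min_pts=3` is an arbitrary choice.

```python
def sadc_cluster(S, A, angle_thresh, min_pts=3):
    """DBSCAN-style clustering on spatio-angular neighbourhoods.

    S: binary k-NN matrix, A: pair-wise angular deviation matrix.
    Returns one label per trajectory; -1 marks noise.
    """
    n = len(S)

    def neighbours(i):
        # ASSUMPTION: neighbours must be spatially close AND similarly oriented.
        return [j for j in range(n)
                if j != i and S[i][j] == 1 and A[i][j] <= angle_thresh]

    labels = [None] * n  # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) + 1 < min_pts:
            labels[i] = -1  # not dense enough to start a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbours(j)
            if len(nbrs_j) + 1 >= min_pts:
                queue.extend(nbrs_j)  # j is a core point: keep expanding
    return labels
```

Under these assumptions, two spatially interleaved flows that move in opposite directions end up in separate clusters, because the angular test removes the cross-flow links from every neighbourhood.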
Experimental Results and Discussion
A. Dataset
The CUHK dataset [8] is the only publicly available standard dataset with ground truth for motion pattern segmentation. The dataset contains a total of 474 videos captured from 215 scenes, out of which 300 videos are used for the purpose of motion pattern segmentation. The proposed approach is evaluated on these 300 videos captured across different types of scenes. For the purpose of analysis, these videos are divided (manually) into different categories based on (i) the property of crowd movement as (a) structured, (b) unstructured, (ii) the location of the scene as (a) indoor, (b) outdoor and (iii) the type of the scene as (a) cross walk, (b) escalator, (c) mass movement, (d) market, (e) public walkway, (f) shopping mall, (g) street. Figure 3 and Figure 4 show example scenes and statistics, respectively, for this categorization.
Example scenes for (a) cross walk, (b) escalator, (c) market, (d) mass movement, (e) public walkway, (f) shopping mall, (g) station, and (h) street. Scenes (d) and (f) are good examples for structured and unstructured, indoor and outdoor scenes respectively.
CUHK dataset categorization statistics. It can be observed that most of the Mass Movement scenes are Structured. Otherwise, majority of the scenes within the dataset are Unstructured.
B. Experimental Set-Up
The experiments are performed over all 300 scenes of the CUHK dataset on a personal computer with an Intel Core i5-8400 @ 2.80 GHz processor and 8 GB RAM. Firstly, we replicate the experimental set-up used by Shao et al. [8] to extract motion information. The gKLT tracker is initialized with a set of 3000 key-points which are tracked across the frames in the video. Short, static and noisy trajectories are filtered out by discarding those shorter than 10 frames and those with zero-displacement vectors for more than half of their duration. Secondly, for the purpose of grouping, trajectory data from multiple frames (30 frames) are considered in our approach. By doing so, we aim to obtain more information on the history of the path followed by a trajectory. Finally, for the purpose of clustering, the values for angular threshold & the
C. Results
To prove the effectiveness of the proposed approach, its performance is compared with the state-of-the-art approaches CM [7] and CT [8], whose binaries are publicly available. Note that, we have not considered CF for comparison because the CT algorithm (which is considered) is an improvement over CF. We have kept the original setting for the parameters of CM and CT. Since both the approaches generate clusters for each frame, for the purpose of evaluation and comparison, we choose the clusters generated at a particular frame (for each scene) as defined in the CUHK-dataset and align the proposed approach accordingly.
1) Qualitative Results
The category-wise qualitative results are shown in Figure 5. It can be observed that the proposed SADC algorithm performs well in all categories of the CUHK dataset, even in the case of complex scenes (Figure 5, columns 5-8). In fact, most of the clusters generated by the SADC algorithm are closer to the ground truth clusters than those of CM [7] and CT [8]. The CM algorithm uses a graph-based approach which relies on finding similar weighted paths and a connected component-based algorithm which clusters the similar nodes on the graph based on a threshold. While the path-based approach is effective, the less effective connected component-based clustering does not generate accurate clusters when the scene type changes. Also, the CM algorithm is dependent on the
2) Quantitative Results
For quantitative evaluation, motion pattern segmentation is treated as a clustering problem and performance is evaluated using widely used external cluster evaluation measures, namely Purity [36], Normalized Mutual Information (NMI) [37] and Rand Index (RI) [38]. All the performance measures lie in the range [0, 1], where a higher value indicates better clustering performance. The comparison results in Table 1 show the average performance over the 300 scenes of the CUHK dataset and indicate that the proposed approach outperforms the CM- and CT-based approaches.
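For reference, Purity and the Rand Index can be computed from two label lists as in the sketch below. These are the textbook definitions; the exact variants used in [36] and [38], and any special handling of noise labels, are not given in the text, so the helper names and conventions here are assumptions.

```python
from collections import Counter
from itertools import combinations


def purity(truth, pred):
    """Fraction of items that belong to the majority ground-truth class
    of their predicted cluster."""
    members = {}
    for t, p in zip(truth, pred):
        members.setdefault(p, []).append(t)
    majority = sum(max(Counter(m).values()) for m in members.values())
    return majority / len(truth)


def rand_index(truth, pred):
    """Fraction of item pairs on which the two labelings agree
    (both 'same cluster' or both 'different cluster')."""
    pairs = list(combinations(range(len(truth)), 2))
    if not pairs:
        return 1.0  # convention for fewer than two items
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j])
                for i, j in pairs)
    return agree / len(pairs)
```

For instance, merging two equally sized ground-truth groups into one predicted cluster halves the purity, and the Rand Index counts only the within-group pairs as agreements.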
Notice that the NMI value for CT is lower than that for CM, although CT performs better than CM on the other metrics. On inspection, we found that the NMI metric always yields NMI = 0 if one of the two clustering assignments (ground truth or clustering result) has a single cluster while the other has more than one cluster, which is not desirable. This is due to an underlying mathematical issue with the computation of Mutual Information, the discussion of which is beyond the scope of this article. Out of the 300 scenes in the CUHK dataset, 25% have a one-cluster ground truth. This is one of the main reasons why other works report lower NMI values when evaluated over the CUHK dataset.
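The behaviour described above can be reproduced with a few lines of code. The sketch assumes the geometric-mean normalisation $\mathrm{MI}/\sqrt{H(a)H(b)}$; the text does not state which normalisation [37] uses, and the function names are hypothetical. When one labeling has a single cluster its entropy is zero and so is the mutual information, so the score is 0 regardless of how the other labeling looks.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(a, b):
    """NMI with geometric-mean normalisation (an ASSUMPTION)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # convention: both labelings are a single cluster
    denom = math.sqrt(h_a * h_b)
    return mutual_information(a, b) / denom if denom > 0 else 0.0
```

Here `nmi([0, 0, 0, 0], [0, 0, 1, 1])` evaluates to zero even though every predicted cluster is pure with respect to the single ground-truth group, which is why the Rand Index is the more informative measure for one-cluster scenes.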
Figures 6 & 7 show the comparison results over different scene categories as defined in Section IV-A. For fair evaluation, we have considered an equal number of scenes for each scene category by randomly sampling the scenes. The proposed SADC algorithm not only performs best among the three algorithms but also performs well in all scene categories. This is because (i) SADC considers the history of the trajectory over multiple frames, which plays a crucial role during clustering, and (ii) SADC computes the angular deviation between the averaged trajectory vectors, which is well complemented by the k-nearest neighbour method; both are effective inputs for efficient density-based clustering across different scene categories. The NMI issue discussed earlier is clearly evident from Figures 6 & 7, where the Mass Movement scene category has an average NMI value close to zero for CM and CT and a low value for SADC (in this case, the value of the RI metric can be considered instead). The Mass Movement category involves scenes such as marathons, military parades and protest rallies, whose ground truth, in most cases, contains only one motion pattern.
Quantitative comparison result based on - Left: Property of crowd movement and Right: Scene location (Best viewed in color).
D. Parameter Analysis
As discussed in Section III-B & III-C, the proposed approach depends on two parameters, namely
Average NMI, Purity and Rand Index profiles respectively (from left to right) with respect to varying
Average NMI, Purity and Rand Index profiles respectively (from left to right) for with respect to varying
Figure 9 indicates the performance profiles of SADC algorithm with various combinations of
Further, in Figures 10 & 11 we report the performance profiles for each of the scene categories by keeping alpha constant (
Average NMI, Purity and Rand Index profiles respectively (from left to right) for with respect to varying
Average NMI, Purity and Rand Index profiles respectively (from left to right) for with respect to varying
E. Computational Complexity
The time taken to generate the motion pattern segmentation results plays an important role, especially in real-time scenarios. To calculate the time taken, we use the setting mentioned in Section IV-B and report the average time taken (in seconds) to generate the clustering result for 30 frames over all 300 scenes of the CUHK dataset. Note that we do not include the time taken to generate the trajectories, as it is the same for all three methods (CM, CT, SADC). Both CM and CT need to iterate frame-by-frame to perform clustering, whereas the proposed SADC algorithm generates clusters once every 30 frames. Therefore, SADC generates clusters significantly faster than the other two methods, as substantiated in Table 2.
Conclusion and Future Work
This article proposed a spatio-angular density-based clustering approach to group similar motion patterns. Spatial and angular features are obtained by averaging the trajectories across a fixed set of frames to capture better information about the history of the trajectories. These two features are then used by an improvised density-based clustering algorithm in which cluster membership is decided on the basis of two parameters, namely the angular deviation threshold and the spatial threshold. Qualitative and quantitative analysis over a range of parameter values for different scene categories has shown the robustness of the proposed algorithm.
In future work, we plan to integrate a crowd anomaly detection framework with the proposed approach. Furthermore, we intend to explore the inclusion of new features to tackle complex scenarios.