
Scene-Independent Motion Pattern Segmentation in Crowded Video Scenes Using Spatio-Angular Density-Based Clustering



Abstract:

Motion pattern segmentation for crowded video scenes is an open problem because of the inability of existing approaches to tackle unpredictable crowd behaviour across varied scenes. To address this problem, we propose a Spatio-Angular Density-based Clustering (SADC) approach, which performs motion pattern segmentation by clustering the spatial and angular information obtained from the input trajectories. The k-nearest neighbours of each trajectory and the angular deviation between trajectories constitute the spatial and angular information, respectively. Effective integration of the spatio-angular information with an improvised density-based clustering algorithm makes this approach scene-independent. The performance of most clustering algorithms in the literature is parameter-driven. Choosing a single parameter value for different types of scenes decreases the overall clustering performance. In this article, we have shown that our approach is robust to scene changes using a single threshold, and, through the analysis of parameters across eight commonly occurring crowded scenarios, we point out the range of thresholds that are suitable for each scene category. We evaluate the proposed approach on the benchmarked CUHK dataset. The experimental results show the superior clustering performance and execution speed of the proposed approach when compared to the state-of-the-art over different scene categories.
Published in: IEEE Access ( Volume: 8)
Page(s): 145984 - 145994
Date of Publication: 10 August 2020
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Pedestrians in a crowded scene exhibit interesting patterns of motion over time. Analysing these motion patterns across different types of crowded scenes helps to understand complex crowd behaviours, detect anomalous behaviours and predict unforeseen events which could pose a threat to the safety of the crowd.

Motion pattern segmentation is an automated visual surveillance task that divides a scene into regions of consistent and coherent motion. Grouping the patterns of crowd motion simplifies crowd behaviour understanding/recognition and crowd anomaly detection [1]–[4] (the terms motion pattern segmentation and group detection are used interchangeably in this article). In spite of various efforts [5]–[10], precise motion pattern segmentation remains challenging due to varying crowd dynamics across different types of scenes and intricate interactions between pedestrians within a scene. In general, a crowded scene can be either structured or unstructured. In a structured crowded scene, the direction of crowd motion remains largely the same over time and is easily predictable, whereas in an unstructured crowded scene it changes chaotically and is unpredictable. Hence, a model designed specifically for one type of crowded scene need not be effective on the other. Most existing attempts at motion pattern segmentation become less effective when the type of scene changes, which results in wrongly detected segments.

In this article, we propose a scene-independent motion pattern segmentation approach by using the spatial as well as angular features of a trajectory and subsequently applying an improvised density-based clustering algorithm to group trajectories exhibiting similar motion patterns. The input trajectories generated by using the generalized KLT (gKLT) based key-point tracker [7] are averaged for a given period of time. The average spatial location and average angular orientation of each trajectory constitute its spatial and angular features, respectively. The spatio-angular features are then used to generate spatial information in terms of k-nearest neighbours of each trajectory and angular information in terms of angular deviation between trajectories. Finally, the spatio-angular information is used by the density-based clustering algorithm to group similar motion patterns. For simplicity, we denote our approach as SADC (Spatio Angular Density-based Clustering).

The proposed SADC algorithm is evaluated on the publicly available standard CUHK dataset [8] containing different types of scenes of varying densities and perspectives with ground truth available for motion pattern segmentation. For analysis of performance across different kinds of scenes, we divide the CUHK dataset into meaningful scene categories. Comparison of our results with the state-of-the-art motion pattern segmentation techniques shows that the proposed SADC approach is efficient and computationally faster than other techniques.

The four main contributions of this article are: (i) a faster, efficient and scene-independent method for grouping similar motion patterns, (ii) an averaging-based approach to extract meaningful features of crowd motion, (iii) an improvised density-based clustering algorithm where cluster membership is decided based on closeness of spatial and angular features of a trajectory when compared to other trajectories and (iv) a detailed analysis across different scene categories, which proposes a range of parameter values applicable for each scene category.

SECTION II.

Related Work

There have been numerous attempts to improve the efficiency of motion pattern segmentation in crowded scenes, starting from the Lagrangian Particle Dynamics based approach proposed by Ali and Shah [5]. A concise review of these approaches can be found in [11], which divides them into three categories: flow field model-based, similarity-based, and probability model-based approaches. Most of these approaches either use homogeneous datasets (in terms of scene dynamics) or create ground truth tailored to their proposed method. In this article we use a similarity-based clustering approach, evaluated on the CUHK dataset, which contains diverse real-world scenarios along with ground truth for motion pattern segmentation. Hence, the remainder of this section discusses (i) notable and recent efforts towards efficient motion pattern segmentation/group detection using the CUHK dataset, and (ii) approaches related to similarity-based trajectory clustering.

Zhou et al. [7] initially created the Collective Motion dataset in order to measure collective behaviour within a crowd. In their approach, an algorithm called Collective Merging (CM) was used to find coherent groups of trajectories within the crowd. The CM algorithm models (i) the local behaviour of the crowd using a weighted k-Nearest Neighbour (k-NN) graph and (ii) the global behaviour among non-neighbours by finding similar paths in the k-NN graph. Subsequently, Shao et al. [8] created the CUHK dataset from the Collective Motion dataset by adding new scenes and defining ground truth for the detection of similar motion patterns/groups within a scene (more details about CUHK can be found in Section IV-A). In their work, a Collective Transition (CT) algorithm was proposed for group detection. The CT algorithm improved an earlier approach to group detection based on Coherent Filtering (CF) [6] by modelling the trajectory data using Markov chains. Both the CF and CT algorithms used the concept of Coherent Neighbour Invariance (introduced in [6]) to keep track of the persistent members of a group over time.

In another interesting work, Wang et al. [9] used a thermal-diffusion-based model to generate a strong coherent motion field (known as the thermal energy field) from the noisy optical flow field computed from the input crowd video. Triangulation-based boundary detection, a watershed segmentation algorithm and a two-step graph-based clustering strategy were then applied to the thermal energy field to cluster coherent motion regions. Trojanova et al. [12] adopted a weighted k-NN graph-based clustering approach that used a data-driven threshold rather than the static threshold of [7]. Fan et al. [13] used a Natural Nearest Neighbour (3N) algorithm to adaptively determine the optimal number of nearest neighbours (the k-value), in contrast to k-NN based approaches where the k-value must be determined experimentally. In their work, the 3N algorithm generates a crowd motion network from which similar motion patterns are detected using the concept of coherent neighbour invariance. A different approach, called the Hybrid Social Influence Model (HSIM), was proposed by Ullah et al. [10]; it used a density-independent version of the Social Force Model [14], [15] to model crowd motion and the Communal model [16] to group similar motion patterns. Topic models, which are popular in language processing, have been used by Chen et al. [17] for group detection. In their approach, after dividing the input crowd image into a fixed number of patches (using the Simple Linear Iterative Clustering algorithm), a descriptor is computed for each patch by combining the feature points generated by the gKLT tracker with the orientation distribution of the feature points within the patch. A Latent Dirichlet Allocation-based model is then combined with a Markov Random Field to determine the topics, based on which the features are grouped together. In a recent work, Wang et al. [18] proposed a self-weighted multiview clustering approach that combines an orientation-based graph and a structural context-based graph and applies a tightness-based merging strategy to detect groups within the crowd.

The approaches in [6]–[9] detect groups frame-by-frame, which results in group-switching (or cluster-switching) at each frame. Since motion pattern segmentation is a precursor to crowd behaviour monitoring and anomaly detection, frequent switching of groups/clusters produces inconsistent results for these downstream processes. Our approach is similar to that of He and Liu [19], who used a density-based clustering algorithm to perform motion pattern analysis in crowded scenes. However, there are two differences: (i) to extract and represent motion information from the input crowded video scene, we compute averaged trajectories generated by an efficient and accurate KLT tracker [6], [20], whereas He and Liu use a global optical flow field computed with the traditional Lucas-Kanade approach [21]; (ii) to determine spatial proximity, we employ a nearest neighbour-based strategy rather than a Euclidean-distance-based thresholding approach, which is scene-dependent and fails when the scene perspective changes.

The task of motion pattern segmentation can also be considered a trajectory clustering problem. Trajectory clustering has been extensively researched, and recent surveys on moving-object trajectory clustering methods [22]–[24] point out the challenges faced. The two main challenges are finding a suitable measure to compute the similarity among trajectories with varied properties and finding a suitable algorithm to cluster the trajectories based on their similarities. Various similarity measures [25]–[28] and clustering algorithms [18], [19], [29]–[32] have been used for trajectory-based motion pattern analysis. In our approach, we average the trajectories and use the averaged vector information to perform clustering.

Most of the motion pattern segmentation approaches discussed so far are either applicable only to specific categories of crowded scene, suffer from excessive cluster switching, or are computationally intensive. This article proposes a spatio-angular density-based clustering approach that is efficient and stable across different scene categories, exhibits minimal cluster switching and is computationally faster.

SECTION III.

Spatio-Angular Density-Based Clustering (SADC) for Motion Pattern Segmentation

This article considers scenarios involving sparse to dense crowds in shopping malls, train stations, escalators, streets/sidewalks/markets, crosswalks, road traffic, marathons, military parades, public events and other indoor and outdoor scenes under surveillance for crowd management using static cameras. According to prior research [7], [33], the social behaviour of people walking in groups is such that all members of a group tend to move in the same direction and are spatially close to each other. Based on these observations, a set of spatially close motion patterns is considered to belong to the same group if they move together in the same direction. This work proposes spatio-angular density-based clustering for detecting such coherent groups. The block diagram in Figure 1 illustrates the three phases of the proposed approach: (i) extraction of motion information, where motion information in the form of trajectories (tracks) is extracted from the input video by tracking key-points using a gKLT tracker; (ii) computation of angular and spatial information from the motion features, where the motion features in the form of the average angular orientation and average spatial location (computed from each trajectory) are used to create angular and spatial information matrices, respectively; (iii) improvised density-based clustering, which generates clusters containing similar motion patterns by considering the angular and spatial information. The proposed approach is explained in detail in the following sections.

FIGURE 1. Overview of the proposed approach, where $n$ is the number of trajectories, gKLT Tracker the generalized KLT tracker, $\{t_{n}\}$ the set of trajectories, $\{\overline{\theta}_{t_{n}}\}$ the set of averaged angular orientations, $\{(\overline{x}_{t_{n}}, \overline{y}_{t_{n}})\}$ the set of averaged spatial location co-ordinates, $[A]_{n \times n}$ the angular deviation matrix, and $[S]_{n \times n}$ the nearest neighbour matrix.

A. Extraction of Motion Information

Given an input video, the first step is to capture the motion of the crowd. This is usually done by tracking a set of key-point features across the frames of the video, which results in the generation of trajectories (or tracks) [34]. The problem of extracting motion information thus becomes a tracking problem. The proposed approach uses the generalized KLT tracker (the gKLT tracker [7], derived from the traditional KLT tracker [20]) because of its tracking accuracy and computational efficiency. Each generated trajectory is a sequence of 2-D spatial co-ordinates $\{(x_{1},y_{1}), (x_{2},y_{2}),\ldots,(x_{m},y_{m})\}$, where $m$ is the trajectory length. Some tracked key-points may lie in the background or may be generated by illumination variations, resulting in short, static and noisy trajectories. In our approach, such trajectories are filtered out using the method of Shao et al. [8], which improves the overall accuracy of motion pattern segmentation. The following subsection explains the next phase, where the refined trajectories are used to extract motion features from which angular and spatial information is generated.
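As an illustration, the sketch below extracts key-point trajectories with OpenCV's standard pyramidal Lucas-Kanade (KLT) tracker. It is a simplified stand-in for the gKLT tracker of [7], not the authors' implementation; the function name, parameter values and fixed track length are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_trajectories(video_path, max_corners=3000, track_len=30):
    """Track Shi-Tomasi corners with pyramidal Lucas-Kanade (a stand-in for
    the gKLT tracker) and return one (x, y) trajectory per key-point."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise IOError("Cannot read video: %s" % video_path)
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    tracks = [[tuple(p.ravel())] for p in pts]
    for _ in range(track_len - 1):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        for track, p, st in zip(tracks, new_pts, status):
            if st:  # point was tracked successfully in this frame
                track.append(tuple(p.ravel()))
        pts, prev_gray = new_pts, gray
    cap.release()
    return [np.asarray(t, dtype=np.float32) for t in tracks]
```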

B. Computation of Angular and Spatial Information From the Motion Features

Two types of motion features are extracted from the refined trajectories, namely the angular orientation and the spatial location. The angular orientation feature gives the direction of motion of the crowd. If we consider only the directions of the motion patterns P and Q (shown in Figure 2(a) and 2(b)) as a feature vector, they are clustered into the same set of motion trajectories, whereas in reality they should form different sets. Hence, this work introduces a second motion feature, the spatial location, which enables the identification of trajectories that are spatially distant from each other.

FIGURE 2. (a) A frame from a crowded video scene, (b) two motion patterns P & Q, (c) P & Q represented in space with arrows pointing in the direction of motion, (d) the directions of the motion patterns P & Q represented as angles $\theta_{P}$ and $\theta_{Q}$, respectively, in the polar co-ordinate system (best viewed in color).

Since the lengths of the trajectories are not the same, computing pair-wise similarities between them requires normalization of the trajectory lengths. Averaging the trajectory data is not only an effective way to normalize the trajectory lengths but also captures the behaviour of the trajectories over time. The refined trajectory data is averaged before computing the two motion features, as suggested in [12]. If the data of a trajectory $t_{i}$ is represented as $t_{i}= \{(x_{1},y_{1}), (x_{2},y_{2}),\ldots,(x_{m},y_{m})\}$, where $m$ is the length of the trajectory, its average spatial location feature $\bar{s}_{t_{i}}$ is computed as
\begin{equation*} \bar{s}_{t_{i}}= \frac{1}{m} \sum_{k=1}^{m}\left(x_{k_{t_{i}}},y_{k_{t_{i}}}\right) \tag{1}\end{equation*}

To obtain the average angular orientation feature, the average displacement $\overline{V}_{t_{i}}$ of a trajectory is computed first as
\begin{equation*} \overline{V}_{t_{i}}= \frac{1}{m-1} \sum_{k=2}^{m} \left(\left(x_{k_{t_{i}}} - x_{(k-1)_{t_{i}}}\right),\left(y_{k_{t_{i}}} - y_{(k-1)_{t_{i}}}\right)\right) \tag{2}\end{equation*}
where the average displacement vector $\overline{V} = [\overline{u},\overline{v}]$ consists of an $x$-component $\overline{u}$ and a $y$-component $\overline{v}$. The average angular orientation feature $\overline{\theta}_{t_{i}}$ is then computed as
\begin{align*} \overline{\theta}_{t_{i}} = \begin{cases} \cos^{-1}\left(\dfrac{\overline{V}\cdot\hat{V}}{\|\overline{V}\|\,\|\hat{V}\|}\right), & \overline{v}>0 \\ 2\pi - \cos^{-1}\left(\dfrac{\overline{V}\cdot\hat{V}}{\|\overline{V}\|\,\|\hat{V}\|}\right), & \overline{u}\neq 0,\ \overline{v}\leq 0\\ 0, & \overline{u}=\overline{v}=0 \end{cases} \tag{3}\end{align*}
where $\hat{V}=[0,1]$ is the reference unit vector. The value of $\overline{\theta}_{t_{i}}\in [0,2\pi)$ varies according to the components $\overline{u}$ and $\overline{v}$ and is equal to zero when there is no motion. Computing the average angular orientation in this manner, i.e., from the averaged displacement vector rather than by averaging the orientations of the individual displacement vectors, implicitly encodes the magnitude information as well. In the case of abrupt changes in the direction of a trajectory, computing the average angular orientation over the entire path leads to erroneous motion features. Using a fixed set of frames (rather than all frames containing a trajectory) for computing both the angular and spatial features reduces this error. The number of frames considered for averaging is determined empirically.
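To make the feature computation concrete, the following sketch computes Eq. (1) and the averaged displacement of Eq. (2) for a single trajectory; for the orientation it uses np.arctan2 to map the averaged displacement to an angle in $[0, 2\pi)$, an equivalent but more compact alternative to evaluating Eq. (3) case by case. The helper name is illustrative.

```python
import numpy as np

def average_features(track):
    """Return the averaged spatial location (Eq. 1) and the orientation of the
    averaged displacement vector (Eqs. 2-3) of one trajectory."""
    track = np.asarray(track, dtype=float)        # shape (m, 2): one (x, y) per frame
    s_bar = track.mean(axis=0)                    # Eq. 1: average spatial location
    v_bar = np.diff(track, axis=0).mean(axis=0)   # Eq. 2: average displacement [u, v]
    if np.allclose(v_bar, 0.0):
        theta_bar = 0.0                           # no motion
    else:
        theta_bar = np.arctan2(v_bar[1], v_bar[0]) % (2.0 * np.pi)
    return s_bar, theta_bar
```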

The obtained motion features $\bar{s}_{t_{i}}$ and $\overline{\theta}_{t_{i}}$ are then used to generate matrices containing the angular and spatial information, respectively. The angular information matrix $[A]_{n \times n}$ contains the pair-wise angular deviation between the average angular features of two trajectories $t_{i}$ and $t_{j}$. The distance function $d_{angular}$ used to compute the angular deviation between the average angular features $\overline{\theta}_{t_{i}}$ and $\overline{\theta}_{t_{j}}$ is defined as
\begin{equation*} d_{angular}(\overline{\theta}_{t_{i}},\overline{\theta}_{t_{j}}) = \min\left(|\overline{\theta}_{t_{i}}-\overline{\theta}_{t_{j}}|,\ 2\pi - |\overline{\theta}_{t_{i}} - \overline{\theta}_{t_{j}}|\right) \tag{4}\end{equation*}
where $d_{angular} \in [0,\pi]$ overcomes the issues of computing distance (angular deviation) in the circular domain.
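Eq. (4) is straightforward to implement; the helper below (illustrative name) returns the circular deviation in radians.

```python
import numpy as np

def d_angular(theta_i, theta_j):
    """Circular angular deviation between two orientations (Eq. 4), in [0, pi]."""
    diff = abs(theta_i - theta_j)
    return min(diff, 2.0 * np.pi - diff)
```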

The spatial information matrix, on the other hand, conveys whether trajectories are spatially close to each other and is constructed by finding the k-nearest neighbours of each trajectory. The pair-wise Euclidean distance between the average spatial location features is used to find the nearest neighbours. The spatial information matrix is defined as
\begin{align*} S(i,j) = \begin{cases} 1, & \text{if } t_{j} \in N(t_{i})\\ 0, & \text{otherwise} \end{cases} \tag{5}\end{align*}
where $N(t_{i})$ is the set of k-nearest neighbours of the trajectory $t_{i}$. The value of $k$ depends on the number of trajectories $n$ in a scene and is computed as
\begin{equation*} k = \alpha \cdot n \tag{6}\end{equation*}
where $\alpha \in [0, 1]$ specifies what fraction of the total number of trajectories is used to determine the value of $k$. Using a density-based clustering algorithm, the angular and spatial information matrices are then combined to form clusters of similar trajectories.
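The two information matrices can then be assembled as sketched below. The sketch assumes the averaged features from the previous step and treats $\alpha$ as a fraction of the number of trajectories (the experiments in Section IV-B quote $\alpha = 15$, presumably a percentage); the helper name is illustrative.

```python
import numpy as np

def build_information_matrices(s_bars, theta_bars, alpha=0.15):
    """Build the angular deviation matrix A (Eq. 4) and the k-NN indicator
    matrix S (Eqs. 5-6) from the averaged features of n trajectories."""
    s = np.asarray(s_bars, dtype=float)          # (n, 2) averaged spatial locations
    theta = np.asarray(theta_bars, dtype=float)  # (n,) averaged orientations
    n = len(theta)

    diff = np.abs(theta[:, None] - theta[None, :])
    A = np.minimum(diff, 2.0 * np.pi - diff)     # pairwise angular deviation (Eq. 4)

    k = max(1, int(round(alpha * n)))            # Eq. 6
    dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a trajectory is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]         # indices of the k nearest neighbours
    S = np.zeros((n, n), dtype=int)
    S[np.repeat(np.arange(n), k), nn.ravel()] = 1  # Eq. 5
    return A, S
```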

C. Improvised Density-Based Clustering

The obtained angular and spatial information describes how the motion patterns are spread across the scene. Finding similar motion patterns and grouping them using this information becomes a clustering problem. Crowded scenes can contain motion patterns of arbitrary shape and size as well as motion patterns that do not belong to any group (noisy data). A density-based clustering algorithm is a natural choice in such cases. In the proposed approach, we use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. As proposed by Ester et al. [35], DBSCAN has two parameters: (i) the $Eps$-neighbourhood and (ii) $MinPts$. In our approach, the $Eps$-neighbourhood value, which is essentially a threshold, determines the maximum angular deviation allowed between the average angular features of two trajectories. We improvise the $Eps$-neighbourhood criterion by using an angular threshold (the $Eps$-neighbourhood value is denoted $\lambda_{\theta}$ in our work) and then combine it with the spatial information to decide cluster membership. The inclusion of spatial information is crucial because it removes distant trajectories whose angular deviation is below the angular threshold $\lambda_{\theta}$. Furthermore, the inclusion of spatial information makes the proposed motion pattern segmentation algorithm resistant to scene perspective changes. Therefore, trajectories that have similar orientation (defined by the $\lambda_{\theta}$ parameter) and are spatially close to each other (defined by the $\alpha$ parameter) belong to the same cluster. This is consistent with our definition in Section III. The proposed approach, along with the improvised part of DBSCAN, is shown in Algorithm 1.

Algorithm 1 Spatio-Angular Density-Based Clustering (SADC)

Input: A set of $n$ trajectories $\{t_{n}\}$ from a crowded scene.

Output: $m$ clusters, where each cluster contains similar motion patterns.

1: for $i = 1$ to $n$ do: compute $\overline{\theta}_{t_{i}}$ and $\overline{s}_{t_{i}}$ using Eq. 3 and Eq. 1, respectively.

2: Compute the matrices $[A]_{n \times n}$ from $\{\overline{\theta}_{t_{n}}\}$ and $[S]_{n \times n}$ from $\{\overline{s}_{t_{n}}\}$ using Eq. 4 and Eqs. 5-6, respectively.

3: Perform the improvised DBSCAN to obtain $m$ clusters of similar motion features:

for each pair $(i,j)$ in $A$ and $S$ do

if $A(i,j) \leq \lambda_{\theta}$ and $S(i,j)==1$ then

Assign $t_{j}$ to $t_{i}$'s cluster.

end if

end for
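A minimal sketch of the pairwise rule in Algorithm 1 is given below. It uses a union-find structure to merge trajectories that satisfy both conditions and, as a simplification of the DBSCAN-style formulation, labels clusters smaller than MinPts as noise (-1); the exact core-point and noise handling of the authors' improvised DBSCAN may differ.

```python
import numpy as np

def sadc_cluster(A, S, lambda_theta=np.deg2rad(20.0), min_pts=5):
    """Group trajectories whose angular deviation is at most lambda_theta AND
    that are among each other's k nearest neighbours (pairwise rule of Algorithm 1)."""
    n = A.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if S[i, j] == 1 and A[i, j] <= lambda_theta:
                parent[find(i)] = find(j)   # assign t_j to t_i's cluster

    roots = np.array([find(i) for i in range(n)])
    labels = -np.ones(n, dtype=int)          # -1 marks noise
    next_label = 0
    for r in np.unique(roots):
        members = np.where(roots == r)[0]
        if len(members) >= min_pts:          # simplified MinPts handling
            labels[members] = next_label
            next_label += 1
    return labels
```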

SECTION IV.

Experimental Results and Discussion

A. Dataset

The CUHK dataset [8] is the only publicly available standard dataset with ground truth for motion pattern segmentation. The dataset contains a total of 474 videos captured from 215 scenes, of which 300 videos are used for motion pattern segmentation. The proposed approach is evaluated on these 300 videos captured across different types of scenes. For the purpose of analysis, the videos are divided (manually) into categories based on (i) the property of crowd movement: (a) structured, (b) unstructured; (ii) the location of the scene: (a) indoor, (b) outdoor; and (iii) the type of the scene: (a) cross walk, (b) escalator, (c) mass movement, (d) market, (e) public walkway, (f) shopping mall, (g) station, (h) street. Figures 3 and 4 show example scenes and statistics, respectively, for this categorization.

FIGURE 3. Example scenes for (a) cross walk, (b) escalator, (c) market, (d) mass movement, (e) public walkway, (f) shopping mall, (g) station, and (h) street. Scenes (d) and (f) are good examples of a structured outdoor scene and an unstructured indoor scene, respectively.

FIGURE 4. CUHK dataset categorization statistics. It can be observed that most of the mass movement scenes are structured; otherwise, the majority of scenes in the dataset are unstructured.

B. Experimental Set-Up

The experiments are performed over all 300 scenes of the CUHK dataset on a personal computer with an Intel Core i5-8400 @ 2.80 GHz processor and 8 GB RAM. First, we replicate the experimental set-up used by Shao et al. [8] to extract motion information. The gKLT tracker is initialized with a set of 3000 key-points, which are tracked across the frames of the video. Short, static and noisy trajectories are filtered out by discarding those whose length is less than 10 frames and those with zero-displacement vectors for more than half of their duration. Second, for the purpose of grouping, trajectory data from multiple frames (30 frames) is considered in our approach; by doing so, we aim to capture more of the history of the path followed by each trajectory. Finally, for the purpose of clustering, the angular threshold and the $\alpha$-value that determines the spatial threshold are chosen empirically ($\lambda_{\theta} = 20^{\circ}$, $\alpha = 15$), and the minimum number of points required to form a cluster is set to $MinPts = 5$.
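The filtering criteria above can be expressed as a small helper; this is a sketch of the stated thresholds (length below 10 frames, zero displacement for more than half the duration), not the exact filtering code of [8].

```python
import numpy as np

def filter_trajectories(tracks, min_len=10):
    """Drop short, static and noisy tracks as described in Section IV-B."""
    kept = []
    for track in tracks:
        track = np.asarray(track, dtype=float)
        if len(track) < min_len:
            continue                                    # too short
        steps = np.diff(track, axis=0)                  # per-frame displacement vectors
        zero_steps = np.sum(np.all(steps == 0, axis=1))
        if zero_steps > 0.5 * len(steps):
            continue                                    # static for most of its duration
        kept.append(track)
    return kept
```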

C. Results

To demonstrate the effectiveness of the proposed approach, its performance is compared with the state-of-the-art approaches CM [7] and CT [8], whose binaries are publicly available. Note that we have not considered CF for comparison because the CT algorithm is an improvement over CF. We keep the original parameter settings of CM and CT. Since both approaches generate clusters for each frame, for evaluation and comparison we choose the clusters generated at the particular frame (for each scene) defined in the CUHK dataset and align the proposed approach accordingly.

1) Qualitative Results

The category-wise qualitative results are shown in Figure 5. It can be observed that the proposed SADC algorithm performs well in all categories of the CUHK dataset, even for complex scenes (Figure 5, columns 5-8). In fact, most of the clusters generated by the SADC algorithm are closer to the ground-truth clusters than those of CM [7] and CT [8]. The CM algorithm uses a graph-based approach that relies on finding similar weighted paths and a connected-component-based algorithm that clusters similar nodes of the graph based on a threshold. While the path-based approach is effective, the less effective connected-component-based clustering does not generate accurate clusters when the scene type changes. The CM algorithm is also dependent on the $K$-parameter value, which is another reason for its inconsistent results. The CT algorithm improves the CF algorithm using a Markov-chain-based approach. Since the CT algorithm is built on top of the CF algorithm, it is partly dependent on the output of the CF algorithm, due to which some trajectory points belonging to a particular cluster in the ground truth are treated as noise by the CT algorithm: the trajectory key-points labelled as noise by the CF algorithm (and therefore not included in the clustering result) are not considered when building the Markov chain and remain noise (red points). Furthermore, CM, CF and CT use a frame-by-frame approach to generate clusters, so for each pair of frames only the key-point vectors at that instant are considered for clustering. As a result, the cluster members keep changing and the results are inconsistent across the duration of the video, which is another reason for the noise key-points in Figure 5. In contrast, the proposed SADC uses an averaging-based approach that captures the history of each trajectory (over a fixed set of 30 frames) in terms of averaged angular and spatial features. This reduces the number of noise key-points (as seen in Figure 5) and leads to consistent clustering with minimal cluster switching.

FIGURE 5. Qualitative comparison between the proposed approach (SADC), CM [7] and CT [8] across the eight categories of the CUHK dataset. Trajectory points with similar motion patterns have the same color; red represents noise points (best viewed on screen).

2) Quantitative Results

For quantitative evaluation, motion pattern segmentation is treated as a clustering problem and the performance is evaluated using three widely used external cluster evaluation measures: Purity [36], Normalized Mutual Information (NMI) [37] and Rand Index (RI) [38]. All the measures lie in the range [0, 1], where a higher value indicates better clustering performance. The comparison results in Table 1 show the average performance over the 300 scenes of the CUHK dataset and indicate that the proposed approach outperforms the CM and CT based approaches.
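The three measures can be computed with scikit-learn as sketched below (rand_score requires scikit-learn >= 0.24; Purity is derived from the contingency matrix). This is an illustrative evaluation helper, not the authors' evaluation code.

```python
from sklearn.metrics import normalized_mutual_info_score, rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    """Fraction of points assigned to the majority ground-truth class of their cluster."""
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

def evaluate_clustering(labels_true, labels_pred):
    """Return the three external measures used in this article."""
    return {
        "Purity": purity_score(labels_true, labels_pred),
        "NMI": normalized_mutual_info_score(labels_true, labels_pred),
        "RI": rand_score(labels_true, labels_pred),
    }
```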

TABLE 1. Quantitative Comparison of the Proposed Approach With the State-of-the-Art Approaches (Averaged Over 300 Scenes)

Notice that the NMI value for CT is lower than that of CM, although CT performs better than CM on the other metrics. On inspection, we found that the NMI metric always yields NMI = 0 if one of the two clustering assignments (ground truth or clustering result) has a single cluster while the other has more than one cluster, which is undesirable. This is due to an underlying mathematical issue in the computation of mutual information, the discussion of which is beyond the scope of this article. Of the 300 scenes in the CUHK dataset, 25% have a one-cluster ground truth. This is one of the main reasons why other works report lower NMI values when evaluating on the CUHK dataset.
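The degenerate behaviour is easy to reproduce: when the ground truth consists of a single cluster, its entropy is zero, so the mutual information (and hence the NMI) is zero regardless of how reasonable the predicted clustering is.

```python
from sklearn.metrics import normalized_mutual_info_score

# One-cluster ground truth vs. a two-cluster prediction: NMI collapses to 0.0
# even though the prediction merely splits the single ground-truth group in two.
print(normalized_mutual_info_score([0, 0, 0, 0], [0, 0, 1, 1]))  # -> 0.0
```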

Figures 6 and 7 show the comparison results over the different scene categories defined in Section IV-A. For a fair evaluation, we consider an equal number of scenes for each scene category by randomly sampling the scenes. The proposed SADC algorithm not only performs best among the three algorithms but also performs above par in all scene categories. This is because (i) SADC considers the history of each trajectory over multiple frames, which plays a crucial role during clustering, and (ii) SADC computes the angular deviation between the averaged trajectory vectors, which is well complemented by the k-nearest neighbour method; both are effective inputs for efficient density-based clustering across different scene categories. The NMI issue discussed earlier is clearly evident in Figures 6 and 7, where the mass movement scene category has an average NMI value close to zero for CM and CT and a low value for SADC (in this case, the RI metric can be considered instead). The mass movement category involves scenes such as marathons, military parades and protest rallies, whose ground truth, in most cases, contains only one motion pattern.

FIGURE 6. Quantitative comparison result based on the type of scene (best viewed in color).

FIGURE 7. Quantitative comparison result based on, left: property of crowd movement and, right: scene location (best viewed in color).

D. Parameter Analysis

As discussed in Sections III-B and III-C, the proposed approach depends on two parameters, namely $\lambda_{\theta}$ (acting on the angular information) and $\alpha$ (acting on the spatial information). Figures 8 and 9 demonstrate the effect of varying these parameters. Figure 8 shows the effect of combining the spatial information with the angular information. Of the two profiles in Figure 8, the one generated using both spatial and angular information is better than the one using angular information only (discussed briefly in Figure 2). This confirms the need to combine spatial information with angular information. It can also be observed that the performance of the proposed algorithm gradually decreases as the $\lambda_{\theta}$ value increases, which means that a larger $\lambda_{\theta}$ results in dissimilar motion patterns being grouped together.

FIGURE 8. Average NMI, Purity and Rand Index profiles (from left to right) with respect to varying $\lambda_{\theta}$ values for the 300 scenes of the CUHK dataset (best viewed in color).

FIGURE 9. Average NMI, Purity and Rand Index profiles (from left to right) with respect to varying $\lambda_{\theta}$ and $\alpha$ values for the 300 scenes of the CUHK dataset (best viewed in color).

Figure 9 shows the performance profiles of the SADC algorithm for various combinations of $\lambda_{\theta}$ and $\alpha$. The similar $\lambda_{\theta}$ profiles for varying values of $\alpha$ indicate the dominance of the angular component over the spatial component in clustering decisions. Although the profiles look similar, the one with $\alpha = 15$ is observed to be more stable than the others. It can also be noticed that $\lambda_{\theta}$ values between 10° and 30° perform better on average.

Further, in Figures 10 and 11 we report the performance profiles for each scene category by keeping $\alpha$ constant ($\alpha = 15$) and varying the $\lambda_{\theta}$ value. For this purpose, we consider all the scenes in each category without any sampling. Apart from the NMI profiles for the mass movement, structured and outdoor scenes (which contain scenes with one-cluster ground truth), all the other profiles show the actual trend. The experimental results in Section IV-C were obtained by applying a single threshold ($\lambda_{\theta} = 20^{\circ}$, $\alpha = 15$) to all the scenes of the CUHK dataset. Even though the results obtained with a single threshold are well above par, in real-world scenes the threshold should depend on the type of scene. Therefore, with the help of the profiles in Figures 10 and 11, we point out suitable values of $\lambda_{\theta}$ for different scene categories ($\lambda_{\theta}$ plays a larger role than $\alpha$ in deciding cluster membership). The following inferences can be made about the choice of $\lambda_{\theta}$ when the SADC algorithm is used to analyse crowded scenes: (i) for most cross walk and some mass movement scenes, $\lambda_{\theta}$ between 10° and 45° is suitable; however, for most single-cluster mass movement scenes, which are also structured and outdoor, the angular deviation between trajectories is very low, so a $\lambda_{\theta}$ value greater than 20° is preferable because a lower value could split similar clusters; (ii) for escalator scenes (indoor), $\lambda_{\theta}$ between 10° and 35° is desirable, since increasing $\lambda_{\theta}$ results in non-escalator trajectories being included in the escalator's cluster; (iii) for other complex and unstructured scenes such as market, public walkway, shopping mall (indoor), station (indoor) and street scenes (where the number of motion patterns/groups is high), a smaller $\lambda_{\theta}$ enables SADC to capture more groups, so a value between 10° and 20° is desirable.

FIGURE 10. Average NMI, Purity and Rand Index profiles (from left to right) with respect to varying $\lambda_{\theta}$ values at $\alpha=15$, for scene categorization based on the type of scene (best viewed on screen).

FIGURE 11. Average NMI, Purity and Rand Index profiles (from left to right) with respect to varying $\lambda_{\theta}$ values at $\alpha=15$, for scene categorization based on the property of crowd movement and scene location (best viewed in color).
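For convenience, the recommended $\lambda_{\theta}$ ranges can be summarised as a lookup table; the category keys and the encoded ranges below are an illustrative condensation of the analysis above, not output of the algorithm.

```python
# Suggested lambda_theta ranges in degrees (alpha fixed at 15 throughout).
LAMBDA_THETA_RANGES = {
    "cross_walk":     (10, 45),
    "mass_movement":  (10, 45),  # prefer > 20 deg for single-cluster structured/outdoor scenes
    "escalator":      (10, 35),
    "market":         (10, 20),
    "public_walkway": (10, 20),
    "shopping_mall":  (10, 20),
    "station":        (10, 20),
    "street":         (10, 20),
}
```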

E. Computational Complexity

The time taken to generate the motion pattern segmentation results plays an important role, especially in real-time scenarios. To measure it, we use the settings mentioned in Section IV-B and report the average time taken (in seconds) to generate the clustering result for 30 frames over all 300 scenes of the CUHK dataset. Note that we do not include the time taken to generate the trajectories, as it is the same for all three methods (CM, CT, SADC). Both CM and CT need to iterate frame-by-frame to perform clustering, whereas the proposed SADC algorithm generates clusters once for every 30 frames. Therefore, the time taken by SADC to generate clusters is significantly lower than that of the other two methods, as substantiated in Table 2.

TABLE 2. Average Time Taken (in Seconds) to Generate the Clustering Result for 30 Frames

SECTION V.

Conclusion and Future Work

This article proposed a spatio-angular density-based clustering approach to group similar motion patterns. Spatial and angular features are obtained by averaging the trajectories over a fixed set of frames to better capture the history of the trajectories. These two features are then used in an improvised density-based clustering algorithm, where cluster membership is decided on the basis of two parameters, namely the angular deviation threshold and the spatial threshold. Qualitative and quantitative analysis over a range of parameters for different scene categories has shown the robustness of the proposed algorithm.

In future work, we plan to integrate a crowd anomaly detection framework with our proposed approach. Furthermore, we intend to explore the inclusion of additional features to tackle more complex scenarios.
