Spatial Clustering Approach for Vessel Path Identification

This paper addresses the challenge of identifying the paths for vessels with operating routes of repetitive paths, partially repetitive paths, and new paths. We propose a spatial clustering approach for labeling the vessel paths by using only position information. We develop a path clustering framework employing two methods: a distance-based path modeling and a likelihood estimation method. The former enhances the accuracy of path clustering through the integration of unsupervised machine learning techniques, while the latter focuses on likelihood-based path modeling and introduces segmentation for a more detailed analysis. The result findings highlight the superior performance and efficiency of the developed approach, as both methods for clustering vessel paths into five clusters achieve a perfect F1-score. The approach aims to offer valuable insights for route planning, ultimately contributing to improving safety and efficiency in maritime transportation.


A. Background and Motivation
T HE maritime transportation is crucial for global trade, generating extensive vessel trajectory data that reveals intricate spatial and temporal navigation patterns.Understanding these patterns is vital for effective maritime traffic surveillance and management [1].
A trajectory refers to a sequence of consecutive geographical points, each representing a specific location at a given timestamp [2].Trajectories, observed in various scenarios such as pedestrian movements, vehicular routes, and natural events like wildlife migrations or hurricanes, involve time-evolving position data.Trajectory mining aims to uncover significant patterns within datasets, enabling tasks like path classification, anomaly detection for accidents or traffic congestions, surveillance for suspicious activities, and prediction of vessel trajectories in different landscapes [3].
It is crucial to distinguish between trajectory and path when studying movement data.The term trajectory typically denotes the movement of an object over time, while path refers to the specific route taken by the object.For instance, if an object travels from origin to destination, its trajectory is the sequence of geographical points it passes through, while its path is the specific route taken, such as a sidewalk, street, highway, lane, railway, or waterway.
Additionally, the term route is synonymous with path or trajectory as long as it shares the same origin and destination.Conversely, any difference in either the origin or destination represents another route.
Path clustering, a versatile technique, involves grouping paths into clusters based on their similarity.This approach has found its niche in a wide range of practical applications.In the realm of navigation, path identification empowers systems to generate clear and detailed instructions for users seeking their way.Traffic analysis benefits from path clustering as it facilitates the identification of diverse traffic patterns, such as the smooth flow of traffic on highways and the congestion often encountered on city streets.Path identification proves equally valuable in route planning, enabling the optimization of routes for transportation systems, including public and maritime transportation services [4].
In the scope of the maritime industry, path identification from Automatic Identification System (AIS) data is a challenging task due to the high spatial freedom and, especially in coastal areas, the high frequency of ship's navigation maneuvers.Thus, it is imperative to develop a path identification tool that integrates with route planning systems for improving maritime safety and optimizing vessel routing.As data-driven approaches from AIS data continue to grow and evolve, path clustering will undoubtedly play an increasingly important role in understanding vessel behavior and supporting decisionmaking in maritime transportation [5].

B. Aims and Contributions
This paper aims to address the challenge of identifying vessel paths in scenarios characterized by repetitive, semirepetitive, and novel operations.In general, the aims and contributions of the proposed approach in this paper can be outlined as follows: • The proposed clustering approach of vessel paths requires only position information, specifically longitude and latitude.
• The clustering approach has a proven added value for clustering challenging unseen or unknown paths.
• The approach is robust and interpretable by applying a similarity measure that reduces the influence of noise or outliers and offers a clear interpretation of path clustering.
• The approach has a customizable parameter to determine the number of path classes, thereby enhancing the flex-ibility and adaptability of the framework and allowing users to tailor it to their specific needs.
• The approach also includes a method to study and analyze the patterns within specific segments of a path.
• It is a data-driven solution that can be used as a valuable asset for informed decision-making in route planning and optimization, traffic management, and resource allocation.The rest of this paper is organized as follows.Section II reviews the related work on vessel path identification.Section III describes our proposed spatial clustering approach in detail.The real-world vessel data of our case study is described in Section IV.Section V presents the results of our experimental evaluation.Section VI concludes the paper and discusses future work.

II. RELATED WORK
Path clustering can be done using a variety of different methods [6].We will explore the related works of these various methods.
Clustering is gaining popularity for route extraction.Machine learning (ML) has recently been applied extensively for vessel path identification by learning patterns from historical data.The study presented in [7] introduced a framework, Traffic Route Extraction and Anomaly Detection (TREAD), which utilizes unsupervised learning for maritime route extraction.The primary emphasis is on anomaly detection and route prediction, highlighting the crucial role of AIS data in enhancing maritime situational awareness.The work specifically addresses challenges related to intermittency and persistence in AIS data.Another method of route extraction was proposed in [8], transforming ship trajectories into ship trip semantic objects (STSO) and utilizing graph theory for route extraction.The method proves robustness in extracting traffic routes for merchant ships but may have limitations for vessels with frequent navigation behavior changes, such as fishing vessels.The approach in [9], on the other hand, adopts a dynamic time warping (DTW) distance as a similarity measure and considers vessel course changes to analyze its trajectories.Experiments demonstrated its high accuracy in distinguishing and detecting similar vessel trajectories, outperforming existing methods in accuracy and cluster degree evaluation.Moreover, [10] presents a machine learning framework for maritime vessel trajectory analysis, incorporating clustering, classification, and outlier detection.It employs piecewise linear segmentation for compression and alignment kernels to integrate geographical domain knowledge, enhancing task performance.Results show reduced computation time without compromising accuracy.
Capobianco et al. [11] proposed a deep learning approach using recurrent neural networks, employing a Bidirectional Long Short-Term Memory (BiLSTM) layer as an encoder and a Unidirectional Long Short-Term Memory (LSTM) layer as a decoder, for vessel trajectory prediction.Their model outperforms baseline approaches, showcasing the effectiveness of sequence-to-sequence neural networks.In their study, Li et al. [12] present an AIS data-based machine learning method for feature extraction and unsupervised route planning for Maritime Autonomous Surface Ships (MASS).The method uses Automatic and Adaptive Dynamic Time Warping (AADTW), Spectral Clustering with Affinity Feature (SCAF), and a route optimization algorithm based on dynamic programming to extract features, obtain movement patterns, and plan routes.The proposed method outperforms existing methods by considering the impact of hidden factors and providing different routes for different types of MASS.The work in [13] systematically analyzes the performance of twelve ship trajectory prediction methods, including classical machine learning and emerging deep learning techniques.It compares twelve methods across three AIS datasets, representing different maritime traffic scenarios, and evaluates their effectiveness based on six indexes.The study concludes that traditional machine learning-based trajectory prediction methods struggle to meet the rising demands for accuracy and real-time performance, leading to increased interest in and promising results from deep learningbased approaches.
A maritime traffic route extraction approach based on multidimensional density-based spatial clustering of applications with noise (MD-DBSCAN) was developed in [14].The approach incorporates trajectory compression, similarity measures, and extraction of ship trajectory clusters.The approach demonstrates effectiveness in noise reduction and route extraction.The authors in [15] proposed a trajectory clustering method based on Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) and Hausdorff distance to generate a similarity matrix.The method adapts to shape characteristics and exhibits good clustering scalability and improved clustering results compared to DBSCAN, kmeans, and spectral clustering algorithms.
Eljabu et al. proposed spatial clustering methods (SPT-CLUST and SPTCLUST-II) in [16] and [17] respectively, for maritime traffic routes extraction from AIS data.The approach consists of data preprocessing, pathfinding, and route extraction without using traditional clustering algorithms.It achieved high F1-scores, 97% and 99%, for tankers and cargo maritime traffic routes.
The study in [18] enhanced the DBSCAN method by integrating the Mahalanobis distance metric for vessel behavior modeling.The proposed methodology includes clustering historical AIS data and detecting anomalies.The study showcases applicability to diverse water regions, contributing to situational awareness, collision prevention, and route planning.
Farahnakian et al. [19] conducted a comprehensive examination of clustering-based techniques, including k-means, DBSCAN, Affinity Propagation (AP), and the Gaussian Mixtures Model (GMM), for detecting abnormal vessel behaviors from AIS data.Results indicate that k-means is particularly effective in detecting dark ships and spiral vessel movements, which is crucial for enhancing maritime safety.Furthermore, the study [20] proposed two methods for trajectory outlier detection, with the first utilizing DBSCAN clustering and Hausdorff distance, and the second employing Support Vector Machine (SVM) classifier and the Generalized Sequence Pattern algorithm.Both models outperform the baselines, with the SVM approach demonstrating superior performance in the identification of traffic patterns and outliers.
The research paper [21] offers a detailed survey of visual an-alytics for vessel trajectory data.The authors discuss a variety of methods, including map-based visualization, timeline-based visualization, and interactive visualization.
The paper [22] comprehensively reviews various approaches for vessel trajectory predicting, including clustering algorithms and machine learning algorithms.It also discusses the challenges and future research directions, such as the uncertainty in the data, the dynamic environment, and the computational complexity.
Among the identified challenges, which are subjects of ongoing research and require additional attention, three are worthy of specific mention: navigating dynamic maritime environments poses a substantial challenge in accurately identifying vessel paths (I); ensuring stability, explainability, and managing the computational cost of the model add further complexity (II); finally, addressing the need for flexibility, scalability, and practical applicability is crucial for a comprehensive solution in the field of vessel path identification (III).Motivated by these challenges, we aim to develop a framework that focuses on vessel path identification and potentially tackling such challenges faced in maritime transportation.

III. METHODOLOGY
This section covers the theoretical background and description of our proposed framework's underlying methodology.The framework of vessel path identification is depicted in Figure 1.
1) Problem Formulation: The equations (1-3) serve as a mathematical representation to describe the clustering of vessel paths.It is worth mentioning that the clustering process is conducted sequentially, point by point, while the labeling of path classes is performed for the entire voyage.Therefore, each voyage has a single label of path class.

V oyages ∈ P ath Classes
(1) where: V oyages: a collection of time series, each representing a voyage of the vessel taken following a given path, with a predicted clustered class.ts j : a time series corresponding to voyage j, i.e., a sequence of n data points, where each data point p represents the vessel position and is defined by a pair of coordinates, namely latitude and longitude.n: the number of time steps (duration) of each voyage, which can differ from one voyage to another.j: a total number of voyages.P ath Class Set: a set of k classes into which the path of voyages are being clustered.2) Distance-Based Method: The similarity between two paths is measured by the average nearest neighbor distance (ANND), as shown in Eq. ( 4). where: AN N D(i, j): is the average nearest neighbor distance between path i and path j, present in the distance matrix at row i and column j.It is a symmetric, meaning that AN N D(i, j) is the same as AN N D(j, i) Distance(P k i , N N (P k j )) : The distance between the k th point in path i, denoted as P k i , and its corresponding nearest neighbor point in path j, indicated as N N (P k j ).n i is the total number of points in path i.The measure Distance is an Euclidean distance.However, for longer curved routes, Haversine or Great-circle distance would be more suitable.
The ANND, as expressed in Eq. ( 4), is computed by averaging the distances between each point in one path and its nearest neighbor in the other path.Then, the similarity value (i.e., ANND) of this pair of paths is stored as an element in the distance matrix.A lower ANND indicates that the paths within a cluster are more similar.The distance matrix will have dimensions (m × m) , where m is the number of paths.For instance, the computed distance matrix for a set of 12 paths is illustrated in Figure 2.
After the construction of the distance matrix, the machine learning (ML) technique is applied to cluster the paths based on their corresponding values in this distance matrix.The ML techniques that we used are k-means, Gaussian Mixture Model (GMM), and hierarchical clustering.3) Segmented Gaussian Likelihood Method: In addition to identifying the vessel's path, to better understand how the vessel changes its paths, we employ Gaussian distributions on several distinct segments of the vessel route.This technique can be summarized as follows: • Utilize a training dataset comprising vessel position information that should adequately represent all potential paths of the vessel route.
• Divide the route into different distinct segments.
• Train a single GMM model for each segment to find the Gaussian distributions of all segments of the route.
• Estimate likelihoods of the segments by using the trained GMM models with their corresponding segments of each vessel voyage in the test dataset.
• Label the path classes based on the estimated likelihood at the unique segments of the route.
IV. CASE STUDY In this section, we describe the case study, including the data collection, preprocessing, and analysis.

A. Data Collection
In this study, we utilized a dataset that was collected from a ship named Cinderella II, which is a passenger ship that operates in the Stockholm archipelago.The dataset spans over five months, from July to November 2022.The dataset comprises information on 124 voyages of this vessel, connecting the two main ports of Vaxholm in the east and Sodra in the west.
Additional information about Cinderella II can be found in [23].For instance, the specifications of this vessel are 19 meters in length and 6.41 meters in breadth, boasting a gross tonnage of 68.Operating under the flag of Sweden, its maritime mobile service identity (MMSI) is 265513810 with a reported draught of 2.5 meters and a recorded average speed of 8.2 knots (4.2 m/s).

B. Data Preparation and Analysis
In our approach, we emphasize the significance of data representation.As a result, we group the path points based on their timestamps with a resolution of one second and store these grouped path points with distinctive Voyage IDs.Afterward, these paths are ready to be processed by the path clustering approach to determine the overall path class.In order to exploit the resulting class information, statistical analysis is conducted to determine differences in vessel paths concerning fuel, time, distance, and speed.The histograms representing these quantities across various path classes can be found in Figure 7. Notably, when the vessel traverses the shorter southern paths, it employs slower speeds, effectively reducing fuel consumption without significantly impacting travel time.

V. RESULTS AND DISCUSSION
In this section, we present the results of our spatial clustering approach for vessel path identification and discuss the implications of these findings.The approach involved the utilization of position information and various clustering techniques, specifically k-means, hierarchical clustering, and Gaussian distributions clustering, with the dataset containing 124 voyages.

A. Evaluation of Path Clustering
The results of vessel path identification are evaluated through visual inspection and tabulation using metrics such as confusion matrix, precision, recall, and F1-score [24].
The hits and messes of path clustering are presented by the confusion matrix.For our results of path clustering, the confusion matrix is a one-vs-one type matrix.Then, the confusion matrix is converted into a one-vs-all type matrix (binary-class confusion matrix) as shown in Eq. ( 5), for calculating classwise metrics like precision, recall, and F1-score.
The following performance metrics were used: • Precision: the ratio of true positives to the total number of predicted positives.
• Recall: the ratio of true positives to the total number of actual positives in the data.

P recision =
T P T P + F P (6)

B. Discussion
Table I shows the results of applying k-means or Gaussian Mixture Model (GMM) models to identify the vessel paths from the distance matrix in the distance-based method of the path identification approach.Notably, the paths with classes of North-West, South, and South-West achieved an F1-score of 1.0, indicating that the approach correctly identified all the paths of these classes.
In contrast, the North-East and North-Middle paths exhibited lower F1 scores compared to other classes.The path class of North-Middle is the most challenging path to identify since six such paths have been clustered as North-East, as can be seen by comparing Figures 8 and 9, which are the visualization for all the paths, color-marked based on their ground truth classes.Figure 10 illustrates the probability distribution of mis-clustered paths with respect to latitude and longitude coordinates.It is obvious that these paths have nearly identical coordinates, which makes them challenging paths to cluster with k-means or GMM.This suggests that there is still room for improvement by using other ML clustering methods.
Table II presents the results of employing hierarchical clustering to the distance matrix in the distance-based method for clustering the vessel paths.In hierarchical clustering, there is a parameter called "Dendrogram Cut-off threshold," and its value should be selected depending on the number of path classes.Hence, as illustrated in Figure 11, this parameter is denoted by the Y-axis as a clustering height, and its value is set to 100 for clustering the vessel path into five classes.
Remarkably, all path classes achieved an F1-score of 1, indicating that hierarchical clustering successfully identified all paths with high accuracy from the distance matrix using the distance-based method.This suggests that the choice of ML clustering technique with the distance matrix can influence the accuracy of path identification.Table II displays the outcomes of clustering, now by applying the segmented likelihood Gaussian method.This method achieved perfect precision, recall, and F1-score for all path classes.Figures 12, 13, and 14 present visualizations for the segmented Gaussian likelihood method.
The accuracy in results by hierarchical and segmented Gaussian likelihood clustering for path classes indicates the efficacy of the developed approach of spatial clustering for vessel path identification.

VI. CONCLUSION
The proposed approach is able to identify the vessel paths with partially defined or unknown paths.In the distancebased method, the hierarchical clustering used in the approach outperforms k-means and GMM clustering techniques.
The approach of hierarchical clustering includes a usercustomizable parameter, a cut-off threshold, which allows  desired control for the number of path classes, enhancing the flexibility and adaptability of the proposed approach.
In the distance-based method, adopting ANND as a measure of similarity makes path clustering less affected by noise or outliers and provides a more intuitive interpretation of path similarity, ultimately enhancing the robustness and interpretability of our approach.
The segmented Gaussian likelihood method is particularly useful for identifying and analyzing the vessel path alterations at different segments of the vessel route.
The proposed approach is computationally efficient and has the potential to be a valuable tool for planning vessel paths.Accurate path identification can contribute to safer and more efficient maritime transportation practices, aiding in route planning, collision avoidance, and navigation optimization.
Nevertheless, the framework has some potential limitations, such as the segmented Gaussian likelihood method exhibiting sensitivity to segment definition, which could affect its salable performance, particularly in complex maritime scenarios.Moreover, while the study case demonstrates that the framework is computationally efficient, it is essential to discuss any potential scalability issues, especially when dealing with large datasets, since the computational efficiency may vary depending on the dataset size and the nature of the paths.
Further work could explore the scalability and real-world applicability of the proposed clustering approach, as well as its integration with related systems of maritime transportation.

APPENDIX SUPPLEMENTARY MATERIALS
The source codes that are implemented on Python 3.9.7 to produce the results are available at: https://github.com/MohamedAbuella/PathClustering.The dataset is private and cannot be shared due to the crucial commercial interests of the startup company operating the iHelm system.

Fig. 2 :
Fig. 2: Part of the distance matrix, in this case showing 12 paths.

Fig. 9 :
Fig. 9: Results of both k-means and GMM clustering to five path classes.

Fig. 10 :
Fig. 10: Probability distribution of location coordinates for mis-clustered paths by both k-means and GMM.

Fig. 11 :
Fig. 11: Results of hierarchical clustering to five path classes.

Fig. 12 :
Fig. 12: Distribution of the route into eight segments.

Fig. 13 :
Fig. 13: Probability distributions of location coordinates for the eight segments of the route.

Fig. 14 :
Fig. 14: Gaussian distributions for seven segments of the route.

Mohamed
Abuella received his M.S. and PhD degrees in Electrical and Computer Engineering from Southern Illinois University at Carbondale and University of North Carolina at Charlotte, in 2012 and 2018 respectively.He is a postdoctoral researcher at Halmstad University since 2022.His research interests include energy analytics and AI for sustainability.M. AMINE ATOUI obtained his Ph.D. degree from LARIS, Polytech' Angers, France, in 2015.Currently, he is affiliated with the Center for Applied Intelligent Systems Research at Halmstad University, Sweden.His research interests encompass probabilistic and explainable Machine Learning, causal and Bayesian Inference, transmission/communication, and automatic control.Slawomir Nowaczyk is a Professor in Machine Learning at the Center for Applied Intelligent Systems Research, Halmstad University, Sweden.He received his MSc degree from Poznan University of Technology in 2002 and his PhD from the Lund University of Technology in 2008.During the last decades, his research has focused on machine learning, knowledge representation, and self-organising systems.The majority of his work concerns industrial data streams, often with predictive maintenance as the goal.Given that accurate and relevant labels are usually impossible to obtain in such settings, Slawomir's contributions primarily take advantage of proxy labels, such as transfer learning and multitask learning, or concern semi-supervised and unsupervised modelling.He is a board member of the Swedish AI Society and a research leader for the School of Information Technology at Halmstad University.Slawomir has led multiple research projects on applying Artificial Intelligence and Machine Learning in different domains, such as transport and automotive, energy, smart cities, and healthcare.In most cases, this research was done in collaboration with industry and public administration organisations -inspired by practical challenges and leading to tangible results and deployed solutions.Simon Johansson is a MSc graduate of Chalmers University of Technology's program in Engineering Mathematics and Computational Science in 2020, currently works in Cetasol, a marine company specialising in CO2 reduction and energy optimisation for vessels.His practical application of computational methods and dedication to environmental sustainability align with his role, contributing to global efforts to mitigate climate change.Simon's commitment to advancing eco-friendly solutions in the marine industry reflects a seamless integration of academic excellence and real-world impact.Ethan Faghani is the CEO and founder of Cetasol.Before Cetasol, Ethan was Chief Engineer of Automation and AI at Volvo Penta.Ethan has experience working with cutting-edge technologies in other transportation segments in both big enterprises and his own founded startup.Ethan obtained his Ph.D. in mechatronics from UBC and Innovation and Entrepreneurship from Stanford Business School.

TABLE I :
Result of implementing of both k-means and GMM clustering to five path classes (a) Precision, Recall, and F1-score

TABLE II :
Result of implementing both hierarchical clustering and segmented likelihood Gaussian clustering to five path classes (a) Precision, Recall, and F1-score