Asynchronous Trajectory Matching-Based Multimodal Maritime Data Fusion for Vessel Traffic Surveillance in Inland Waterways

The automatic identification system (AIS) and video cameras have been widely exploited for vessel traffic surveillance in inland waterways. AIS data provide vessel identity and dynamic information on vessel position and movements. In contrast, video data describe the visual appearance of moving vessels but carry no information on identity, position, movements, etc. To further improve vessel traffic surveillance, it becomes necessary to fuse the AIS and video data to simultaneously capture the visual features, identity, and dynamic information for the vessels of interest. However, the performance of AIS and video data fusion is susceptible to issues such as spatial differences between the data, asynchronous message transmission, visual object occlusion, etc. In this work, we propose a deep learning-based simple online and real-time vessel data fusion method (termed DeepSORVF). We first extract the AIS- and video-based vessel trajectories, and then propose an asynchronous trajectory matching method to fuse the AIS-based vessel information with the corresponding visual targets. In addition, by combining the AIS- and video-based movement features, we also present a prior knowledge-driven anti-occlusion method to yield accurate and robust vessel tracking results under occlusion conditions. To validate the efficacy of our DeepSORVF, we have also constructed a new benchmark dataset (termed FVessel) for vessel detection, tracking, and data fusion. It consists of many videos and the corresponding AIS data collected in various weather conditions and locations. The experimental results demonstrate that our method is capable of guaranteeing highly reliable data fusion and anti-occlusion vessel tracking. The DeepSORVF code and FVessel dataset are publicly available at https://github.com/gy65896/DeepSORVF and https://github.com/gy65896/FVessel, respectively.

prediction [2], radar-based object detection [3], and video-based object detection [4]. It is well known that each type of sensor has its own advantages and disadvantages under the same scenarios. As a consequence, numerous efforts have been devoted to simultaneously exploiting the multi-source data [1], [5]-[9] to promote the traffic situational awareness for maritime transportation systems. However, these fusion methods mainly consider only the positional relationship of the same target at a certain moment. It thus becomes difficult to guarantee high-quality data fusion, especially in the presence of time delay, missing data, random outliers, etc. The same moving vessel essentially exhibits similar navigation behaviors across sensors, which can be represented using time-series data, e.g., spatio-temporal trajectories. To further improve the stability and accuracy of data fusion, we first extract the vessel trajectories from the raw sensing data, and then propose a trajectory matching-based fusion method (termed DeepSORVF) in this work.

A. Motivation and Contribution
Owing to the remote, intuitive, and real-time advantages of CCTV, terrestrial video surveillance systems have been widely used in inland waterborne transportation to improve traffic situational awareness and the monitoring of abnormal vessel behavior [10]. In particular, massive monitoring cameras can provide indispensable visual information for guaranteeing maritime safety. To fully use these visual features, many efforts have focused on vessel detection and tracking to meet the requirement of intelligent supervision [11]-[13]. However, these methods can only detect the moving vessels in the video images. They cannot obtain the important identity information (e.g., vessel name and size) and dynamic information (e.g., vessel speed and course).
Other maritime awareness equipment, such as AIS and radar, could provide much richer attribute information about the vessel. In particular, the AIS data contains rich vessel identity and spatio-temporal information, which makes it play an essential role in analyzing abnormal vessel behaviors. The AIS data mainly contains static and dynamic information, e.g., maritime mobile service identification (MMSI), vessel size, speed, course, position, etc. However, the AIS data essentially suffers from inconsistent time intervals, which limits its application in maritime intelligent transportation [14]. Radar has been widely used in near-port supervision since it can provide the accurate distance and bearing of vessels. Unfortunately, some radar equipment is forbidden to be installed in populated regions to avoid high-frequency electromagnetic radiation harming people's health [1]. In the literature [15], many methods have been proposed to robustly and accurately fuse the AIS and radar data. As long as we simultaneously collect the AIS, radar, and video data, we can directly adopt the existing advanced methods to fuse the AIS and radar data, and then implement fusion with the video data. Intuitively, the fusion of AIS and video data seems more difficult because of the different coordinate systems, asynchronous data collection, different data structures, etc. Therefore, we choose to fuse only the AIS and video data to enhance the traffic situational awareness for intelligent surveillance in inland waterways.
arXiv:2302.11283v1 [cs.CV] 22 Feb 2023
In this work, we propose a deep learning-based simple online and real-time vessel data fusion method (termed DeepSORVF) for promoting inland waterways surveillance. The main contributions of this work are as follows:
• We build two simple yet efficient methods to respectively extract the AIS- and video-based vessel trajectories for data fusion. To avoid the interference of vessel occlusion on video-based trajectory extraction, we propose a prior knowledge-driven anti-occlusion tracking method.
• We design a novel asynchronous trajectory matching method to achieve the robust fusion of AIS and video data. The proposed method adopts an enhanced fast dynamic time warping algorithm for trajectory similarity measure and employs an AIS/video association method to decrease the computational cost and increase the stability.
• We construct a public benchmark dataset (termed FVessel) for vessel detection, tracking, and data fusion, which consists of many videos and the corresponding AIS data collected in various weather conditions and locations.
To the best of our knowledge, our DeepSORVF is the first trajectory matching-based computational method to fuse the AIS and video data for inland waterways surveillance. Meanwhile, we have verified the effectiveness and robustness of the proposed method on our newly developed FVessel dataset.

B. Organization
The rest of this paper is organized as follows. Section II briefly reviews the recent research on object detection and tracking, and on AIS and video data fusion. In Section III, the proposed data fusion framework is described in detail. Section IV presents extensive experiments to demonstrate the effectiveness of our method. Finally, Section V summarizes the main contributions of this work.

II. RELATED WORKS
This section mainly introduces the recent studies related to our work, i.e., multi-object detection and tracking, and AIS and video data fusion.

A. Multi-Object Detection and Tracking
Multi-object detection and tracking methods are generally divided into two categories, namely traditional and deep learning methods. Due to the particularity of the research issue, this section mainly reviews the related works on vessel detection and tracking.
1) Traditional Methods: Background subtraction (BS) is a classic object detection method. Although many BS-based methods have been proposed to detect conventional objects, these methods still achieve poor precision in vessel detection [16]. To improve the precision of the BS-based method, Hu et al. [17] designed a robust foreground detection and background update method to effectively reduce the influence of waves. Bloisi et al. [18] proposed an independent multi-modal background subtraction (IMBS) algorithm. In particular, this algorithm models highly dynamic backgrounds (e.g., water) by creating a "discretization" of an unknown distribution. Furthermore, other types of vessel detection methods have been proposed. For instance, Zhu et al. [19] designed a hierarchical complete-based vessel detection approach for spaceborne optical images. Zhang et al. [20] proposed a vessel detection algorithm using the discrete cosine transform (DCT)-based Gaussian mixture model (GMM) for efficient visual maritime surveillance on non-stationary surface platforms. Chen et al. [21] achieved vessel object tracking using multi-view learning and sparse representation. Although many techniques have been introduced to improve the performance of detectors, hand-designed features still produce poor robustness in vessel detection. Meanwhile, the high computational complexity of some methods hinders their practical application.
2) Deep Learning Methods: With the emergence and rapid development of graphics processing units (GPUs), deep learning technology is widely used in the field of image processing. Many deep learning methods have been proposed for object detection, e.g., region-based convolutional neural network (R-CNN) [22], [23], single shot multibox detector (SSD) [24], [25], and you only look once network (YOLO) [26]-[29]. Based on these object detection networks, many vessel detection methods have been further researched. Shao et al. [30] proposed a YOLOv2-based saliency-aware network for vessel detection, which combined the salient features and coastline features to predict more accurate vessel positions. Liu et al. [12] built an enhanced YOLOv3 network to promote vessel detection in video-based maritime surveillance. To reduce the impact of poor weather environments on vessel detection, this method constructed a data enhancement strategy to improve vessel detection precision in low-light, hazy, and rainy images. Furthermore, Chen et al. [31] proposed a small vessel detection method based on an improved generative adversarial network (GAN) and a convolutional neural network (CNN). Feng et al. [4] proposed a ship detection method based on the multi-size gradient features and multi-branch support vector machine (SVM). Yang et al. [32] applied visual object tracking and semi-supervised object segmentation to the vessel tracking task, and proposed an enhanced SiamMask network.

B. AIS and Video Data Fusion
In the current literature, many AIS and video data fusion methods have been proposed. For instance, Chen et al. [5] proposed a single-vessel tracking method by combining AIS and video data. In particular, this method could make the camera focus on the vessel according to the position information provided by the AIS, and use the Kalman filter to ensure the smoothness of the tracking. However, the operator fails to accurately obtain the identities and attributes of each vessel when the field of view contains multiple vessels. Therefore, more researchers began to focus on the information fusion of multiple vessels. For instance, Man et al. [6] fused the AIS and video data with the Kalman filter to obtain the optimal vessel trajectory. Bloisi et al. [1] proposed an automated maritime surveillance system that replaces radar sensors with vision sensors, which can be deployed in densely populated regions. Lu et al. [7] proposed a vision and AIS fusion method, which estimated the distance and azimuth of the detected visual vessel from the camera and fused it with the position information in the AIS data. Huang et al. [8] designed a novel multi-vessel tracking technology based on the improved single shot multi-box detector (SSD) [24] and DeepSORT [33] algorithm, and used a multi-modal data fusion algorithm to display the AIS information of visual targets. Recently, Liu et al. [9] constructed an intelligent edge-enabled shipboard navigation system based on augmented reality, deep object detection, and multi-source data fusion technologies. This system can achieve stable vessel detection under various complex weather conditions and fuse the detected vessel targets with synchronized AIS information.

III. DEEPSORVF: DEEP LEARNING-BASED SIMPLE ONLINE AND REAL-TIME VESSEL DATA FUSION
In this section, the details of our method will be introduced. Fig. 1 displays the flowchart of our data fusion method, including AIS-based vessel trajectory extraction, video-based vessel trajectory extraction, and asynchronous vessel trajectory matching. For the AIS data, we perform data cleaning and delayed data prediction to obtain high-quality AIS data. To guarantee that the AIS and video data are in the same coordinate system, we use the pinhole model to project the AIS data to the pixel coordinate system. For the video data, we first use the YOLOX network to detect vessel targets. To avoid the impact of vessel occlusion on the video-based trajectory, a prior knowledge-driven anti-occlusion tracking method is then used for video-based vessel trajectory extraction. During trajectory matching, we adopt the enhanced fast dynamic time warping algorithm (E-FastDTW) to calculate the similarity between trajectories and combine it with the Hungarian algorithm to obtain the matching results. It is worth mentioning that the matching result will be input into the video-based vessel trajectory extraction task at the next moment as prior knowledge. Based on our matching results, the AIS information (including MMSI, longitude, latitude, speed, course, heading, etc.) and the visual vessel can be easily fused to facilitate inland waterways surveillance.

A. AIS-Based Vessel Trajectory Extraction
The AIS is widely used in maritime services since it can provide integrated and rich vessel information. However, due to the limitation of the AIS working principle, AIS data cannot be acquired in real time. Meanwhile, some abnormal and redundant AIS information will affect the accuracy and robustness of AIS-based trajectory extraction. Furthermore, matching the two data sources is difficult since AIS- and video-based trajectories are, respectively, in the WGS-84 and pixel coordinate systems. Therefore, we extract the AIS-based trajectory projected in the pixel coordinate system for data fusion. To achieve this goal, we first process the AIS data to generate high-quality AIS data. Subsequently, the AIS-based vessel trajectory is obtained by the pinhole model.
1) AIS Data Processing: Fig. 2 displays our framework for processing AIS data. The historically processed AIS data and the AIS data received at the current moment are combined as the input. The input data is successively processed by data cleaning, data prediction, and data re-cleaning to obtain high-quality AIS data as output. The data cleaning process deletes the AIS data outside the supervision region and abnormal data, i.e., records with missing or abnormal latitude, longitude, heading, speed, or MMSI. The data prediction module estimates the position of vessels whose AIS information has not yet been received at the current moment. Let $v_{t-1}$ be the speed of the vessel at time $T_{t-1}$; the moving distance $D_{\Delta t}$ over the time interval $\Delta t = T_t - T_{t-1}$ can be expressed as $D_{\Delta t} = v_{t-1} \cdot \Delta t$. According to the longitude $\lambda_{t-1}$, latitude $\varphi_{t-1}$, and course $\theta_{t-1}$ at time $T_{t-1}$, and the moving distance $D_{\Delta t}$, the longitude $\lambda_t$ and latitude $\varphi_t$ at time $T_t$ can be generated by the forward geodetic computations.
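The prediction step above can be illustrated with a minimal sketch. It assumes a spherical Earth with a mean radius (a simplification of the forward geodetic computation on the WGS-84 ellipsoid), a speed given in knots, and a course measured clockwise from true north; the function name `predict_position` and the constants are illustrative and not taken from the DeepSORVF code.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth, not the WGS-84 ellipsoid
KNOT_TO_MPS = 0.514444        # one knot in metres per second


def predict_position(lon_deg, lat_deg, course_deg, speed_knots, dt_s):
    """Dead-reckon (lon, lat) after dt_s seconds at constant speed and course."""
    distance_m = speed_knots * KNOT_TO_MPS * dt_s   # D = v * dt
    delta = distance_m / EARTH_RADIUS_M             # angular distance on the sphere
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    theta = math.radians(course_deg)                # clockwise from true north

    # Forward (direct) problem on a sphere.
    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(theta)
    )
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lon2), math.degrees(lat2)


if __name__ == "__main__":
    # A vessel heading due north at 10 knots, predicted 60 s ahead.
    print(predict_position(114.3, 30.5, 0.0, 10.0, 60.0))
```

For the short prediction horizons involved here (seconds to a few minutes), the spherical approximation differs from an ellipsoidal solution by far less than typical AIS position noise, but an ellipsoidal solver would be the faithful choice.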
2) Vessel Positioning via Coordinate Transformation: To fuse the high-quality AIS information and visual object, it is necessary to unify the different source data into the same coordinate system. In this work, we project the AIS information in the world coordinate system (WCS) to the pixel coordinate system (PCS). Before the coordinate transformation, we first perform a Mercator projection on the original position of the AIS information. Let $(U, V, W)$ be the real vessel position in the 3D WCS; its 2D projection coordinate $(x, y)$ in PCS can be obtained by
$$Z \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = K_{in} K_{ex} \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}, \tag{1}$$
with $Z$ being the scale factor. Here, $K_{in}$ and $K_{ex}$ are the internal and external parameter matrices of the camera, respectively. In this work, since the camera is fixed, we set the extrinsic parameter matrix $K_{ex}$ as an identity matrix. In particular, we directly use the pinhole model to estimate $K_{in}$. Please refer to Ref. [34] for more details on the internal parameter estimation. Finally, we sequentially save the AIS data with the same MMSI into the same list in time order to build a set of all AIS-based vessel trajectories $T_{ais} = \{X_{a_1}, \ldots, X_{a_i}, \ldots, X_{a_I}\}$, with $X_{a_i}$ and $I$ being the $i$-th AIS trajectory and the number of AIS-based vessel trajectories, respectively.
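As a rough illustration of the pinhole projection and the per-MMSI trajectory lists described above, the sketch below assumes the point is already expressed in the camera frame (consistent with an identity extrinsic matrix) and that the intrinsic matrix has the usual form with focal lengths `fx`, `fy` and principal point `cx`, `cy`. These parameter names, the record layout, and both function names are hypothetical.

```python
from collections import defaultdict


def project_to_pixel(point, fx, fy, cx, cy):
    """Project a 3D point (U, V, W) to pixel coordinates with a pinhole model.

    Assumes K_ex is the identity, so (U, V, W) is already in the camera frame
    and the scale factor Z equals the depth W.
    """
    U, V, W = point
    if W <= 0:
        raise ValueError("point must lie in front of the camera (W > 0)")
    # K_in @ [U, V, W]^T with K_in = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    zx = fx * U + cx * W
    zy = fy * V + cy * W
    return zx / W, zy / W


def build_trajectories(records):
    """Group (mmsi, timestamp, x, y) records into time-ordered lists per MMSI."""
    tracks = defaultdict(list)
    for mmsi, t, x, y in sorted(records, key=lambda r: r[1]):
        tracks[mmsi].append((t, x, y))
    return dict(tracks)


if __name__ == "__main__":
    print(project_to_pixel((1.0, 0.0, 10.0), fx=1000, fy=1000, cx=960, cy=540))
    print(build_trajectories([(1, 2, 5.0, 5.0), (1, 1, 4.0, 4.0), (2, 1, 0.0, 0.0)]))
```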

B. Video-Based Vessel Trajectory Extraction
Although many methods have been proposed to achieve vessel detection and tracking [4], [35], [36], it is still intractable to extract high-quality video-based vessel trajectories for data fusion. In the actual application of video-based maritime surveillance, occlusion between vessels inevitably occurs in cross encounter, confrontation, and chasing situations. Generally, it becomes difficult to accurately and robustly detect these vessels under the occlusion condition. Meanwhile, the appearance of an occluded vessel will be seriously affected by other vessels. To improve the quality of extracted trajectories, we propose a prior knowledge-driven anti-occlusion tracking method, as shown in Fig. 3.
Specifically, we first adopt the YOLOX network to detect the visual vessel objects and obtain a set of bounding boxes, i.e.,
$$\{box_1, \ldots, box_l, \ldots, box_L\}, \tag{2}$$
where $box_l$ is the location of the $l$-th bounding box and $L$ denotes the number of bounding boxes. Before the tracking, the results of the previous moment are input as the prior knowledge. Firstly, a set of occlusion areas $OAR$ is used, which depends on the ratio of the occlusion area to the bounding boxes. The judgment metric of the occlusion area, Eq. (3), compares the ratio $S_o / S_r$ with the threshold $\omega$, where $S_o$ is the area of the occluded part, $S_r$ is the area of the $r$-th occluded bounding box, $R$ is the number of occluded bounding boxes, and $\omega$ represents the anti-occlusion threshold.
When the ratio exceeds $\omega$, we store the location of the smallest rectangle box $AR$ that contains all occluded bounding boxes into $OAR$. Meanwhile, the AIS-based vessel trajectories at the current moment $T_{ais}$ and the video-based vessel trajectories at the previous moment $T^{last}_{vis}$ are also used as prior information, which can be given by
$$T_{ais} = \{X_{a_1}, \ldots, X_{a_i}, \ldots, X_{a_I}\}, \quad T^{last}_{vis} = \{Y_{v_1}, \ldots, Y_{v_j}, \ldots, Y_{v_J}\}, \tag{4}$$
where $X_{a_i}$ and $Y_{v_j}$ represent the trajectory series of the $i$-th AIS target and the $j$-th visual target, respectively, and $I$ and $J$ are the numbers of AIS- and video-based vessel trajectories, respectively. Besides, we consider the AIS/video association results $B^{last}$ at the previous moment and the vessel appearance embedding $F^{last}_{id}$ before the occlusion, which can be given by
$$B^{last} = \{\ldots, [a_i, v_j], \ldots\}, \quad F^{last}_{id} = \{\ldots, f_{v_b}, \ldots\}, \tag{5}$$
where $[a_i, v_j]$ means that the $i$-th AIS target $a_i$ and the $j$-th visual target $v_j$ are successfully associated, and $f_{v_b}$ is the appearance embedding of the $b$-th visual target before occlusion. We describe the generation of the AIS/video association results $B^{last}$ in detail in Section III-C. For the anti-occlusion tracking, the detection results located in the occlusion area $OAR$ are removed to avoid the misdetection caused by vessel overlapping. Based on the fusion results at the previous moment, the corresponding AIS information of the occluded visual vessel is available. Therefore, the location of the occluded vessel's bounding box at the current moment $box_{pre}$ can be estimated by
$$box_{pre} = \left(x^{last}_{tl} + \Delta x_{ais},\; y^{last}_{tl} + \Delta y_{ais},\; x^{last}_{br} + \Delta x_{ais},\; y^{last}_{br} + \Delta y_{ais}\right), \tag{6}$$
where $(x^{last}_{tl}, y^{last}_{tl})$ and $(x^{last}_{br}, y^{last}_{br})$ are the pixel indexes of the top-left and bottom-right points of the previous bounding box, respectively, and $\Delta x_{ais}$ and $\Delta y_{ais}$ are the horizontal and vertical motion speeds, which are equal to the displacement of the AIS information between the current and previous moments.
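The occlusion-area bookkeeping used at the start of this step can be sketched as follows. This is only an illustration under stated assumptions: boxes are `(x_tl, y_tl, x_br, y_br)` tuples, overlap is tested pairwise, and a pair counts as occluded when the overlap divided by either box area exceeds `omega`. The pairwise grouping and the function names are assumptions, not the authors' implementation.

```python
def box_area(box):
    """Area of an axis-aligned box (x_tl, y_tl, x_br, y_br)."""
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def occlusion_areas(boxes, omega=0.0):
    """Return the enclosing rectangle AR of every pair whose overlap ratio exceeds omega."""
    areas = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            s_o = intersection_area(a, b)
            ratios = [s_o / box_area(r) for r in (a, b) if box_area(r) > 0]
            if ratios and max(ratios) > omega:
                # Smallest rectangle containing both occluded boxes.
                areas.append((min(a[0], b[0]), min(a[1], b[1]),
                              max(a[2], b[2]), max(a[3], b[3])))
    return areas


if __name__ == "__main__":
    detections = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
    print(occlusion_areas(detections, omega=0.0))
```

With `omega = 0`, any non-empty overlap marks an occlusion area, which matches the threshold setting reported in the implementation details.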
For the occluded vessels without the corresponding AIS information, the bounding box is predicted via the visual motion features. The prediction result can also be given by a variant of Eq. (6). The horizontal and vertical motion speeds $(\Delta x_{ais}, \Delta y_{ais})$ are replaced with the visual trajectory-based horizontal and vertical motion speeds $(\Delta x_{vis}, \Delta y_{vis})$, which are calculated in Eq. (7) from $(x_{t-1}, y_{t-1})$ and $(x_{t-\delta}, y_{t-\delta})$, the points of the video-based vessel trajectory at the previous moment and at the previous $\delta$-th moment, respectively. After prediction, we update $OAR$ and $F_{id}$. Based on the predicted detection box position, the occlusion area list $OAR$ is updated via Eq. (3) for the anti-occlusion tracking at the next moment. For the update of $F_{id}$, we denote the occluded visual target as $v_j$. If $f_{v_j}$ exists in the original $F_{id}$, we directly store $f_{v_j}$ into the new $F_{id}$; otherwise, the appearance embedding of $v_j$ at the previous moment is stored in the new $F_{id}$. Then, we employ a wide residual network $G$ to extract the vessel appearance embedding in normal bounding boxes, and assign the vessel appearance embedding before occlusion in $F_{id}$ to occluded bounding boxes. Finally, the bounding boxes and the corresponding vessel appearance embeddings are jointly input into DeepSORT to generate the video-based vessel trajectories at the current moment $T_{vis}$.
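The two box-prediction variants described above (AIS-driven when an associated AIS track exists, visual-motion-driven otherwise) can be sketched as below. The per-step visual speed is taken here as the displacement between the last point and the point `delta` samples back, divided by the `delta - 1` steps between them; that normalisation is an assumption, since the exact form of the visual speed equation is not recoverable from this text. All function names are illustrative.

```python
def shift_box(box, dx, dy):
    """Translate a box (x_tl, y_tl, x_br, y_br) by (dx, dy)."""
    x_tl, y_tl, x_br, y_br = box
    return (x_tl + dx, y_tl + dy, x_br + dx, y_br + dy)


def visual_speed(track, delta):
    """Per-step pixel speed from the last point and the point delta samples back.

    Assumption: samples are evenly spaced, so delta - 1 steps separate them.
    """
    (x_new, y_new), (x_old, y_old) = track[-1], track[-delta]
    steps = delta - 1
    return (x_new - x_old) / steps, (y_new - y_old) / steps


def predict_occluded_box(last_box, ais_track=None, vis_track=None, delta=5):
    """Predict the current box of an occluded vessel from its previous box."""
    if ais_track is not None and len(ais_track) >= 2:
        # AIS displacement (in pixels) between the previous and current moments.
        dx = ais_track[-1][0] - ais_track[-2][0]
        dy = ais_track[-1][1] - ais_track[-2][1]
    elif vis_track is not None and delta >= 2 and len(vis_track) >= delta:
        dx, dy = visual_speed(vis_track, delta)
    else:
        dx, dy = 0.0, 0.0  # no motion prior available: keep the box in place
    return shift_box(last_box, dx, dy)


if __name__ == "__main__":
    print(predict_occluded_box((10, 20, 30, 40), ais_track=[(100, 50), (103, 49)]))
    print(predict_occluded_box((10, 20, 30, 40),
                               vis_track=[(0, 0), (2, 1), (4, 2), (6, 3), (8, 4)]))
```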
It is worth mentioning that two metrics in DeepSORT are used to solve the ID assignment issue. Firstly, the Mahalanobis distances between the predicted Kalman states and the newly arrived locations are calculated as the location similarity metrics. Moreover, the cosine distances between the appearance embeddings are calculated as the appearance similarity metrics. In our method, the appearance features of the occluded vessels are kept consistent with the latest extractions before the occlusion. Therefore, as long as the predicted bounding box is close to the prediction of the Kalman filters, the ID of occluded vessels will not be assigned incorrectly. The pseudo code of the proposed anti-occlusion tracking method is shown in Algorithm 1.

C. Asynchronous Vessel Trajectory Matching
In this section, we propose a simple yet effective trajectory matching method to fuse the AIS- and video-based asynchronous vessel trajectories. Firstly, we adopt an enhanced fast dynamic time warping (E-FastDTW) algorithm, which takes the direction into account, to calculate the similarity of AIS- and video-based vessel trajectories. Based on the similarity measure result, the Hungarian algorithm is employed to generate the optimal matching result. To improve the stability and robustness of data fusion and reduce the computational cost, we employ an AIS/video association mechanism. When the number of successful pairings of two trajectories exceeds a pre-determined threshold, the AIS- and video-based vessel trajectories are associated directly without similarity evaluation.
1) Trajectory Similarity Measure via E-FastDTW: For trajectory-based data fusion, it is an important prerequisite to determine the similarities between the AIS- and video-based vessel trajectories. The Euclidean distance is a simple but effective similarity calculation method. However, it requires that the two trajectories to be matched have the same length. Meanwhile, the Euclidean distance treats two similar trajectories with only a slight shift along the time axis as significantly different. Therefore, dynamic time warping (DTW) has been proposed to ignore this shift [38]. Suppose we have two trajectories $X$ and $Y$ of length $P$ and $Q$, respectively, represented as
$$X = m_1, m_2, \ldots, m_p, \ldots, m_P, \quad Y = n_1, n_2, \ldots, n_q, \ldots, n_Q. \tag{8}$$
Based on the two trajectories, the DTW constructs a $P \times Q$ alignment matrix $d$, where $d(p, q)$ is the Euclidean distance between the points $m_p$ and $n_q$. Then, a warp path $W$ is defined to construct the mapping between $X$ and $Y$, which can be written as
$$W = w_1, w_2, \ldots, w_c, \ldots, w_C, \tag{9}$$
with $C$ being the length of $W$, and $\max\{P, Q\} \le C < P + Q$.
In particular, the warp path $W$ has three restrictions. For the sake of better understanding, we define the $(c-1)$-th and the $c$-th elements of $W$ as $w_{c-1} = (p', q')$ and $w_c = (p, q)$. These three constraints for the warp path can be defined as follows:
• Restriction 1: The first and the $C$-th elements of $W$ are $w_1 = (1, 1)$ and $w_C = (P, Q)$, respectively.
• Restriction 2: The adjacent elements of the warp path $W$ can only contain the adjacent coordinate points, including the diagonally adjacent ones. Therefore, $w_{c-1}$ can only be one of $\{(p-1, q), (p, q-1), (p-1, q-1)\}$.
• Restriction 3: The elements of the warp path $W$ are monotonically increasing in time, i.e., $p' \le p$ and $q' \le q$.
Under the premise of satisfying the above three constraints, DTW only focuses on the path with the minimum cumulative distance of alignment matrix elements corresponding to all points [38]. Meanwhile, the included angle $\phi$ between the starting and ending points of $X$ and $Y$ is also considered. Finally, the similarity value $S(X, Y)$ between $X$ and $Y$ calculated by our proposed E-FastDTW is defined in Eq. (10), where $d(w_{cp}, w_{cq})$ is the Euclidean distance between the two data points corresponding to the $c$-th element in the warp path $W$, and $Dis(W)$ denotes the sum of all $d(w_{cp}, w_{cq})$ in the warp path $W$. To find the desired unique warp path, the DTW adopts the dynamic programming strategy. The cumulative distance $D(p, q)$ between $m_p$ and $n_q$ is the sum of the minimum cumulative distance of the three previous possible warp path elements and the Euclidean distance $d(p, q)$ between the points $m_p$ and $n_q$, which can be mathematically written as
$$D(p, q) = d(p, q) + \min\{D(p-1, q),\, D(p, q-1),\, D(p-1, q-1)\}. \tag{11}$$
Furthermore, we also adopt the multi-level approach used in FastDTW to speed up the time series similarity search and reduce the computational complexity. Please refer to Ref. [39] for more details on the multi-level approach.
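The dynamic programming recurrence described above can be illustrated with a short sketch of plain DTW on 2D trajectories. It deliberately omits the two ingredients that distinguish E-FastDTW, namely the included-angle term and the multi-level speed-up of FastDTW, so it should be read as the baseline those extensions build on. The function name `dtw_distance` is illustrative.

```python
import math


def dtw_distance(X, Y):
    """Minimum cumulative Euclidean distance over all valid warp paths.

    X and Y are non-empty lists of (x, y) points; they may differ in length.
    """
    if not X or not Y:
        raise ValueError("both trajectories must be non-empty")
    P, Q = len(X), len(Y)
    inf = float("inf")
    # D[p][q] is the cumulative distance aligning X[:p] with Y[:q].
    D = [[inf] * (Q + 1) for _ in range(P + 1)]
    D[0][0] = 0.0
    for p in range(1, P + 1):
        for q in range(1, Q + 1):
            d = math.hypot(X[p - 1][0] - Y[q - 1][0], X[p - 1][1] - Y[q - 1][1])
            # The three admissible predecessors: (p-1, q), (p, q-1), (p-1, q-1).
            D[p][q] = d + min(D[p - 1][q], D[p][q - 1], D[p - 1][q - 1])
    return D[P][Q]


if __name__ == "__main__":
    ais = [(0, 0), (1, 0), (2, 0)]
    video = [(0, 0), (1, 0), (1, 0), (2, 0)]  # same path, one repeated sample
    print(dtw_distance(ais, video))
```

Because the warp path may repeat a point of either trajectory, the repeated sample in `video` costs nothing, which is exactly the tolerance to time-axis shifts that motivates DTW over a pointwise Euclidean comparison.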
2) Trajectory Matching: In this work, we propose a novel matching method with higher precision and less computation. In particular, we match and associate the AIS-based vessel trajectories $T_{ais}$ mentioned in Section III-A and the video-based vessel trajectories $T_{vis}$ mentioned in Section III-B, which can be defined as follows
$$T_{ais} = \{X_{a_1}, \ldots, X_{a_i}, \ldots, X_{a_I}\}, \quad T_{vis} = \{Y_{v_1}, \ldots, Y_{v_j}, \ldots, Y_{v_J}\}, \tag{12}$$
where $X_{a_i}$ and $Y_{v_j}$ represent the trajectories of the $i$-th AIS target $a_i$ and the $j$-th visual target $v_j$, respectively, and $I$ and $J$ are the numbers of AIS- and video-based vessel trajectories, respectively. Furthermore, the numbers of AIS/video matches $M^{last}$ and the association results $B^{last}$ at the previous moment are also considered as input, i.e.,
$$M^{last} = \{\ldots, z_{a_i, v_j}, \ldots\}, \quad B^{last} = \{\ldots, [a_i, v_j], \ldots\}, \tag{13}$$
where $z_{a_i, v_j}$ is the number of successful matches of $X_{a_i}$ and $Y_{v_j}$, and $[a_i, v_j]$ means that $a_i$ and $v_j$ have been associated together. In the similarity measure, it is obviously time-consuming and intractable to adopt the E-FastDTW for calculating the similarity between all trajectories at each moment.
Inspired by the DeepSORT algorithm, we propose a trajectory association mechanism to solve these issues. In particular, if two trajectories have been recorded in $B^{last}$, the two trajectories are directly matched by default without similarity measurement against other trajectories. Subsequently, we perform the similarity measure between all trajectories and construct a similarity matrix $M_s$ of size $I \times J$, where $M_s(i, j)$ represents the similarity value of $X_{a_i}$ and $Y_{v_j}$. In particular, when the Euclidean distance between the last trajectory points of $X_{a_i}$ and $Y_{v_j}$ exceeds the maximum matching distance $D_{max}$, we consider the two trajectories to be completely different and set $M_s(i, j) = +\infty$. When the bound trajectory pair $[a_i, v_j]$ exists in $B^{last}$, we set $M_s(i, j) = -\infty$ and set all other entries in row $i$ and column $j$ to $+\infty$.
For other ordinary trajectory pairs that do not satisfy the above conditions, we employ Eq. (10) (i.e., E-FastDTW) to calculate the trajectory similarity. After obtaining the similarity matrix $M_s$, we adopt the Hungarian optimization algorithm to find the optimal matching result $O_{res}$, which contains the matching trajectory pair information, i.e.,
$$O_{res} = \{\ldots, [a_i, v_j], \ldots\}, \tag{14}$$
where $[a_i, v_j]$ means that $a_i$ and $v_j$ are matched together. Then, we generate the AIS/video matching results $M$ and association results $B$ at the current moment. More specifically, we iterate through all matching trajectory pairs in $O_{res}$. If the number of matching times $z_{a_i, v_j}$ of a trajectory pair $[a_i, v_j]$ in $O_{res}$ exists in $M^{last}$, we store $(z_{a_i, v_j} + 1)$ to $M$; otherwise, 1 is stored to $M$. In addition, we save the number of matching times $z_{a_i, v_j}$ for some trajectory pairs directly from $M^{last}$ into $M$. These $z_{a_i, v_j}$ need to satisfy two conditions, which can be defined as follows:
• $z_{a_i, v_j}$ must exist in $M^{last}$ but $[a_i, v_j]$ is not in $O_{res}$.
• The time interval between the last matching moment and the current moment is less than $T_{max}$.
For the generation of the AIS/video association result, we set a minimum number of matches $M^{at}_{min}$ as a threshold to ensure that the association information is accurate. When $z_{a_i, v_j}$ in $M$ is greater than $M^{at}_{min}$, we store $[a_i, v_j]$ into $B$. The pseudo code of the proposed trajectory matching method is shown in Algorithm 2.
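The update of the match counts and association results described above can be sketched with plain dictionaries. The data layout is an assumption made for illustration: pairs are `(ais_id, vis_id)` tuples, `m_last` maps a pair to its match count, and `last_seen` maps a pair to the time of its most recent match. The function name `update_matches` and the default thresholds (taken from the values reported in the implementation details) are illustrative.

```python
def update_matches(o_res, m_last, last_seen, now, t_max=15, m_min=15):
    """Return (m, b, seen): new match counts, association set, and last-match times.

    o_res     -- set of (ais_id, vis_id) pairs matched at the current moment
    m_last    -- dict pair -> number of matches up to the previous moment
    last_seen -- dict pair -> time (s) at which the pair was last matched
    """
    m, seen = {}, {}
    # Pairs matched now: increment an existing count or start at 1.
    for pair in o_res:
        m[pair] = m_last.get(pair, 0) + 1
        seen[pair] = now
    # Pairs not matched now: carry the count over while the gap is below t_max.
    for pair, count in m_last.items():
        t = last_seen.get(pair)
        if pair not in m and t is not None and now - t < t_max:
            m[pair] = count
            seen[pair] = t
    # Pairs matched more than m_min times become fixed associations.
    b = {pair for pair, count in m.items() if count > m_min}
    return m, b, seen


if __name__ == "__main__":
    m_last = {("a1", "v1"): 15, ("a2", "v2"): 3, ("a3", "v3"): 7}
    last_seen = {("a1", "v1"): 99, ("a2", "v2"): 95, ("a3", "v3"): 80}
    print(update_matches({("a1", "v1"), ("a4", "v4")}, m_last, last_seen, now=100))
```

In the example, the pair `("a3", "v3")` is dropped because its last match is 20 s old, while `("a1", "v1")` crosses the threshold and is promoted to an association.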

D. Implementation Details
This section mainly introduces the detailed settings of the proposed data fusion method. In particular, our method is implemented in Python 3.7. All experiments and tests are conducted on a PC with an Intel Core i5-10600KF CPU @ 4.10GHz and an Nvidia RTX A4000 GPU. To meet the requirement of real-time processing while ensuring accurate fusion, our method executes only one processing step per second. For the AIS-based vessel trajectory extraction, we delete the data more than two nautical miles from the camera and set the maximum storage time to two minutes. For the vessel detection task, we collect 20k images containing vessel objects as the training dataset. In training, we set the number of epochs to 100 and employ the Adam algorithm as the optimizer. The initial learning rates for the first 50 and last 50 epochs are $10^{-3}$ and $10^{-4}$, respectively. For the video-based vessel trajectory extraction, we set the occlusion area threshold $\omega = 0$ and the time span of visual motion feature extraction $\delta = 5$ s. For the AIS and video data fusion, we set the maximum matching distance $D_{max}$ to half of the horizontal size of the image, the minimum number of matching times $M^{at}_{min} = 15$, and the maximum time threshold $T_{max} = 15$ s.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we conduct extensive experiments on vessel detection, vessel tracking, and data fusion to quantitatively evaluate the performance of our proposed method. The running time analysis is also carried out to verify its practicality.

A. Benchmark Dataset
In this section, we construct a benchmark dataset for vessel detection, tracking, and data fusion (named FVessel) containing 26 videos and the corresponding AIS data captured by the HIKVISION DS-2DC4423IW-D dome camera and Saiyang AIS9000-08 Class-B AIS receiver on the Wuhan Segment of the Yangtze River. As shown in Fig. 4, these videos were captured at several locations (e.g., bridge region and riverside) and under various weather conditions (e.g., sunny, cloudy, and low-light). Table I displays more details about the FVessel dataset, including the video length, collection location, weather condition, the number of occlusions, the total number of vessels, and the number of vessels with AIS. To verify the superiority of the proposed module, we extract ten clips containing vessel occlusion from the FVessel dataset for comparison experiments on vessel detection, tracking, and data fusion. More detailed information on the test dataset can be found in Table II.

B. Experiments on Data Fusion
In this section, we implement the data fusion experiment to compare various methods, i.e., Euclidean distance-based data fusion (EDDF), multi-source data fusion (MSDF) [9], multimodal data-based ship tracking (MMDST) [8], DeepSORVF (w/o) without the anti-occlusion strategy, and DeepSORVF. In particular, the EDDF calculates the Euclidean distance between the pixel position points at the current moment for similarity measurement and employs the near-matching mechanism. For the point matching-based MSDF and MMDST, we only replace our asynchronous vessel trajectory matching part with their corresponding matching methods, so that the fusion effect is compared under the premise of consistent detectors. Furthermore, all methods process data only once per second.

Fig. 7. Visual fusion results of our DeepSORVF on the FVessel dataset from Table I.

1) Evaluation Metric: To evaluate the performance of data fusion, we first use a variant of multi-object tracking accuracy (MOTA) [40] as the evaluation metric and name it multi-object fusion accuracy (MOFA), i.e.,

$$\mathrm{MOFA} = 1 - \frac{FN_{mmsi} + FP_{mmsi}}{GT_{mmsi}},$$

where mmsi represents the identity of the vessel of interest (MMSI), and $FP_{mmsi}$, $FN_{mmsi}$, and $GT_{mmsi}$ are the numbers of MMSI false positives, MMSI false negatives, and MMSI ground truths, respectively. Furthermore, the identification precision (IDP), identification recall (IDR), and identification F1 (IDF1) are also employed as evaluation metrics. The IDP, IDR, and IDF1 can be given by

$$\mathrm{IDP} = \frac{TP_{id}}{TP_{id} + FP_{id}}, \quad \mathrm{IDR} = \frac{TP_{id}}{TP_{id} + FN_{id}}, \quad \mathrm{IDF}_1 = \frac{2\,TP_{id}}{2\,TP_{id} + FP_{id} + FN_{id}},$$

where $TP_{id}$, $FP_{id}$, and $FN_{id}$ are the numbers of ID true positives, ID false positives, and ID false negatives, respectively. In particular, the id is replaced with the identity of the vessel of interest (MMSI) in the data fusion task. We refer to Refs. [40], [41] for more details on the MOTA, IDP, IDR, and IDF1. Generally, higher MOFA, IDP, IDR, and IDF1 mean better fusion performance.
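As a worked illustration of these fusion metrics, the sketch below computes them from raw counts. It assumes that MOFA follows the MOTA form with only the MMSI false positives and false negatives in the numerator, and that IDP, IDR, and IDF1 take their standard identification forms; the function name `fusion_metrics` and the example counts are hypothetical.

```python
def fusion_metrics(fp_mmsi, fn_mmsi, gt_mmsi, tp_id, fp_id, fn_id):
    """Return MOFA, IDP, IDR, and IDF1 as fractions in [.., 1].

    Assumed forms (see the lead-in):
      MOFA = 1 - (FN_mmsi + FP_mmsi) / GT_mmsi
      IDP  = TP_id / (TP_id + FP_id)
      IDR  = TP_id / (TP_id + FN_id)
      IDF1 = 2 TP_id / (2 TP_id + FP_id + FN_id)
    """
    if gt_mmsi <= 0:
        raise ValueError("gt_mmsi must be positive")
    if tp_id + fp_id == 0 or tp_id + fn_id == 0:
        raise ValueError("identification counts must not all be zero")
    return {
        "MOFA": 1.0 - (fn_mmsi + fp_mmsi) / gt_mmsi,
        "IDP": tp_id / (tp_id + fp_id),
        "IDR": tp_id / (tp_id + fn_id),
        "IDF1": 2 * tp_id / (2 * tp_id + fp_id + fn_id),
    }


# Hypothetical counts, not taken from the paper's tables.
scores = fusion_metrics(fp_mmsi=5, fn_mmsi=10, gt_mmsi=100,
                        tp_id=90, fp_id=10, fn_id=10)
```

Note that MOFA can become negative when the number of errors exceeds the number of ground-truth MMSI instances, exactly as MOTA can.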
2) Fusion Results on Ten Clips: Table III displays the evaluation results on all clips. It can be found that EDDF and MSDF perform poorly. For clip-04 in particular, the MOFA is only 54.82% for EDDF and 53.73% for MSDF. The poor fusion effect stems from the fact that these methods only consider the current information without associating it with historical features. By considering the displacement direction of AIS- and video-based vessel trajectories, the MMDST greatly improves the fusion effect. However, the two DeepSORVF variants, which are based on vessel motion trajectory matching, perform better by comparison. In particular, after the anti-occlusion strategy is implemented, the performance of our DeepSORVF improves considerably across all metrics.
To provide a more intuitive explanation, we display two examples of data fusion obtained by MSDF, MMDST, DeepSORVF (w/o), and DeepSORVF in Figs. 5 and 6. Specifically, Fig. 5 displays the visualized data fusion results captured by the bridge region camera. Since the MSDF only considers the vessel characteristics at the current moment, the vessel information is more likely to be matched incorrectly. At the 80th second, the MSDF, MMDST, and DeepSORVF (w/o) are unable to match the vessel identification information since the detector fails to identify the partially occluded target.
For the data collected at the riverside, the visual vessels are often more severely occluded, resulting in the complete disappearance of target features. As shown in Fig. 6, the MSDF, MMDST, and DeepSORVF (w/o) produce more missed detections and false matches. It is worth mentioning that vessel occlusion also affects the trajectory feature extraction and causes feature matching failures. In contrast, the proposed DeepSORVF with the anti-occlusion strategy has a more stable data fusion effect and is suitable for a variety of scenarios.
3) Fusion Results on FVessel Dataset: Our DeepSORVF is also used to process more data in the FVessel dataset and calculate the MOFA, IDP, IDR, and IDF1. Table IV and Fig. 7 display the metric calculation results and the visualized fusion results, respectively. It can be found that the proposed method has stable fusion performance. The fusion accuracy (MOFA) of our method is between 73.19% and 98.68%, with an average of 91.13%. Our method also performs well on the other three metrics. As shown in Fig. 7, the results generated by our DeepSORVF are accurate and stable. The superiority of the proposed method benefits from the accurate prediction of the vessel bounding box by the anti-occlusion tracking method under occlusion conditions and from the accurate matching based on the trajectory series.

C. Influence of Data Fusion on Vessel Detection and Tracking
In our proposed method, the result of trajectory matching is fed as prior knowledge to the vessel detection and tracking tasks at the next moment to promote the video-based vessel trajectory extraction. To verify that our proposed data fusion method can improve vessel detection and tracking performance, we conduct extensive experiments on ten clips. In particular, we select five different deep neural networks as detectors, i.e., Faster-RCNN [23], SSD [24], YOLOv4 [42], YOLOv5 [43], and YOLOX [28]. Each detector has two versions, i.e., "Detection" and "Detection + Data Fusion". Furthermore, all methods only process data once per second.
1) Evaluation Metric: To evaluate the performance of vessel detection, we select the Precision and Recall as evaluation metrics. Let $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively. The Precision and Recall can be given by

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

For vessel tracking, we use MOTA as an evaluation metric, which can be defined as

$$\mathrm{MOTA} = 1 - \frac{FP + FN + ID_s}{GT},$$

where $FP$, $FN$, $ID_s$, and $GT$ represent the numbers of false positives, false negatives, ID switches, and ground truths, respectively. Furthermore, we also adopt the IDP, IDR, and IDF1 metrics. Theoretically, better detection results have higher Precision and Recall, and better tracking results have higher MOTA, IDP, IDR, and IDF1.
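The detection and tracking metrics can be checked with a few lines of arithmetic. The sketch below assumes the standard definitions Precision = TP / (TP + FP), Recall = TP / (TP + FN), and MOTA = 1 - (FP + FN + IDs) / GT; the function names and the example counts are hypothetical and are not taken from Tables V or VI.

```python
def precision_recall(tp, fp, fn):
    """Detection Precision and Recall from true/false positive and false negative counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("counts must not all be zero")
    return tp / (tp + fp), tp / (tp + fn)


def mota(fp, fn, id_switches, gt):
    """Multi-object tracking accuracy accumulated over all processed moments."""
    if gt <= 0:
        raise ValueError("gt must be positive")
    return 1.0 - (fp + fn + id_switches) / gt


# Hypothetical counts for one clip.
precision, recall = precision_recall(tp=80, fp=20, fn=20)
tracking_accuracy = mota(fp=5, fn=10, id_switches=5, gt=100)
```

Missed detections under occlusion raise FN and therefore lower both Recall and MOTA, while spurious boxes in encounter regions raise FP and lower Precision and MOTA.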
2) Vessel Detection and Tracking on Ten Clips: Table V compares the detection Precision and Recall of various detectors on ten clips. Due to the mutual occlusion between targets, some vessel characteristics are easily hidden by another vessel. Therefore, detectors often suffer from missed detections, resulting in higher $FN$ and poorer Recall. In most cases, detectors are also prone to producing false detection boxes in vessel encounter regions due to the overlapping of multiple vessel features. These false detection boxes produce higher $FP$ and poorer Precision. In contrast, the proposed anti-occlusion method based on data fusion results can improve the performance of various detectors.
To compare the tracking performance, Table VI further shows the MOTA, IDP, IDR, and IDF1 results of various detectors on ten clips. The proposed data fusion method can significantly improve the tracking performance of all five detectors and reduce the number of missed and false detections. The performance improvement benefits from the proposed anti-occlusion method based on data fusion results, which achieves more stable vessel tracking during occlusion.

D. Running Time Analysis
The time complexity of the proposed method is a critical metric, which directly determines whether it can be used in actual engineering. In this work, we only process the data once per second to ensure practicability. Therefore, we are unable to use frames per second (FPS) as an evaluation metric. Meanwhile, since the proposed method considers trajectory features, the time complexity is also related to the number and length of AIS- and video-based vessel trajectories. Consequently, it is also inaccurate to calculate the running time of a single image. Finally, we compute the processing time of one-second data for the ten clips in Table II.
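A per-second timing protocol of this kind can be sketched as follows. The helper names `time_batches` and `summarize` are hypothetical, `process` stands in for one full fusion step on one second of AIS and video data, and the use of the population standard deviation is an assumption, since the text does not state which estimator underlies the reported mean ± std.

```python
import statistics
import time


def time_batches(process, batches):
    """Measure the wall-clock time needed to process each one-second batch."""
    durations = []
    for batch in batches:
        start = time.perf_counter()
        process(batch)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, std) of the per-batch processing times in seconds."""
    return statistics.mean(durations), statistics.pstdev(durations)


# Stand-in workload: each "batch" is just a list of numbers to sum.
durations = time_batches(sum, [[1, 2], [3, 4], [5]])
mean_time, std_time = summarize([0.2, 0.3, 0.4])
```

A mean below one second indicates that each one-second batch is processed before the next one arrives.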

E. Discussion
Although our proposed method adopts the prior knowledge-driven anti-occlusion tracking method and trajectory matching method to effectively improve the accuracy of data fusion, it still has some limitations. In this section, we use the multiple object fusion accuracy (MOFA) and multiple object fusion precision (MOFP) as evaluation metrics, where the MOFP is a variant of the multiple object tracking precision (MOTP) [40] in the data fusion task. The MOFP can be given by

$$\mathrm{MOFP} = \frac{\sum_{t,i} D^{t,i}_{mmsi}}{\sum_{t} N^{t}_{mmsi}},$$

where $D^{t,i}_{mmsi}$ denotes the distance of the i-th MMSI matching pair in the t-th second, and $N^{t}_{mmsi}$ is the number of matches in the t-th second. Theoretically, a better fusion effect has higher MOFA and lower MOFP.
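To make the MOFP definition concrete, the sketch below averages the matching-pair distances over all matches in all seconds, assuming MOFP mirrors the MOTP form (total distance divided by total number of matches). The function name `mofp` and the input layout are hypothetical, and how each distance is measured is left to the caller.

```python
def mofp(distances_per_second):
    """Average distance over all MMSI matching pairs.

    distances_per_second: list indexed by second t; entry t is a list holding
    the distance of each MMSI matching pair in that second (possibly empty).
    """
    total_distance = sum(sum(dists) for dists in distances_per_second)
    total_matches = sum(len(dists) for dists in distances_per_second)
    if total_matches == 0:
        raise ValueError("no matches to evaluate")
    return total_distance / total_matches


# Hypothetical distances: two matches in second 0, none in second 1, one in second 2.
value = mofp([[2.0, 4.0], [], [6.0]])
```

Seconds without any match contribute nothing to either sum, so the metric reflects localization quality only where a fusion result exists.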
Using clip-02 and clip-03 as examples, we compute their MOFA and MOFP. As shown in Table VIII, comparing the MOFA shows that the proposed anti-occlusion tracking method can significantly improve the accuracy of data fusion. However, the DeepSORVF is slightly inferior to the DeepSORVF (w/o) without the anti-occlusion strategy in the bounding box localization precision evaluated by the MOFP. More intuitive comparisons before and after the use of the anti-occlusion strategy are illustrated in Fig. 9.
Our DeepSORVF with the anti-occlusion strategy can predict the vessel position and accurately match the occluded vessel information. However, the predicted bounding boxes still have some degree of bias in complex occlusion conditions. This deviation is mostly attributable to the inaccurate estimation of AIS and visual motion characteristics. When a vessel travels away from the camera, for instance, its visual movement speed generally slows and the object gets smaller. To further improve the vessel anti-occlusion performance, our future work will take into account the changing features of the moving vessels in the visual data.

V. CONCLUSION
In this paper, we proposed a deep learning-based simple online and real-time vessel data fusion method (named DeepSORVF). The DeepSORVF can pair the vessel features of AIS with visual targets. Because reciprocal occlusion between vessel targets may readily interfere with video-based trajectory extraction, we proposed a prior knowledge-driven anti-occlusion tracking method. Meanwhile, a novel asynchronous trajectory matching method was designed for robust data fusion. Comprehensive experiments on vessel detection, vessel tracking, data fusion, and running time analysis have demonstrated the superior performance of our DeepSORVF on the newly developed FVessel dataset.

Fig. 1. The architecture of the proposed deep learning-based simple online and real-time vessel data fusion method (termed DeepSORVF). The DeepSORVF consists of AIS-based vessel trajectory extraction, video-based vessel trajectory extraction, and asynchronous vessel trajectory matching.

Fig. 2. The flowchart of the AIS data processing, which consists of data cleaning, data prediction, and data re-cleaning.

Fig. 3. The flowchart of the anti-occlusion tracking method for video-based vessel trajectory extraction. Note that G is the wide residual network-based appearance feature extractor. The extraction of AIS-based vessel trajectories $T_{ais}$ has been introduced in Section III-A. The generation of Boxes, OAR, $T^{last}_{vis}$, and $F_{id}$ will be mentioned in Section III-B. The generation of AIS/video association results $B^{last}$ will be described in Section III-C.

Fig. 4. Some samples of the FVessel dataset, which contains a large number of images and videos captured in the bridge region and at the riverside under sunny, cloudy, and low-light conditions.

Fig. 5. Visual comparisons of fusion results on the dataset captured by the bridge region camera from Table II. From top to bottom: visual fusion results generated by (a) MSDF [9], (b) MMDST [8], (c) DeepSORVF without the anti-occlusion strategy, and (d) DeepSORVF, respectively.

Fig. 6. Visual comparisons of fusion results on the dataset captured by the riverside camera from Table II. From top to bottom: visual fusion results generated by (a) MSDF [9], (b) MMDST [8], (c) DeepSORVF without the anti-occlusion strategy, and (d) DeepSORVF, respectively.

Fig. 8. Processing time of one-second data on the ten clips from Table II.

Fig. 9. Visual comparisons of fusion results on the dataset from Table II. DeepSORVF (w/o) represents our DeepSORVF without the anti-occlusion strategy.
Algorithm 1: Anti-Occlusion Vessel Tracking
Input:
    OAR: A set of all occlusion areas;
    Boxes: A set of all bounding boxes detected by the YOLOX network;
    T_ais: A set of all AIS-based vessel trajectories at the current moment;
    T_vis^last: A set of all video-based vessel trajectories at the previous moment;
    B^last: A set of all AIS/video association results at the previous moment;
    F_id: A set of all occluded vessel appearance features before the occlusion;
Output:
    T_vis: A set of all video-based vessel trajectories at the current moment;
    ...
    Remove the box_l from Boxes;
 7  for Y_vj in T_vis^last do
 8      for AR in OAR do
 9          if the center point of the bounding box at the previous moment in Y_vj is located in AR then
10              Search the matched [:, v_j] from B^last;
11              if [a_i, v_j] exists then
12                  Predict the bounding box box_vj^pre by AIS-based vessel trajectory X_ai in T_ais;
13              else
14                  Predict the bounding box box_vj^pre by video-based vessel trajectory Y_vj;
15              Add the box_vj^pre to Boxes^pre;
16              break;
17  Update F_id and OAR;
    // Step 3. Anti-occlusion DeepSORT.
18  for box_l in Boxes do
19      Add [box_l, G(box_l)] to I;
20  for box_vj^pre in Boxes^pre do
21      for f_vb in F_id do
22          if v_j = v_b then
23              Add [box_vj^pre, f_vb] to I;
24              break;
25  Run DeepSORT with I;
26  Add the results of DeepSORT to T_vis^last for generating the video-based vessel trajectory at the current moment T_vis;
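The box-prediction loop of Algorithm 1 (lines 7 to 16) can be sketched in Python as below. This is an illustrative reading, not the released DeepSORVF code: the data layouts (dictionaries keyed by track ID and MMSI, boxes as `(x1, y1, x2, y2)` tuples) and the two predictor callables `predict_from_ais` and `predict_from_video` are assumptions, since the listing does not specify how the predictions are computed.

```python
def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def inside(point, area):
    """True if the point lies within a rectangular occlusion area (x1, y1, x2, y2)."""
    px, py = point
    x1, y1, x2, y2 = area
    return x1 <= px <= x2 and y1 <= py <= y2


def predict_occluded_boxes(t_vis_last, oar, b_last, t_ais,
                           predict_from_ais, predict_from_video):
    """Predict boxes for tracks whose last box center falls in an occlusion area.

    t_vis_last: {v_j: [box, ...]} video trajectories at the previous moment.
    oar:        list of occlusion areas.
    b_last:     {v_j: a_i} AIS/video associations at the previous moment.
    t_ais:      {a_i: trajectory} AIS trajectories at the current moment.
    """
    boxes_pre = {}
    for v_j, trajectory in t_vis_last.items():
        last_box = trajectory[-1]
        for area in oar:
            if inside(center(last_box), area):
                a_i = b_last.get(v_j)
                if a_i is not None and a_i in t_ais:
                    # Prefer AIS motion when the track already has an AIS match.
                    boxes_pre[v_j] = predict_from_ais(last_box, t_ais[a_i])
                else:
                    boxes_pre[v_j] = predict_from_video(trajectory)
                break
    return boxes_pre


def shift_by_ais(box, ais_traj):
    """Hypothetical predictor: shift the box by the last AIS pixel displacement."""
    dx = ais_traj[-1][0] - ais_traj[-2][0]
    dy = ais_traj[-1][1] - ais_traj[-2][1]
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


result = predict_occluded_boxes(
    t_vis_last={1: [(0, 0, 10, 10), (10, 0, 20, 10)], 2: [(100, 100, 110, 110)]},
    oar=[(5, -5, 30, 15)],
    b_last={1: "A"},
    t_ais={"A": [(0, 0), (8, 0)]},
    predict_from_ais=shift_by_ais,
    predict_from_video=lambda trajectory: trajectory[-1],
)
```

In this toy input, track 1 lies inside the occlusion area and has an AIS association, so its box is shifted by the AIS displacement, while track 2 is outside every occlusion area and receives no predicted box.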

Algorithm 2: Asynchronous Trajectory Matching
Input:
    T_ais: A set of all AIS-based vessel trajectories;
    T_vis: A set of all video-based vessel trajectories;
    M^last: A set of all AIS/video numbers of matches at the previous moment;
    B^last: A set of all AIS/video association results at the previous moment;
Output:
    M: A set of all AIS/video numbers of matches at the current moment;
    B: A set of all AIS/video association results at the current moment;
 1  Initialization:
        d(i, j): The Euclidean distance between the last trajectory points of X_ai and Y_vj;
        M_s: An empty trajectory similarity matrix;
        O_res: An empty set to save the matching results;
        D_max: The maximum matching distance;
        Mat_min: The minimum number of matches;
        T_max: The maximum time threshold;
        S: The E-FastDTW trajectory similarity measurement operator;
    // Step 1. Trajectory similarity measure.
 2  for X_ai in T_ais do
 3      for Y_vj in T_vis do
            ...
 6          else if [a_i, :] or [:, v_j] in B^last then
 7              if [a_i, v_j] in B^last then
                    ...
13  Use the Hungarian algorithm on M_s to obtain the matching result O_res = {..., [a_i, v_j], ...};
14  for [a_i, v_j] in O_res do
15      if z_ai,vj in M^last then
16          Add z_ai,vj = z_ai,vj + 1 to M;
17      else
18          Add z_ai,vj = 1 to M;
19  for z_ai,vj in M^last do
20      if [a_i, v_j] not in O_res and the time interval between the last matching moment of [a_i, v_j] and the current moment < T_max then
21          Add z_ai,vj to M;
    // Step 3. Association result generation.
22  for z_ai,vj in M do
23      if z_ai,vj > Mat_min then
24          Add [a_i, v_j] to B;
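The match-count bookkeeping in lines 14 to 24 of Algorithm 2 can be sketched as follows. The function name `update_association`, the dictionary layouts, and the explicit `last_seen` timestamp map are assumptions introduced for illustration; the matching result `o_res` is taken as given here, as if it had already been produced by the assignment step on the similarity matrix.

```python
def update_association(o_res, m_last, last_seen, now, t_max, mat_min):
    """Update match counts and derive the AIS/video association result.

    o_res:     list of (a_i, v_j) pairs matched at the current moment.
    m_last:    {(a_i, v_j): count} numbers of matches at the previous moment.
    last_seen: {(a_i, v_j): time of the last match}; updated in place.
    now:       current time in seconds.
    t_max:     maximum time a pair may stay unmatched before it is dropped.
    mat_min:   minimum number of matches required for an association.
    """
    m = {}
    # Pairs matched now: increment an existing count or start at 1.
    for pair in o_res:
        m[pair] = m_last.get(pair, 0) + 1
        last_seen[pair] = now
    # Pairs not matched now: keep the count only if matched recently enough.
    for pair, count in m_last.items():
        if pair not in m and pair in last_seen and now - last_seen[pair] < t_max:
            m[pair] = count
    # An association is reported only after enough accumulated matches.
    b = [pair for pair, count in m.items() if count > mat_min]
    return m, b


m, b = update_association(
    o_res=[("A", 1), ("D", 4)],
    m_last={("A", 1): 3, ("B", 2): 5, ("C", 3): 2},
    last_seen={("A", 1): 9, ("B", 2): 9, ("C", 3): 1},
    now=10, t_max=5, mat_min=3,
)
```

Retaining recently matched pairs for up to `t_max` seconds is what lets an association survive a short gap, for example when AIS messages arrive asynchronously or a vessel is briefly occluded.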

TABLE I. DETAILS OF THE FVESSEL DATASET. THE "TOO", "NOV", AND "NOA" ARE THE TIMES OF OCCLUSIONS, THE TOTAL NUMBER OF VESSELS, AND THE NUMBER OF VESSELS WITH AIS, RESPECTIVELY.

TABLE II. DETAILS OF THE DATASET USED IN THE VESSEL DETECTION, VESSEL TRACKING, AND DATA FUSION EXPERIMENTS. THE "TOO", "NOV", AND "NOA" ARE THE TIMES OF OCCLUSIONS, THE TOTAL NUMBER OF VESSELS, AND THE NUMBER OF VESSELS WITH AIS, RESPECTIVELY.

TABLE III. MOFA, IDP, IDR, AND IDF1 RESULTS OF DATA FUSION FOR THE TEN CLIPS FROM TABLE II. (UNIT: %)

TABLE IV. MOFA, IDP, IDR, AND IDF1 RESULTS OF DATA FUSION FOR THE FVESSEL FROM TABLE I. (UNIT: %)

TABLE V. PRECISION AND RECALL RESULTS OF VESSEL DETECTION FOR THE TEN CLIPS FROM TABLE II. (UNIT: %)

TABLE VI. MOTA, IDP, IDR, AND IDF1 RESULTS OF VESSEL TRACKING FOR THE TEN CLIPS FROM TABLE II. (UNIT: %)

TABLE VII. PROCESSING TIME OF ONE-SECOND DATA (MEAN ± STD) ON THE TEN CLIPS FROM TABLE II. (UNIT: SEC.)

The processing time of our method for each clip is shown in Fig. 8 and Table VII. It can be seen that our DeepSORVF has low time complexity and high practicability. It can process one second of data in 0.175-0.500 seconds, and in 0.2562 seconds on average.

TABLE VIII. MOFA (%) AND MOFP RESULTS OF DATA FUSION FOR THE CLIP-02 AND CLIP-03 FROM TABLE II. DEEPSORVF (W/O) REPRESENTS OUR DEEPSORVF WITHOUT THE ANTI-OCCLUSION STRATEGY.