iOceanSee: A Novel Scheme for Ocean State Estimation Using 3D Mobile Convolutional Neural Network

Ocean state estimation is a fundamental problem in ocean engineering. In the data-driven era, the development of intelligent ship decision-making, ocean energy system design, and related applications all depend on the estimation of wave parameters over an ocean area. In recent years, researchers have developed remote sensing technologies to monitor ocean waves. However, sensor-based methods share a key limitation: high cost and susceptibility to faults. More importantly, current research suffers from a major shortcoming: because it lacks temporal change information and relies on a single feature of spatial data, the final predictions are inaccurate. Adopting a 3D Convolutional Neural Network is one possible way to improve detection accuracy. Unfortunately, it cannot be deployed directly in the ocean environment, where physical network connections are unavailable. To resolve these issues, we develop a lightweight version of the 3D Convolutional Neural Network: a low-cost, high-accuracy detection scheme, called iOceanSee, that estimates ocean wave parameters in the marine environment using a 3D Mobile Convolutional Neural Network. iOceanSee employs a mobile terminal composed of low-cost measuring equipment and a non-intrusive (except for lighting requirements) RGB camera to collect video data in real time. It extracts both spatial and temporal features through three-dimensional depthwise separable convolutions. More specifically, iOceanSee captures the encoded motion information from multiple adjacent video frames, from which the period and height of the waves are evaluated. Our experimental results show that iOceanSee achieves performance comparable to a full 3D Convolutional Neural Network and outperforms other models in measurement accuracy in marine environments.

The estimation of ocean wave parameters plays an important role in many marine engineering areas, such as the wave energy conversion [1] of marine energy management. For example, wave height, period, level, and energy conversion are all current research hotspots.
As a renewable energy source, ocean energy converters have been studied by many scholars. Yin [2] designed a novel kite-type ocean energy converter using current and wave energy, which can generate electricity under current and wave conditions at different ocean current velocities, wave periods, and heights. Xia et al. [3] proposed a quasi-Halbach magnetized field-modulated tubular linear generator (FMTLG) for wave energy conversion. The simulation results show that the FMTLG is superior to the TLPMG in back-EMF and other aspects. Recent years have seen increasing research activity on wave height. Alexandre et al. [4] proposed a hybrid genetic-algorithm-based extreme learning machine method to reconstruct significant waves. Considering the spatial correlation between the position values of adjacent buoys, they solved the problem of local reconstruction for failed buoys by utilizing the wave parameters of adjacent buoys. Salcedo-Sanz et al. [5] applied the Support Vector Regression (SVR) methodology to estimate the significant wave height from X-band marine radar images of the sea surface via the shadowing effect, obtaining good wave height predictions. Ti et al. [6] estimated the significant wave heights near shore using a response-surface prediction equation. In cases where near-shore measurements are inadequate or difficult to acquire, they used a numerical wave model and a semi-analytical fitting model derived from the conservation equation of wave energy flux to fit the response surface of near-shore wave height, establishing the prediction equation and predicting the significant wave height.
Unfortunately, most of these research methods, such as X-band wave radar and meteorological remote sensing satellites, suffer from high cost and accuracy limitations. Specifically, traditional ocean state estimation techniques usually include manual observation, wave buoys, X-band wave radars, and meteorological remote sensing satellites or weather forecasts [7]. Manual observation is strongly subjective. Wave buoys offer a certain reliability, but their ability to withstand wind and waves is weak. X-band radar technology is not only expensive but also requires regular calibration and maintenance. Meteorological remote sensing satellites are sensitive to clouds. In short, sensor-based methods share a key limitation: high cost and susceptibility to faults.
To overcome the shortcomings of traditional estimation methods, researchers have carried out extensive exploration. Estimation methods for ocean parameters can be divided into model-based and non-model-based methods. Model-based methods mainly use domain knowledge to build mathematical models. However, because of their heavy dependence on mathematical models and the corresponding assumptions, these methods are prone to false recognition under the dynamic influence of the marine environment. Non-model-based methods mainly use traditional machine learning or deep learning techniques to extract time- and frequency-domain features. The advantage of this kind of method is that it does not depend on prior domain knowledge and is easier to learn. Comparing the two types of approaches, we therefore choose a model-free method. Many scholars have studied wave height, but studies on wave period remain few.
In this paper, we study both of these ocean parameters. To improve the estimation accuracy, we must consider both temporal and spatial information. Existing methods lack change information over time and rely exclusively on a single feature of spatial data, so their final predictions are less accurate. Adopting a 3D Convolutional Neural Network is one possible way to improve detection accuracy: a 3D convolutional neural network (3D CNN) can extract the temporal and spatial features of a video from its consecutive frames, thereby capturing object features better. However, video data is inherently high-dimensional and demands substantial storage space. Unfortunately, a full 3D CNN cannot be deployed in the ocean environment, where physical network connections are unavailable. In addition, a 3D CNN carries much unneeded information, which leads to non-negligible compute and storage overhead. To resolve these issues, we develop a lightweight version of the 3D Convolutional Neural Network: a low-cost, high-accuracy detection scheme, called iOceanSee, that estimates ocean wave parameters in the marine environment using a 3D Mobile Convolutional Neural Network. iOceanSee employs a mobile terminal composed of low-cost measuring equipment and a non-intrusive (except for lighting requirements) RGB camera to collect video data in real time. It extracts both spatial and temporal features of the wave-related data through three-dimensional depthwise separable convolutions. More specifically, iOceanSee captures the encoded motion information from multiple adjacent video frames, from which the period and height of the waves are evaluated. iOceanSee deploys the measuring equipment and an RGB camera on the vessels YUMING and XIANGYANGHONG81 to collect real-time video data while simultaneously measuring the height and period of the waves with an underwater pressure-type wave meter.
In the training phase, we take the collected video data as our training data. Using it as input, we train our model with the online error back-propagation algorithm described in [8] on NVIDIA TITAN XP GPUs. We extract information in both space and time, obtain the final feature representation by convolution and subsampling of adjacent video frames, and evaluate the wave height and period from the training results. The output values are compared with the actual measured values and with the results obtained by other methods to evaluate the performance of the model. The experimental results show that iOceanSee achieves near-optimal accuracy relative to the 3D CNN model and is superior to the other baseline methods. In more detail, we train and optimize the 3D MCNN using a multi-task loss function. In the inference phase, the input to the network is the real-time video data acquired by the RGB camera in the real-life marine environment, and the network output is the ocean wave parameters predicted by our model. The schematic workflow of the system is illustrated in Figure 1.
The rest of this paper is organized as follows. The next section outlines the related work. Section 3 describes the three-dimensional convolution operation used in the wave parameter detection scheme, the proposed model, and implementation details. Section 4 presents the ocean wave dataset, experimental results and discussion, evaluation metrics, and model choices. We conclude in Section 5.

II. RELATED WORK
Although there are many research works on wave parameter prediction, their results are not accurate, since most studies employ only a single kind of input data for prediction. To accurately predict or numerically calculate wave height and period, multiple eigenvalues need to be taken into consideration. Recent years have seen several new methods successfully exploited. For example, video recognition technology has been widely used in many fields, such as text recognition [9], face recognition [10], medical symptom recognition [11], human behavior re-identification [12], [13], and crack identification [14]. As the core engine of video recognition technology, a deep learning model learns a feature hierarchy by constructing high-level features from low-level features. It adopts supervised or unsupervised learning, adjusts the parameters of the system according to the output, and automatically learns features from the input data.
The convolutional neural network [15], [16], as a deep learning model, alternately applies trained filters and local neighborhood pooling operations to the original input image, and is mainly used for two-dimensional images. With the rapid development of artificial intelligence, the two-dimensional convolutional neural network has matured in the field of computer vision, and the three-dimensional convolutional neural network has made major breakthroughs. A two-dimensional convolutional neural network can only perform static analysis of two-dimensional images, so it extracts spatial features alone. In contrast, a three-dimensional convolutional neural network can extract not only spatial but also temporal features: it extracts spatio-temporal features from consecutive frames of a video, capturing object features better. Many scholars have studied 3D CNNs. Ji et al. [17] proposed a three-dimensional CNN action recognition model, which was successfully applied to real airport surveillance video to identify human behavior. Li et al. [18] employed Gabor filters to analyze actions in video data, validated the model on the KTH dataset, and obtained good results. Ren et al. [19] combined several three-dimensional convolutional neural networks to realize automatic segmentation of small tissues in head and neck CT images. Ceschin et al. [20] developed a computational framework for automatic classification of brain dysplasia from neonatal MRI, which used a three-dimensional convolutional neural network to classify cerebellar dysplasia in full-term infants with congenital heart disease, and designed a program that can be extended to both internal and external brain imaging of neonates. Zhang et al. [21] proposed a new framework of deep three-dimensional convolutional neural network (3D-CNN) feature learning to process the features of mechanical-part CAD models, and showed through numerical experiments that the feature network can significantly improve the state of the art in manufacturing feature detection.
In recent years, many scholars have explored how to employ machine learning techniques to study wave parameters. Oh and Suh [22] developed an EOFWNN model, which takes into account the relationship between the spatial distribution of meteorological variables and waves, and predicts wave heights at multiple locations. Compared with the previous wavelet neural network model, this model trains on the observed wave data at each location, which is convenient and improves prediction accuracy. Cornejo-Bueno et al. [23] proposed a wave parameter estimation method based on a regression genetic fuzzy system (GFS, FRULER) to predict wave height and energy. Jongchul and Dongseob [24] used a CCD camera to monitor wave parameters. They compared the wave information collected with the Hu-moments algorithm at Gyoam Beach, South Korea, with that obtained by an underwater pressure wave instrument, and concluded that the method can detect the wave period with only a slight bias. The main details of these algorithms are as follows.

A. EOF-WAVELET AND NEURAL NETWORK HYBRID MODELS
Oh and Suh [22] combined empirical orthogonal function (EOF) analysis and wavelet analysis with a neural network to formulate a hybrid model, EOFWNN. By considering the relationship between waves and meteorological variables, they forecasted the significant wave heights for various lead times at eight wave stations in the coastal waters of the East/Japan Sea. In that study, EOF analysis is introduced to explain the relationship between waves and meteorological data and to enable the model to predict waves at multiple stations simultaneously. The spatiotemporal data is decomposed by EOFs as follows:

$$T(x, t) = \sum_{n} B_n(x) \, T_n(t),$$

where $\{B_n(x)\}$ are called the eigenfunctions, and $\{T_n(t)\}$ are the expansion coefficients, which are functions of time.
$T(x, t)$ is called an EOF decomposition only if the $\{B_n(x)\}$ are mutually orthogonal and the $\{T_n(t)\}$ are mutually uncorrelated, that is,

$$\sum_{x} B_n(x) B_m(x) = \delta_{nm}, \qquad \overline{T_n(t) \, T_m(t)} = \lambda_n \, \delta_{nm},$$

where $\delta_{nm}$ is the Kronecker delta and $\lambda_n$ are the eigenvalues. The $\{B_n(x)\}$ are often called the loading vectors, and the $\{T_n(t)\}$ represent the principal component (PC) time series.
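In practice, the EOF loading vectors and PC time series can be obtained from a data matrix via the singular value decomposition. The following is a minimal NumPy sketch (the function name and the time-by-space layout are illustrative choices, not from the original study):

```python
import numpy as np

def eof_decompose(T, n_modes):
    """Decompose a space-time field T (time x space) into EOF modes,
    so that T - mean ~ sum_n T_n(t) * B_n(x)."""
    T_anom = T - T.mean(axis=0)           # remove the time mean at each location
    U, s, Vt = np.linalg.svd(T_anom, full_matrices=False)
    B = Vt[:n_modes]                      # loading vectors B_n(x), mutually orthogonal
    pcs = U[:, :n_modes] * s[:n_modes]    # PC time series T_n(t), mutually uncorrelated
    return B, pcs

# toy example: 50 time steps at 8 stations
rng = np.random.default_rng(0)
field = rng.normal(size=(50, 8))
B, pcs = eof_decompose(field, n_modes=3)
gram = B @ B.T                            # ~ identity: eigenfunctions are orthogonal
```

The orthogonality of the loading vectors and the uncorrelatedness of the PC time series follow directly from the SVD, mirroring the conditions above.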

B. FUZZY RULE LEARNING THROUGH EVOLUTION FOR REGRESSION
Cornejo-Bueno et al. [23] employed a genetic fuzzy system to estimate significant wave height and energy flux. It consists of a three-stage algorithm that combines an instance selection method for regression, a multi-granularity fuzzy discretization of the input variables, and an evolutionary algorithm to generate accurate and simple Takagi-Sugeno-Kang (TSK) fuzzy rules. The first two components of FRULER form a two-stage preprocessing step that improves the accuracy and simplicity of the fuzzy rules obtained by the evolutionary algorithm. The evolutionary algorithm searches for the best database configuration using the obtained fuzzy partitions and takes advantage of the Elastic Net method, which combines the $\ell_1$ (Lasso) and $\ell_2$ (Ridge) penalties to automatically learn the importance weight of each variable, minimizing the following equation:

$$\min_{\beta} \; \|Y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right),$$

where $\beta$ is the coefficient vector, $Y$ is the output vector, $X$ is the input matrix, $\lambda$ is the regularization parameter, and $\alpha$ represents the trade-off between the $\ell_1$ and $\ell_2$ penalties. Finally, a robust and very accurate prediction of the two target variables is obtained.
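For illustration, the Elastic Net objective can be evaluated with a few lines of NumPy (a sketch of the objective only, not of FRULER's full training procedure; names and the exact scaling convention are illustrative):

```python
import numpy as np

def elastic_net_objective(beta, X, Y, lam, alpha):
    """Squared error plus a blend of l1 (Lasso) and l2 (Ridge)
    penalties; alpha in [0, 1] trades off the two penalties."""
    residual = Y - X @ beta
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return residual @ residual + lam * (alpha * l1 + (1 - alpha) * l2)

# zero residual, so only the penalty term remains:
obj = elastic_net_objective(np.array([1.0, 0.0]), np.eye(2),
                            np.array([1.0, 0.0]), lam=1.0, alpha=0.5)
```

Because the $\ell_1$ term penalizes each nonzero coefficient, minimizing this objective drives unimportant variables toward exactly zero, which is how the method learns per-variable importance weights.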

C. HU-MOMENTS ALGORITHM
Jongchul and Dongseob [24] proposed a video-based beach monitoring system. Because the change of waves arriving in shallow water is highly discontinuous in CCD image clips, they use the Hu-moments algorithm to detect the wave period with weak offset, thus clarifying the complex characteristics of the wave distribution and its time variation. They first extract the outline of each video patch from data captured at Gyoam Beach, and then use the seven Hu-moments to estimate the similarity matrix of the shapes in video patches $V_i$ and $V_j$ from different frames. Here $V_i$ ($i = 1, 2, \ldots, N$) is a sample patch of $N$ video frames, which is converted to an edge map $F_i$ using the Canny edge detector, and the log-scaled moments

$$C_g^i = \operatorname{sgn}\left(h_g^i\right) \log_{10}\left|h_g^i\right|$$

are computed from the seven Hu-moments $h_g^i$ ($g = 1, \ldots, 7$), which can be calculated from the central moments up to the third order.
After that, the radial symmetry kernel is used to represent the period and height of the wave.
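The log-scaling of the Hu moments can be sketched as follows. The similarity measure shown (an absolute difference in log-Hu space) is one plausible illustrative choice, not necessarily the exact matrix used in [24]:

```python
import numpy as np

def log_scale_hu(h):
    """C_g = sgn(h_g) * log10|h_g|: raw Hu moments span many orders
    of magnitude, so the log transform makes them comparable."""
    h = np.asarray(h, dtype=float)
    return np.sign(h) * np.log10(np.abs(h) + 1e-30)  # epsilon guards log10(0)

def shape_distance(C_i, C_j):
    """Distance between two patches in log-Hu space; smaller values
    mean more similar wave-front shapes (illustrative metric)."""
    return float(np.sum(np.abs(np.asarray(C_i) - np.asarray(C_j))))
```

Identical shapes have distance zero, so repeating wave fronts show up as periodic minima in the similarity matrix, from which the wave period can be read off.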

III. PROPOSED WORK
The detection of wave parameters has always been a key problem in marine engineering applications. In recent years, researchers have developed techniques for monitoring wave parameters using remote sensing, but sensor-based methods all share a key limitation: high cost and susceptibility to faults. In addition, most solutions rely on spatial data, i.e., a single feature, to design their predictive mechanism, thereby overlooking change information over time. This leads to inaccurate final results, which may violate the user-level agreement. It may be noted that, in the actual marine environment, due to strong dynamicity, change information within a continuous time window is of great importance. Both temporal and spatial information must be taken into consideration to accurately detect or predict the height and period of ocean waves. iOceanSee captures multi-dimensional features of real-time wave data and realizes a wave parameter detection scheme based on a 3D mobile convolutional neural network. Compared with sensor-based methods, a low-cost NVIDIA Jetson TX2 and a non-intrusive (except for lighting requirements) RGB camera are employed to collect video data, which is fed, together with the motion information of the video frames, into the iOceanSee inference model. Given both temporal and spatial information, iOceanSee is able to precisely forecast the period and height of the waves. As a feasible working scheme, iOceanSee employs depthwise separable convolution to construct a lightweight network and reduce computational overhead. In short, we develop a low-cost, high-accuracy detection scheme named iOceanSee to track and predict ocean wave parameters using a 3D Mobile Convolutional Neural Network. The overall framework of the proposed model is shown in Figure 2.

A. 3D CONVOLUTION NEURAL NETWORK
In two-dimensional convolution, since the input consists entirely of images, features can only be computed in the spatial dimensions; the time dimension cannot be convolved. When analyzing problems with video, it is necessary to capture motion information across multiple consecutive frames. To this end, Ji et al. [17] proposed three-dimensional convolution in the convolution layer, taking into account both the spatial and temporal dimensions. Multiple consecutive video frames are stacked to form an input cube. The cube is convolved with a three-dimensional kernel, the corresponding bias is added, and the result, after passing through the activation function, becomes the input of the next layer. This completes the three-dimensional convolution process. Each feature map is connected to several adjacent frames in the previous layer, and hence the motion information is captured.
Formally, the value at position $(x, y, z)$ on the $j$th feature map in the $i$th layer is denoted $v_{ij}^{xyz}$ and is given by

$$v_{ij}^{xyz} = \tanh\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right),$$

where $\tanh(\cdot)$ is the hyperbolic tangent function, $b_{ij}$ is the bias for this feature map, $m$ indexes the set of feature maps in the $(i-1)$th layer connected to the current feature map, $w_{ijm}^{pqr}$ is the value at position $(p, q, r)$ of the kernel connected to the $m$th feature map in the previous layer, $P_i$ and $Q_i$ are the height and width of the kernel, and $R_i$ is the size of the 3D kernel along the temporal dimension, i.e., the thickness of the convolution kernel.
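A direct, unoptimized transcription of this formula may help make the index bookkeeping concrete. This NumPy sketch computes the value of a single output position (illustrative only, not the implementation used in iOceanSee):

```python
import numpy as np

def conv3d_value(prev_maps, kernels, b, x, y, z):
    """Output value at (x, y, z) of one feature map: tanh of the bias
    plus the weighted sum over connected input maps m and kernel
    offsets (p, q, r), as in the 3D-convolution formula.

    prev_maps: list of (H, W, T) arrays from layer i-1
    kernels:   list of (P, Q, R) kernels, one per connected input map
    """
    P, Q, R = kernels[0].shape
    total = b
    for m, v_prev in enumerate(prev_maps):
        for p in range(P):
            for q in range(Q):
                for r in range(R):
                    total += kernels[m][p, q, r] * v_prev[x + p, y + q, z + r]
    return np.tanh(total)

# a 1x1x1 kernel of ones over an all-ones input gives tanh(1)
val = conv3d_value([np.ones((2, 2, 2))], [np.ones((1, 1, 1))], 0.0, 0, 0, 0)
```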

B. OVERALL FRAMEWORK
The whole scheme realizes real-time monitoring of wave height and period in the marine environment. It uses a mobile terminal composed of an NVIDIA Jetson TX2 and a camera to monitor the waves in real time and collect the corresponding video data, which is then used to train the network offline. The data acquisition setup shown in Figure 1 comprises an NVIDIA Jetson TX2, an RGB camera (1080 x 1920 pixels at 30 frames per second), a computer, and NVIDIA TITAN XP GPUs. The RGB camera is connected to the NVIDIA Jetson TX2 via a USB interface. The network runs on the NVIDIA Jetson TX2, while the NVIDIA TITAN XP GPUs are used for training. The ground-truth parameters are measured by an underwater pressure-type wave meter.
In the actual marine environment, changes in wave parameters across regions are often affected by natural factors such as strong winds, which makes real-time monitoring challenging. Nevertheless, the detection of ocean wave parameters often has a non-negligible impact on the development of other applications in marine engineering. In the marine environment, the video signal we collect is three-dimensional, and employing its spatio-temporal information directly is an effective means of identification. Therefore, it is natural to apply a three-dimensional convolutional neural network to identify wave parameters.
We propose a three-dimensional mobile convolutional neural network model for ocean wave parameter detection, using an architectural design similar to the well-known 3D CNN [17], as shown in Figure 2. Inspired by [25], we employ depthwise separable convolution to construct a lightweight network and reduce computational overhead. The details are as follows.
As shown in Figure 2, seven frames of size 60×40, centered on the current frame and selected from the video, are chosen as inputs to the 3D MCNN model. A set of hardwired kernels is applied to generate multiple channels of information from the input frames, which yields 33 feature maps in the second layer (denoted H1) across five different channels: gray, gradient-x, gradient-y, optflow-x, and optflow-y. The gray channel contains the gray pixel values of the seven input frames. The feature maps in the gradient-x and gradient-y channels are obtained by computing gradients along the horizontal and vertical directions on each of the seven input frames, and the optflow-x and optflow-y channels contain the optical flow fields along the horizontal and vertical directions, respectively, computed from adjacent input frames. The hardwired layer encodes our prior knowledge about features, and this scheme usually yields better performance than random initialization.
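The gray and gradient channels of the hardwired layer can be sketched with NumPy as below (function names are illustrative). The optical-flow channels, computed from consecutive frame pairs (e.g. with OpenCV's calcOpticalFlowFarneback), are omitted for brevity; with 7 frames the channel counts are 7 + 7 + 7 + 6 + 6 = 33 feature maps:

```python
import numpy as np

def hardwired_channels(frames):
    """Hand-crafted channels for a stack of grayscale frames (T, H, W):
    gray values plus per-frame gradients along y and x."""
    gray = frames.astype(float)
    grad_y, grad_x = np.gradient(gray, axis=(1, 2))  # vertical, horizontal
    return {"gray": gray, "gradient-x": grad_x, "gradient-y": grad_y}

# 7 frames of 60x40, matching the model input
frames = np.random.default_rng(1).random((7, 60, 40))
channels = hardwired_channels(frames)
```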
Next, 3D convolutions with a kernel size of 7 × 7 × 3 (7 × 7 in the spatial dimensions and 3 in the temporal dimension) are applied to each of the five channels separately. We keep the same temporal dimension as before and apply separable convolution only in the spatial dimensions.
Likewise, we apply 3D convolution in the next convolution layer, Convolution4 (abbreviated C4 in Figure 2), with a kernel size of 7 × 6 × 3 (7 × 6 in the spatial dimensions and 3 in the temporal dimension). Consistent with the earlier layers, separable convolution is applied only in the spatial dimensions. A standard convolution layer of size 7 × 6 × 3 × 30 × 54 is replaced by a depthwise convolution layer of size 7 × 6 × 3 × 30 and a pointwise convolution layer of size 1 × 1 × 3 × 30 × 54; similarly, a standard convolution layer of size 7 × 6 × 3 × 16 × 24 is replaced by a depthwise convolution layer of size 7 × 6 × 3 × 16 and a pointwise convolution layer of size 1 × 1 × 3 × 16 × 24. Together they generate a total of 78 feature maps; the details are listed in Table 1, where DeepwiseConv1 and Conv1 denote a matching pair of depthwise and pointwise convolutions, and likewise for DeepwiseConv2 and Conv2. The next layer, S5, is obtained by applying 3 × 3 subsampling to each feature map in the C4 layer, giving the same number of feature maps at a reduced spatial resolution. At this stage, the size of the temporal dimension is already relatively small (3 for gray, gradient-x, and gradient-y, and 2 for optflow-x and optflow-y), so we convolve only in the spatial dimensions at this layer. Because the convolution kernel used is of size 7 × 4, the output feature maps are reduced to size 1 × 1. The C6 layer consists of 128 feature maps of size 1 × 1, each of which is connected to all 78 feature maps in the S5 layer.
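The parameter savings from this factorization can be checked by simple counting, using the C4-stage sizes quoted above (a back-of-the-envelope sketch; bias terms are ignored):

```python
# kernel 7x6 (spatial) x 3 (temporal), 30 input maps, 54 output maps
def standard_params(kh, kw, kt, c_in, c_out):
    # every output map convolves every input map with a full 3D kernel
    return kh * kw * kt * c_in * c_out

def separable_params(kh, kw, kt, c_in, c_out):
    depthwise = kh * kw * kt * c_in        # one kernel per input map
    pointwise = 1 * 1 * kt * c_in * c_out  # 1x1 spatial mixing across maps
    return depthwise + pointwise

std = standard_params(7, 6, 3, 30, 54)   # 204120 weights
sep = separable_params(7, 6, 3, 30, 54)  # 8640 weights, roughly 23.6x fewer
```

This is the arithmetic behind the reduced model size reported later: the spatial filtering and the cross-channel mixing are learned separately rather than jointly.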
After the multilayer convolution and subsampling, the seven input frames have been converted into a 128D feature vector capturing the motion information in the input frames. The output layer has the same number of units as the number of wave features, and each unit is fully connected to each of the 128 units in the C6 layer. In this design, we essentially apply a linear regressor to the 128D feature vector for wave parameter estimation.
Our 3D MCNN model also contains two fully connected layers (FC1, FC2), with 2048 and 512 neurons, respectively. We have two main tasks: wave period estimation and wave height estimation. All trainable parameters in the model are initialized randomly and trained with the online error back-propagation algorithm described in [8], minimizing the loss.
It is worth noting that the lightweight factorization applies only to the two-dimensional spatial domain. For a three-dimensional convolutional neural network, the temporal characteristics must also be taken into consideration. During the splitting process, the time dimension remains unchanged, which transforms the original 3D convolution kernel into multiple lightweight 3D cubes while the output preserves the original temporal information.

C. IMPLEMENTATION DETAILS
We train the network by minimizing a joint multi-task loss:

$$L(\Theta) = \alpha \sum_{i=1}^{N} L_P\left(F_P(X_i; \Theta), P_i\right) + \beta \sum_{i=1}^{N} L_H\left(F_H(X_i; \Theta), H_i\right), \tag{9}$$

where $\Theta$ is the parameter set of the 3D MCNN model and $N$ is the number of training samples. $L_P$ is the loss between the estimated period $F_P(X_i; \Theta)$ and the true value $P_i$ measured by the wave meter; similarly, $L_H$ is the loss between the estimated wave height $F_H(X_i; \Theta)$ and the true value $H_i$. We adopt the Relative Euclidean Distance for both objective losses and use back-propagation to minimize the loss.
We use a linear multi-task iterative training process to train the two tasks: the input data is trained in batches to minimize $L_P$ and $L_H$, with $\alpha$ and $\beta$ controlling the proportional weights of the two loss functions. Through multiple training iterations, $L(\Theta)$ is minimized.
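For illustration, the weighted multi-task objective can be sketched in NumPy as follows (the exact normalization of the Relative Euclidean Distance shown here is one plausible choice; function names are illustrative):

```python
import numpy as np

def relative_euclidean(pred, true, eps=1e-8):
    """Relative Euclidean distance between predictions and targets."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.linalg.norm(pred - true) / (np.linalg.norm(true) + eps)

def joint_loss(period_pred, period_true, height_pred, height_true,
               alpha=0.5, beta=0.5):
    """Weighted multi-task loss: alpha * L_P + beta * L_H, where alpha
    and beta are the proportional weights of the two tasks."""
    return (alpha * relative_euclidean(period_pred, period_true)
            + beta * relative_euclidean(height_pred, height_true))

loss = joint_loss([5.2], [5.2], [0.48], [0.48])  # perfect predictions -> 0.0
```

Because the distance is relative, the period loss (seconds) and the height loss (meters) are on comparable scales before the weights are applied.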

Algorithm 1 Training Process of Our Model
Input: Size-normalized patches with period and height values from the whole training data
Output: Parameters $\Theta$ of the 3D MCNN model
1: Initialize each parameter randomly
2: t = 1
3: while t ≤ T do
4:   Use back-propagation to learn $\Theta$ until the joint loss function (Equation 9) is minimized
5: end while

We compare the results after training with those measured by the underwater pressure-type wave meters and perform the corresponding error analysis. Furthermore, the results are also compared with EOFWNN, FRULER, HM-CCD, and 3D CNN to evaluate measurement performance.

IV. EXPERIMENTS AND RESULTS
We collected wave video data aboard YUMING and XIANGYANGHONG81 on March 6, 10, 14, 18, 22, 26, and 30, and April 3, 7, 11, and 15, 2019, and used OpenCV to preprocess the wave videos, cropping them to the input size of our algorithm. We then manually divided the wave video dataset into a training set and a test set, labeling the data with the underwater pressure wave meter. The video data serves as the input of the 3D MCNN network, and the wave period and wave height are the outputs. We use the training set to train our model and the test set to verify its performance and generalization ability. Finally, we compare it with other algorithms, including EOFWNN, FRULER, HM-CCD, and 3D CNN. The experimental results show that the proposed 3D MCNN algorithm outperforms all algorithms except 3D CNN in predicting wave period and wave height. Although our 3D MCNN is slightly inferior to 3D CNN in measurement accuracy, its number of parameters is greatly reduced.

A. OCEAN WAVE DATASET
Wave video data acquisition was carried out at 8 stations on YUMING and XIANGYANGHONG81. These data were used to train and evaluate our algorithm. The locations of the data acquisition sites are shown in Figure 3.
We divide the dataset collected at the different sites into two parts: a training set and a test set, both drawn from all 8 sites. The training set consists of 184 video sequences of 1 minute each, and the test set contains 11 video sequences of 30 minutes each. Figure 4 shows some example frames of the wave dataset on XIANGYANGHONG81 and YUMING.
The proposed 3D MCNN algorithm can automatically extract the motion information between frames of an ocean wave video. Compared with single-frame image processing algorithms, the 3D MCNN and 3D CNN algorithms can not only accurately calculate the wave height but also accurately estimate the wave period. We train the different algorithms on the datasets collected from the YUMING and XIANGYANGHONG81 vessels, then use each trained model to calculate the period and height of the waves on the test set, and compare the accuracy of each algorithm. Figures 5 and 6 show, respectively, the measured average ocean wave period and height of the wave video data collected at the eight stations on YUMING and XIANGYANGHONG81, together with the values predicted by the various models. Both figures cover 11 days of video data. In the figures, the blue curves represent the actual measurements, and the estimates of each model are represented by points of different colors. Each segment represents the value measured by the wave meter and the value calculated by each algorithm on the date given on the horizontal axis; for example, the segment from March 6th to 10th shows the measurement and the model estimates for March 6th. When the measured wave height and period are relatively high, every algorithm shows a large deviation between the estimated and true values. In particular, when the actual wave period exceeds 9 seconds and the wave height exceeds 0.6 meters, the deviation of the period estimates increases; however, the estimation errors of 3D MCNN and 3D CNN remain smaller than those of the other algorithms. Our 3D MCNN is slightly worse than 3D CNN, but the two sets of results are nearly identical.

B. EVALUATION METRICS
In practice, the indicators used to assess prediction accuracy vary. In this experiment, we select the squared correlation coefficient ($R^2$), the index of agreement ($I$), the root mean square error (RMSE) and its normalized form (NRMSE), and the mean absolute error (MAE) to evaluate the models, visualized with scatter plots. The calculation formulas are as follows:

$$R^2 = \frac{\left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2},$$

$$I = 1 - \frac{\sum_{i=1}^{n}(y_i - x_i)^2}{\sum_{i=1}^{n}\left(|y_i - \bar{x}| + |x_i - \bar{x}|\right)^2},$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i)^2}, \qquad \mathrm{NRMSE} = \frac{\mathrm{RMSE}}{x_{\max} - x_{\min}},$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - x_i|,$$

where $x_i$ and $y_i$ are the measured and predicted wave heights, $\bar{x}$ and $\bar{y}$ are the mean measured and mean predicted wave heights, $x_{\max}$ and $x_{\min}$ are the maximum and minimum measurements, and $n$ is the number of measurements.
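These five metrics translate directly into code. The following NumPy sketch (function name illustrative) computes all of them from a measured series x and a predicted series y:

```python
import numpy as np

def evaluation_metrics(x, y):
    """x: measured values, y: predicted values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    # squared correlation coefficient
    r2 = (np.sum((x - xbar) * (y - ybar)) ** 2
          / (np.sum((x - xbar) ** 2) * np.sum((y - ybar) ** 2)))
    # index of agreement
    ia = 1 - np.sum((y - x) ** 2) / np.sum(
        (np.abs(y - xbar) + np.abs(x - xbar)) ** 2)
    rmse = np.sqrt(np.mean((y - x) ** 2))
    return {"R2": r2, "I": ia, "RMSE": rmse,
            "NRMSE": rmse / (x.max() - x.min()),
            "MAE": np.mean(np.abs(y - x))}

# a perfect prediction scores R2 = I = 1 and zero error
m = evaluation_metrics([0.4, 0.6, 0.8], [0.4, 0.6, 0.8])
```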
In the experiment, we compared our model with the existing prediction methods EOFWNN, FRULER, and HM-CCD. In addition, we also ran the basic 3D CNN method and compared the results of our proposed 3D MCNN with all of the other methods.

1) HEIGHT COMPARISON OF OCEAN WAVE
We show scatterplots of the prediction results together with the squared correlation coefficient ($R^2$). Based on the experimental results, we compare the five methods.
As shown in Figure 7, the fitting relationship between the predicted values, i.e., the estimates of each model, and the actual measurements is clear. The determination coefficients of the five models are 0.9876, 0.9883, 0.9943, 0.9990, and 0.9986, respectively. Comparing the predicted and theoretical wave heights of the five models, we find that 3D CNN gives the best fit: the error between its estimated and measured wave heights is smaller than that of the others, and its correlation coefficient is the highest. The fit of 3D MCNN is slightly lower than that of 3D CNN, but its error is still smaller than that of the other algorithms.
To verify that the chosen model is the best, we use several performance indicators to measure the estimation accuracy of wave height and compare 3D MCNN with the four other estimation methods; the results are shown in Figure 8.
We use a bar chart to summarize the results of the evaluation indicators in Figure 8. As the figure shows, the R^2 and R values of the five models are very close to each other, although small differences remain. The R^2 results are consistent with the analysis of Figure 7, and R behaves similarly. I varies from 0.8384 to 0.9777, with 3D MCNN second only to 3D CNN among the five methods. The gap between these two methods is small, and both outperform the other methods. The RMSE, NRMSE, and MAE results differ more clearly. For the RMSE and NRMSE indicators, EOFWNN has the largest value, while for MAE, FRULER has the largest value. All three error indicators show that 3D CNN attains the minimum. Since a smaller error value indicates better prediction performance, 3D CNN obtains the best performance in wave height prediction. In addition, we observe that the difference between 3D MCNN and 3D CNN is only 0.65% in RMSE, 1.19% in NRMSE, and 0.0124% in MAE, so their performance is very close.

2) PERIOD COMPARISON OF OCEAN WAVE
Similarly, we present scatter plots of the wave periods for each model.
As shown in Figure 9, the goodness of fit of the wave period values estimated by EOFWNN, FRULER, and HM-CCD is clearly worse than that of 3D CNN and 3D MCNN. The error between the wave periods estimated by 3D MCNN and the measurements is small, and its correlation coefficient is 0.9945, only 0.0039 lower than that of 3D CNN.
As shown in Figure 10, for the wave period, the differences in R^2 and R across the five models are slightly more pronounced than for wave height: the gaps between the best and worst fits are 0.0208 and 0.0104, respectively. Nevertheless, 3D CNN and 3D MCNN still rank first and second, consistent with the detailed analysis of Figure 9. The maximum value of I is 0.8041; 3D MCNN is again second only to 3D CNN among the five methods, with a value of 0.7074, which is higher than that of the other methods. Compared with wave height, the differences in RMSE, NRMSE, and MAE are more obvious. According to the NRMSE results, EOFWNN performs worst, while 3D CNN and 3D MCNN reach only 0.1335 and 0.1745, respectively. The MAE indicator shows that FRULER and EOFWNN obtain comparable values and perform worse than the other three methods. All three error indicators indicate that 3D CNN still performs best and 3D MCNN second, so 3D CNN obtains the best performance in wave period prediction. Besides, the differences in RMSE, NRMSE, and MAE between 3D MCNN and 3D CNN are 0.2006, 0.041, and 0.1771, respectively; these gaps are slightly larger than in the wave height results, but the performance remains close.
To further demonstrate the performance improvement of our model relative to the other methods, we calculate improvements using

IMT = \frac{I_{3D\,MCNN} - I_{base}}{I_{base}} \times 100\%

where I_{base} is the index of agreement of the benchmark method and I_{3D\,MCNN} is the index of agreement of 3D MCNN. The performance improvements of R^2 and R are calculated in the same way. For the error metrics, a reduction counts as an improvement; for RMSE this is

RMT = \frac{RMSE_{base} - RMSE_{3D\,MCNN}}{RMSE_{base}} \times 100\%

where RMSE_{base} is the RMSE of the benchmark method (EOFWNN, FRULER, or HM-CCD) and RMSE_{3D\,MCNN} is the RMSE of 3D MCNN; NRMSE and MAE are treated in the same manner. The evaluation results in Figures 8 and 10 show that the performance of our proposed model is close to that of 3D CNN. For a detailed comparison, we summarize both models in a table; the results are as follows.
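The improvement computation can be sketched as follows; note that the sign convention flips for error metrics, where a decrease counts as an improvement. The numerical values in the usage lines are illustrative only, not results from the paper:

```python
def improvement(base, mcnn, lower_is_better):
    """Relative improvement of 3D MCNN over a benchmark method, in percent.
    For agreement metrics (I, R2, R) an increase is an improvement;
    for error metrics (RMSE, NRMSE, MAE) a decrease is."""
    if lower_is_better:
        return (base - mcnn) / base * 100.0
    return (mcnn - base) / base * 100.0

# Illustrative values only:
imt = improvement(0.80, 0.88, lower_is_better=False)  # index of agreement
rmt = improvement(2.00, 1.50, lower_is_better=True)   # RMSE
```

With these inputs, the index-of-agreement improvement is 10% and the RMSE improvement is 25%.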
TABLE 4 compares the 3D MCNN model, which uses depthwise separable convolution, with the 3D CNN model, which uses full convolution. The statistics in TABLE 4 indicate that, for the actual measurements, 3D MCNN is slightly inferior to 3D CNN in the wave height and period measurement metrics.
The data show that the differences in R^2, R, and MAE are within 1% for both height and period. For NRMSE, the difference is no more than 4.1%. For RMSE, the difference is still relatively small for the height data and slightly larger for the period data, as is the difference in the index of agreement. However, in terms of computational overhead, 3D MCNN saves tremendously on mult-adds, prediction time, and parameters, greatly reducing the computational cost.
Due to the complexity of the marine environment, the devices deployed at facilities such as mobile observation points are usually edge devices with limited memory and computing power. The larger a model's parameter scale, the higher its demand for hardware resources, which hinders deployment on observation equipment. To ensure that the trained learning model can be deployed on mobile devices to complete inference tasks and deliver reliable, fast inference results, a model with fewer parameters is the ideal choice. In addition, the lighter the model, the lower the equipment cost. Therefore, we select 3D MCNN as our final model.
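The parameter savings that motivate this choice can be checked with simple counting. The channel and kernel sizes below are illustrative assumptions, not the paper's actual layer configuration:

```python
def conv3d_params(c_in, c_out, k):
    """Weight count of a standard (full) 3D convolution layer, bias omitted:
    every output channel has one k*k*k filter per input channel."""
    return c_in * c_out * k ** 3

def separable_conv3d_params(c_in, c_out, k):
    """Weight count of a depthwise separable 3D convolution: a depthwise
    step (one k*k*k filter per input channel) plus a pointwise step
    (a 1x1x1 convolution mixing channels), bias omitted."""
    return c_in * k ** 3 + c_in * c_out

# Example layer: 64 input channels, 128 output channels, 3x3x3 kernel.
full = conv3d_params(64, 128, 3)             # 221184 weights
light = separable_conv3d_params(64, 128, 3)  # 9920 weights
ratio = light / full                         # about 4.5% of the full count
```

For this example layer, the separable factorization needs under 5% of the full convolution's weights, which is the kind of reduction that makes deployment on resource-limited observation hardware feasible.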

V. CONCLUSION
In this paper, we develop a low-cost, high-accuracy detection scheme named iOceanSee for estimating ocean wave parameters using a 3D Mobile Convolutional Neural Network. iOceanSee is a lightweight implementation of the 3D Convolutional Neural Network designed specifically to detect wave period and height in the marine environment. First, depthwise separable convolution is adopted to construct a lightweight network that reduces computational overhead. Second, a mobile terminal composed of low-cost measuring equipment and a non-intrusive device (sensitive only to lighting), an RGB camera, is used to collect video data in real time on the vessels YUMING and XIANGYANGHONG81. Third, our model extracts features from both the spatial and temporal dimensions, capturing the encoded motion information across multiple adjacent frames of the video. Finally, we compare our scheme with other models. The experimental results show that our method achieves performance comparable to 3D CNN and higher measurement accuracy than existing methods such as EOFWNN, FRULER, and HM-CCD, for both height and period measurement.
In future work, we intend to use ocean data collected from other observation stations to pre-train our model in pursuit of higher accuracy and lower cost. In addition, we will improve our architecture, for example by using attention mechanisms, in the hope of addressing other tasks in the marine field.