Maritime Visible Image Classification Based on Double Transfer Method

Image classification using deep transfer learning has received significant attention, benefiting from pre-training on large-scale annotated datasets and the continuous improvement of neural network structures. In contrast to universal image classification, however, few publicly available maritime datasets utilize deep transfer learning. Due to data-gathering effort and computational cost, maritime datasets lack both a method for merging datasets and a benchmark for few-shot dataset classifiers. In this article, we propose the double transfer method, consisting of a merging datasets network and a backbone network, to address this problem. The merging datasets network measures image similarity to separate classes of known and unknown samples and reorganize a dataset, and the backbone network is constructed from the EfficientNet-b5 model by network-based deep transfer learning. Using the merging datasets network, we introduce the Visible Maritime Image dataset, which has 3,750 images and twenty-five classes covering a multitude of maritime objects. The backbone networks evaluated and analyzed the dataset based on accuracy, precision, recall, and F-measure metrics. Using the double transfer method, we achieve an accuracy of 91.39% on the Visible Maritime Image dataset.


I. INTRODUCTION
In the last decades, the application of computer vision in maritime environments has made it possible to monitor marine traffic and sea border activities, detect ships illegally occupying individual channels and anchorage grounds, and collect intelligence on illegal fishing and smuggling [1]-[3]. In recent years, large-scale annotated datasets have been used to train convolutional neural networks (CNNs) with supervised learning to achieve image classification [4]. However, the training process entails considerable data-gathering effort and computational cost, and publicly available datasets are scarce in maritime environments [5], [6]. There are some maritime datasets, such as those for object detection and tracking [7], piracy detection [8], and obstacle detection [9], [10], but none of them is used for maritime classification [6]. Compared to remote sensing datasets and infrared (IR) datasets [2], electro-optical visible datasets are insufficiently applied to maritime environments [6], [7]. The majority of maritime visible classification datasets are closed source and difficult to obtain, such as the Maritime Detection Classification and Tracking (MarDCT) datasets [11] and the E2S2-Vessel dataset [12]. Besides, rare categories are hard to collect, and a few instances can hardly train a deep CNN to a satisfying accuracy. The visible and infrared ship imagery (VAIS) coarse-grained dataset [13] and the MARitime VEsseLs (MARVEL) fine-grained dataset [14] are the few publicly available datasets for maritime image classification. However, class imbalance appears in these datasets, which affects the result of image classification [6], [15]. In contrast to these two publicly available datasets, few datasets merge different existing annotated maritime datasets with unlabelled data from maritime environments for image classification.
The associate editor coordinating the review of this manuscript and approving it for publication was Hongjun Su.
The state-of-the-art CNNs using deep learning have achieved significant success in image recognition and are effective models for a variety of visual tasks [4], [16], [17]. However, the application of CNNs to ship classification in maritime environments began only in 2015 [13]. Due to the shortage of well-labeled datasets and of a robust algorithm or specific benchmark process, the majority of solutions dwell on coarse-grained classification instead of fine-grained or instance-level image classification [11], [14], [18], [19]. Numerous approaches remain designs and improved CNN models based on AlexNet and VGGNet [4], [12]-[14], [18]-[21]. State-of-the-art CNN models, such as EfficientNet, are rarely used for ship classification with the VAIS dataset or the MARVEL dataset [22]. The main contributions of this article are summarized as follows: (1) The double transfer method is proposed, consisting of the merging datasets network and the backbone network. The merging datasets network measures image similarity to reorganize disparate datasets into fine-grained datasets. It is used to build the Visible Maritime Image (VMI) dataset, which has 3,750 images and twenty-five classes. The network eliminates class imbalance [15] and improves similarity within the same class. (2) In contrast to elementary convolutional neural networks, the backbone network is structured on the EfficientNet-b5 model using deep transfer learning and is compared with AlexNet, VGGNet, ResNet, DenseNet, SqueezeNet, and other EfficientNet backbone networks. The backbone networks evaluated and analyzed the VMI dataset based on accuracy, precision, recall, and F-measure metrics.
The rest of this article is organized as follows. Section II reviews the maritime dataset and image classification. Section III introduces the double transfer method consisting of the merging datasets network and the backbone network. Section IV describes the VMI datasets by the merging datasets network. Section V reports experiments and results with the VMI datasets using the backbone network. Finally, the conclusion and discussion are given in Section VI.

II. RELATED MARITIME DATASETS
In the maritime image classification, the commonly available datasets are the VAIS dataset and MARVEL dataset.

A. THE VAIS DATASET
The VAIS dataset consists of 1,623 visible images in six coarse-grained categories [13]: merchant ship, sailing ship, medium passenger ship, medium ''other'' ship, tugboat, and small boat. To analyze the quality of the dataset, it is necessary to check whether the classes are equally distributed [6]. Fig. 1 shows the distribution of the ''official'' training and test sets, which partition the VAIS dataset into roughly a (54%, 46%) ratio across the six coarse-grained categories [18], [19]. It indicates that the VAIS data have a strong class imbalance. Another important property for image classification is image resolution. The majority of images have a resolution of less than 50,176 (224×224) pixels, which is below the input resolution of many CNNs, and 44 images have fewer than 1,024 (32×32) pixels, losing local details, which ultimately lowers image classification performance [2], [4], [17], [20], [22]-[25].
Zhang et al. [13] presented the gnostic field together with a CNN pre-trained on ImageNet, using it to extract features from VAIS for image classification, and achieved the best accuracy of 82.4% using the gnostic field alone. However, fine-tuning the pre-trained model yielded poorer performance, an accuracy of 81.9%, which is less than 82.4%. Khellal et al. [26] reported the problem of overfitting due to the lack of training data, even when using regularization techniques. Shi et al. [19] proposed a method consisting of a multi-feature ensemble based on the AlexNet CNN (ME-CNN), using the two-dimensional discrete fractional Fourier transform (2D-DFrFT) with amplitude (M) and phase (P) to extract information. Using the ME-CNN trained from scratch achieves the best accuracy of 87.33%. Subsequently, fine-tuning a CNN with multi-scale rotation-invariant features was proposed, achieving an accuracy of 88.00% [18].

B. THE MARVEL DATASET
The MARVEL dataset is a large-scale, fine-grained image dataset with 26 superclasses and 239,622 images. The official split assigns an equal number of examples to each superclass, with a training-to-test-set proportion of 8:1 (88.89% and 11.11%) [14]. All images in the dataset are larger than 155,236 (approximately 394×394) pixels. A critical property for fine-grained image classification is vessel taxonomy. The Statcode5 ship coding system proposed by IHS Markit is a standard ship typing that makes a logical breakdown and allocation easier to see [3], [27]. The images in MARVEL were collected from the shipspotting website [28]. However, the superclasses of the dataset cover ships and vessels only; no other maritime targets, such as boats and buoys, are considered, as shown in Table 1. Besides, the statistics show that among all those classes, the following are the dominant ones (over 80%) in terms of the number of images in reality: cargo ship, tanker, bulk carrier, container ship, and passenger ship [12], [14].
Gundogdu et al. [14] applied AlexNet to achieve an accuracy of 73.15%. However, using an SVM with pre-trained VGGNet features obtains an accuracy of only 53.89%, and a few types, such as livestock carrier, fire fighting vessel, and training ship, have a precision of only 20%-50%. Leclerc et al. [3] applied GoogLeNet to achieve an accuracy of 78.73% on the MARVEL dataset.
Besides, some results show that training from scratch can achieve better experimental results [2]. He et al. [29] argue that proper pre-training can make up for the lack of data, bringing a significant improvement in the case of small datasets and insufficient computing power, and can shorten research time. Also, Yosinski et al. [30] showed that when a model A is trained on the training set A-Data, fine-tuning it on a dataset B-Data can reach a level similar to the original, and that the number of fine-tuned layers has no apparent relationship with accuracy. Donahue et al. [31] showed that features extracted by a deep CNN trained on a large-scale dataset can be reused for new generic tasks.

III. DOUBLE TRANSFER METHOD
Deep learning is a branch of machine learning that contains a variety of state-of-the-art CNN models. Generally speaking, a machine learning program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance on T, as measured by P, improves with E. However, deep learning struggles to obtain good performance when restricted to few-shot data and insufficient model training. Few-shot learning (FSL) is a type of machine learning problem (specified by E, T, and P) where E contains only a limited number of examples with supervised information for the target T; it is a particular solution within deep learning [5]. A few-shot, fine-grained dataset poses an FSL problem, which differs from the classification of a large-scale dataset. The double transfer method addresses this problem through two aspects: one is merging multiple data sources, and the other is the adjustment of the CNN model with network-based transfer learning. First of all, the notation and definitions are given in the following.

A. NOTATION AND DEFINITION
The double transfer method is based on transfer learning, whose most important concepts are domain and task.
A domain D is composed of two parts, a feature space X and a marginal probability distribution P(X), i.e., D = {X, P(X)}, where P(X) is hard to obtain in explicit form. The symbol X also denotes an instance set, defined as X = {x_i | x_i ∈ X, i = 1, ..., n}. A task T consists of a label space Y and the target prediction mapping f: X → Y, i.e., T = {Y, f(x)} [32], [33].
Therefore, the definition of transfer learning can be given. Given some observation(s) corresponding to m_S ∈ N+ source domain(s) and task(s) (i.e., {(D_Si, T_Si) | i = 1, ..., m_S}), and some observation(s) about m_T ∈ N+ target domain(s) and task(s) (i.e., {(D_Tj, T_Tj) | j = 1, ..., m_T}), transfer learning utilizes the prior knowledge of experience E implied in the source domain(s) to improve the performance of the learned decision functions f_Tj (j = 1, ..., m_T) on the target domain(s) [5], [32]. Then, deep transfer learning is defined as follows. Given a transfer learning task defined by (D_S, T_S, D_T, T_T, f_T(·)), it is a deep transfer learning task when f_T(·) is a non-linear function realized by a deep neural network [33].
However, we can only take a finite number of samples from the domain D, so all hypothesis mappings h: X → Y make up the hypothesis space H. The effect of deep transfer learning on the hypothesis space is shown in Fig. 2; in the following discussion, only a limited sample is considered. The search path in Fig. 2 is driven by the optimizer [34]. h_I is the empirical best, i.e., the achievable hypothesis mapping; h* is the best in H, which demands a state-of-the-art CNN model; ĥ is the optimal mapping, which, however, demands the nonexistence of unknown knowledge. Deep transfer learning uses prior knowledge to reduce the factors restricting the search for the best h*.
In classification, the sample set is defined as D_C = {(x_i^C, y_i)}, i = 1, ..., N, where N is the total number of samples. Assuming the labels y_i ∈ {1, ..., c}, c ≤ N, and a trained hypothesis mapping h(D_C), the predicted set is {ŷ_i}, i = 1, ..., N. Each sample x_i^C is decomposed as

x_i^C = g + u_i + s_i,

where 1 < m, c_n, n ≤ k; g is the feature identical across all classes C; u_i is the feature shared within the correlated classes c_n, c_n < C; and s_i is the feature specific to each class. g and u_i constitute prior knowledge. Since g describes a feature common to all classes, the decomposition can be simplified to

x_i^C = u_i + s_i.

Meanwhile, the aim of the classifier is to find optimal results with no correlation between the x_i^C, in other words, to minimize ||U||_2, where ||U||_2 is the L2 norm of U. With such a classifier, there is at least a probability of (1 − δ) of learning the correct classes (the error is at most ε) [35]. However, the feature U can hardly be obtained directly, so we use the accuracy ACC from the confusion matrix instead of ||U||_2 to measure the association between classes. The confusion matrix M is defined such that its entry M_cc' counts the samples of true class c predicted as class c', and the accuracy ACC is the proportion of correctly predicted samples,

ACC = (Σ_c M_cc) / (Σ_c Σ_c' M_cc'),

so the error bound above can be restated with ACC in place of ||U||_2, where P(·) denotes the unknown probability distribution. Commonly, only the accuracy ACC is used to measure the classification ability of the network. In addition, precision represents the proportion of correct positive predictions among all samples predicted as class c,

P_c = M_cc / Σ_c' M_c'c,

recall represents the proportion of correct predictions among all samples whose true label is class c,

R_c = M_cc / Σ_c' M_cc',

and the F measure is the harmonic mean of precision and recall,

F_β = (1 + β²) · P_c · R_c / (β² · P_c + R_c),

where β measures the relative importance of R to P. Generally, β = 1 is selected, and the F measure is then called F-1. Precision, recall, and F measure are per-class metrics.

Algorithm 1 Merging Multiple Data Sources
1: Train G_h1(G_c) with the source samples {x_i, y_i};
2: Initialize N_l = 0;
3: while unlabeled target samples remain do
4:   for each target sample x_j do
5:     Compute w_j^T = G_h2(G_c(x_j));
6:     Compute the similarity between w_j^T and each class average w̄_c^S;
7:     if x_j is most similar to a known class c then
8:       if the maximum similarity satisfies the acceptance criterion then
9:         Update w̄_c^S;
10:        Target sample x_j is labeled, {x_j, y_j}, and N_l++;
11:      end if
12:    end if
13:  end for
14:  if a class contains more than N_c samples then
15:    Find the sample in {x_i, y_i} with the minimum similarity value;
16:    Its label is set to outlier;
17:  end if
18: end while
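The per-class metrics above can be sketched in a few lines of plain Python; the small 3×3 confusion matrix below is illustrative only, not data from the paper.

```python
# Per-class precision, recall, and F measure from a confusion matrix M,
# where M[i][j] counts samples of true class i predicted as class j.
# The 3x3 matrix below is an illustrative toy example.

def accuracy(M):
    total = sum(sum(row) for row in M)
    correct = sum(M[i][i] for i in range(len(M)))
    return correct / total

def precision(M, c):
    # correct predictions of class c / all samples predicted as class c
    pred_c = sum(M[i][c] for i in range(len(M)))
    return M[c][c] / pred_c if pred_c else 0.0

def recall(M, c):
    # correct predictions of class c / all samples whose true label is c
    true_c = sum(M[c])
    return M[c][c] / true_c if true_c else 0.0

def f_measure(M, c, beta=1.0):
    # harmonic mean; beta weighs recall relative to precision (beta=1 -> F-1)
    p, r = precision(M, c), recall(M, c)
    return (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0

M = [[8, 1, 1],
     [0, 9, 1],
     [2, 0, 8]]
print(accuracy(M))       # 25/30
print(precision(M, 0))   # 8/10
print(recall(M, 0))      # 8/10
print(f_measure(M, 0))
```

With β = 1 the precision and recall of a class are weighted equally, matching the F-1 measure used throughout the experiments.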

B. THE MERGING DATASETS NETWORK
The source domain consists of different datasets with dissimilar labeling systems, and the target domain has unlabeled samples. Given a set of domains, we acquire the source domain with labeled samples and the target domain with unlabeled samples. Each class c in the source domain has at least ten samples (c ∈ C_s). The merging datasets network includes a feature extractor G_c and classifiers G_hi, i ∈ {1, 2}. G_c is trained with the source samples {x_i, y_i} based on a pre-trained AlexNet model using deep transfer learning. G_h1 is a C_s classifier, and G_h2 is a (C_s + O_s) classifier, where the O_s class has only one label, ''outlier''. Thus, the loss L_train of the trained feature extractor and classifier G_h1(G_c) is the cross-entropy loss over the source samples {x_i, y_i}. The outputs are the probabilities of a sample belonging to each known class c, so we use the average probability w̄_c^S to measure the similarity between a target sample x_j and a source class, where w̄_c^S is the average probability matrix of class c in the source domain and w_j^T is the output of G_h2(G_c(x_j)). The loss L_test of the tested G_h2(G_c) is defined analogously over the target samples. Inputting x_j to G_h2(G_c), we define the similarity between two target samples such that the maximum similarity of target sample j is greater than that of target sample j − 1. The current sample x_j is labeled as (x_j, y_j), and w̄_c^S changes as the labeled data increase, as shown in Algorithm 1.
After the complete classification of the target samples, a few classes contain more samples than the sample maximum N_c. Therefore, the minimum similarity value among all samples of such a class is found, and that sample is marked as an outlier. Finally, we obtain the new dataset D_N = {x_n, y_n}, n = 1, ..., N_n. The procedure is shown in Algorithm 1. Besides, N_l is the number of labeled target samples in Algorithm 1; N_l is initialized to 0, and N_l only counts the classes C_s rather than the classes (C_s + O_s).
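A pure-Python sketch of the merging step described above, with plain probability vectors standing in for the outputs of G_h2(G_c(x)); the cosine similarity, the 0.5 acceptance threshold, and the class cap `n_c` are illustrative assumptions, not the paper's exact choices.

```python
# Sketch of the merging datasets step: assign each target sample to the
# most similar known class, update that class's running average, and
# demote members of oversized classes to the outlier class.
# The similarity measure, threshold, and cap are illustrative assumptions.

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def merge(avg_prob, targets, threshold=0.5, n_c=100):
    """avg_prob: class -> running average probability vector (source domain).
    targets: probability vectors for unlabeled target samples."""
    labeled = {c: [] for c in avg_prob}
    outliers = []
    for w in targets:
        # pick the most similar known class
        c_best = max(avg_prob, key=lambda c: cosine(w, avg_prob[c]))
        if cosine(w, avg_prob[c_best]) >= threshold:
            labeled[c_best].append(w)
            # update the running class average with the newly labeled sample
            n = len(labeled[c_best])
            avg_prob[c_best] = [(a * (n - 1) + x) / n
                                for a, x in zip(avg_prob[c_best], w)]
        else:
            outliers.append(w)  # left for the outlier class O_s
    # cap oversized classes: least similar member becomes an outlier
    for c, members in labeled.items():
        while len(members) > n_c:
            worst = min(members, key=lambda w: cosine(w, avg_prob[c]))
            members.remove(worst)
            outliers.append(worst)
    return labeled, outliers

avg = {"ship": [0.9, 0.1], "buoy": [0.1, 0.9]}
labeled, outliers = merge(avg, [[0.8, 0.2], [0.2, 0.8], [0.7, 0.3]])
```

In the paper the probability vectors come from the AlexNet-based extractor G_c, and the acceptance criterion compares successive maximum similarities rather than a fixed threshold; the structure of the loop is the same.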

C. THE BACKBONE NETWORK
We propose the backbone network using network-based deep transfer learning. Network-based deep transfer learning refers to the reuse of a partial network that was pre-trained in the source domain, including its network structure and connection parameters, transferring it to be part of the deep neural network used in the target domain [33].
First, the large-scale ImageNet dataset is used to train the backbone network G_hs(G_p1(G_t)) to obtain optimal parameters, where G_hs is a source domain classifier, G_p1(G_t) is a feature extractor, and G_t is the part of the network to be transferred. Then, the first few layers G_t of the network are transferred to be part of the transferred backbone network G_ht(G_p2(G_t)), where G_ht is a target domain classifier and G_p2(G_t) is a feature extractor. Finally, the target domain trains G_ht(G_p2(G_t)), as shown in Fig. 3. The loss of the trained feature extractor G_p2(G_t) is the cross-entropy loss L computed between the predictions G_ht(G_p2(G_t(x_i))) and the labels y_i.
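The transfer of G_t and the attachment of a new target classifier G_ht can be illustrated without any deep learning library, representing layers as plain parameter lists; the layer sizes and names below are illustrative, not the paper's architecture.

```python
import random

# Network-based transfer, sketched with layers as parameter lists:
# reuse the first few (pre-trained) layers G_t and attach a freshly
# initialized classifier G_ht sized for the target label space.
# All shapes here are toy values chosen for illustration.

def make_layer(n_in, n_out, rng):
    # one fully connected layer: weight matrix of shape (n_out, n_in)
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def transfer(source_layers, n_transfer, n_features, n_target_classes, rng):
    # copy the first n_transfer pre-trained layers (G_t) ...
    g_t = [[row[:] for row in layer] for layer in source_layers[:n_transfer]]
    # ... and replace the classifier with a new head G_ht for the target task
    g_ht = make_layer(n_features, n_target_classes, rng)
    return g_t + [g_ht]

rng = random.Random(0)
# "source" network pre-trained on 1000 ImageNet classes: 3 layers + head
source = [make_layer(8, 8, rng) for _ in range(3)] + [make_layer(8, 1000, rng)]
# target network for the 25 VMI classes reuses the source layers
target = transfer(source, n_transfer=3, n_features=8, n_target_classes=25, rng=rng)
```

In practice the copied layers would come from a pre-trained CNN and the new head would then be fine-tuned on the target domain, exactly as G_ht(G_p2(G_t)) is trained above.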
In the method, we choose the EfficientNet model for the backbone network, which is the state-of-the-art network structure in image classification. The backbone network using the EfficientNet model is compared with the AlexNet, VGGNet, ResNet, DenseNet, and SqueezeNet backbone networks. The structure of the proposed CNNs is shown in Fig. 4.
AlexNet was the first CNN trained on a large-scale dataset [4], seen in Fig. 4(a). Compared to the other networks, VGGNet has the largest number of parameters and occupies more memory with higher complexity to improve accuracy [20], [25]; VGGNet-11 and VGGNet-16 are chosen in this article, seen in Fig. 4(b). SqueezeNet uses model compression to achieve AlexNet-level accuracy on ImageNet with 50 times fewer parameters and proposes the Fire structure [17], shown in Fig. 4(e)(i). ResNet introduces the deep residual learning framework, H(x) = F(x) + x, where H(x) is the desired underlying mapping, x denotes the inputs to the first of these layers, and the layers approximate the residual function F(x) = H(x) − x [23], seen in Fig. 4(c)(g). DenseNet is built on the ideas of ResNet [24], shown in Fig. 4(d)(h). Therefore, ResNet-18, ResNet-50, ResNet-101, DenseNet-121, and DenseNet-169 are chosen in this article. Besides, EfficientNet can provide higher accuracy by adjusting the network depth d, width w, and resolution r. EfficientNet-b5 and EfficientNet-b7 use a compound scaling method to scale up from EfficientNet-b0, shown in Fig. 4(f)(j). The compound scaling method is defined as d = α^φ, w = β^φ, r = γ^φ, subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1, where α, β, γ are constants that can be determined by a small grid search, and φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network depth d, width w, and resolution r, respectively [22].

IV. VISIBLE MARITIME IMAGE DATASET
The merging datasets network is used to create the Visible Maritime Image dataset. The classes of the VMI dataset refer to the categories of the SMD, the VAIS dataset, the MARVEL dataset, the naval ship dataset, and the Statcode 5 Shiptype coding system to satisfy the diversity of maritime environments [7], [13], [14], [27], [36].

V. EXPERIMENT AND RESULTS
In this section, the backbone network with network-based deep transfer learning is tested separately on the VMI dataset, and the backbone networks evaluated and analyzed the dataset based on accuracy, precision, recall, and F-measure metrics. To verify that deep transfer learning can improve the recognition ability of the networks and reduce the training time, we designed partition experiments that tested the impact of the dataset proportion on the accuracy of the test set. AlexNet is chosen as the benchmark network, being a widely used neural network structure, to obtain a set of benchmark data, and the remaining networks are tested against the benchmark network to show that EfficientNet-b5 is the most efficient backbone network. The method is evaluated and analyzed with evaluation metrics, including accuracy, precision, recall, and F-1 measure. Finally, the backbone network with network-based deep transfer learning is tested on the EO images of the VAIS dataset to prove the efficiency of the proposed network. We use a single NVIDIA GeForce GTX 2080 Ti GPU with 11GB of GPU memory for training, an Intel Core i5-8400 processor, and 16GB of RAM.
First, different proportions of the training set to the test set are applied to the benchmark network AlexNet. The five proportions are (70%, 30%), (75%, 25%), (80%, 20%), (85%, 15%), and (90%, 10%). The CNNs are tested for 100 epochs to verify that deep transfer learning can reduce the training epochs and achieve a better effect. AlexNet is pre-trained on ImageNet. The number of transferred layers has no obvious relationship with accuracy [30], but inconsistent categories fail to classify, so the last fully connected layer of AlexNet is replaced with the 25 categories of the VMI dataset, shown in Fig. 4(a). Additionally, we choose a 224 × 224 image size based on the relationship between network performance and image resolution [2], [4], [17], [20], [22]-[24], and use the Adam optimizer [34] with a 0.001 learning rate and the cross-entropy loss function. The result is shown in Table 2. As can be seen in Table 2, (85%, 15%) gives the best average results. In case 1 and case 2, the divided datasets are used with the same proportion but different images between the training and test sets to test the benchmark network AlexNet. In case 1, AlexNet without deep transfer learning achieves a maximum accuracy of about 10%, demonstrating weak classification performance with severe underfitting. In case 2, AlexNet based on network-based transfer learning achieves an accuracy of 79.16% in 100 epochs. Of the five proportions, three cases, (70%, 30%), (75%, 25%), and (80%, 20%), are underfitting, and only one case, (90%, 10%), is overfitting, where the extracted features of the training set obtain a large ||U|| and overfit. Therefore, (85%, 15%) is chosen to divide the training and test sets by random sampling, and the same maritime images belong either to the training or to the test dataset but not both. The training set has 3,188 images, and the test set has 562 images. In other words, the average class imbalance ratio between different categories is less than 0.7%.
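The (85%, 15%) random split with disjoint training and test sets can be sketched as follows; the synthetic filenames are stand-ins for the VMI images.

```python
import random

# Sketch of the (85%, 15%) random split used for the VMI dataset:
# every image goes to exactly one of the two sets, so the training
# and test sets are disjoint. Filenames here are synthetic stand-ins.

def split_dataset(images, test_ratio=0.15, seed=0):
    rng = random.Random(seed)
    shuffled = images[:]          # copy so the input order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]   # train, test

images = [f"img_{i:04d}.jpg" for i in range(3750)]
train, test = split_dataset(images)
print(len(train), len(test))        # 3188 562
assert not set(train) & set(test)   # no image appears in both sets
```

With 3,750 images this yields the 3,188/562 partition used in the experiments; fixing the seed makes the split reproducible across runs.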
However, the merging datasets network might misclassify the target dataset D_T and delete correctly classified images of D_S for image balance, so label noise appears in the VMI dataset. Deep networks are known to be relatively robust to label noise, which has little effect on the results [37]. Therefore, the backbone networks evaluated and analyzed the VMI dataset.
Then, the classification capabilities of the backbone networks are tested three times on the VMI dataset. The backbone networks, including AlexNet, VGGNet (VGGNet-11, VGGNet-16), ResNet (ResNet-18, ResNet-50, ResNet-101), DenseNet (DenseNet-121, DenseNet-169), SqueezeNet, and EfficientNet (EfficientNet-b0, EfficientNet-b5, EfficientNet-b7), are measured by accuracy over 100 epochs to obtain the maximum accuracy every 25 epochs. The backbone networks based on network-based transfer learning are pre-trained on ImageNet. We use the Adam optimizer with a 0.001 learning rate, the cross-entropy loss function, and a 224 × 224 input size to train the backbone networks. The results are shown in Fig. 8, Table 3, and Fig. 9. In Fig. 8, the distance between the median value and the maximum value is small for fewer training epochs, as for ResNet-50 and SqueezeNet. The outliers, which represent training instability, lie outside the range between maximum and minimum, as for ResNet-18, DenseNet-169, and SqueezeNet; the probability of the outliers is less than 0.7%. Intuitively, EfficientNet reaches better performance than the other backbone networks, and the accuracy of EfficientNet-b5 is the highest. These results confirm that EfficientNet transfers well and achieves state-of-the-art accuracy [22]. Table 3 shows the maximum accuracy of the backbone networks on the VMI test set with respect to the parameters of the models and the computational complexity of the considered architectures for a single forward pass, namely floating-point operations (FLOPs) [25]. Fig. 9 reports the accuracy versus the parameters and FLOPs of the backbone networks, where the ball size corresponds to the FLOPs of the models. Intuitively, as parameters and FLOPs increase, the accuracy of the same series of backbone networks constantly improves, as with VGGNet, ResNet, DenseNet, and EfficientNet.
EfficientNet-b7 has the most parameters, but its accuracy is lower than that of EfficientNet-b5 because the GPU memory is limited by the parameters and FLOPs. Simultaneously, under similar parameters, improvements to the network structure achieve higher accuracy, such as the improved depth of VGGNet [20], the deep residual learning framework of ResNet [23], the dense block of DenseNet [24], and the fire layer of SqueezeNet [17], shown in Fig. 4. Therefore, compared with the other backbone networks, EfficientNet-b5 reaches the highest accuracy with similar parameters and FLOPs. Then, precision, recall, and F-measure are evaluated for EfficientNet-b5 to analyze the classification of each category. The confusion matrix reveals the categories to which each image in the test set is assigned, with off-diagonal entries wrongly classified, shown in Fig. 10. The class bulk carrier and several other classes, such as tanker and open hatch cargo ship, are misclassified as each other; these vessels are used to load bulk cargo and have a similar appearance, resulting in classification errors. At the same time, the precision, recall, and F-1 measure data in Table 4 indicate that different types of images have different prediction labels. The majority of classes have high similarity within the same class: seventeen categories have F-1 values of about 91%-100%, and the accuracy rate of EfficientNet is 91.39%. However, the F-1 measures of eight categories, consisting of barge, bulk carriers, combat vessel, destroyer, fishing vessel and boat, jetski, open hatch cargo ship, and tanker, are less than 0.91.
The backbone network EfficientNet-b5 is trained and tested on the VAIS dataset to validate the effectiveness of the proposed method. The training process uses the same parameters as for the VMI dataset: the Adam optimizer with a 0.001 learning rate and the cross-entropy loss function are chosen, and a 224 × 224 image resolution is used as the input size. The accuracy of a few state-of-the-art methods is compared on the VAIS dataset, shown in Table 5. In contrast to the accuracy of 88.00% achieved by the previous best method, Gabor + MS-CLBP + CNN + SVM, the proposed method, EfficientNet-b5 using network-based deep transfer learning, obtains an accuracy of 98.53%.
The performance of the two methods, the previous best method and the proposed method, is analyzed by precision, recall, F-1 measure, and execution time. Fig. 11 shows the confusion matrices revealing the class to which each image in the test set of the VAIS dataset is assigned, with off-diagonal entries wrongly classified. The left confusion matrix is the test result of Gabor + MS-CLBP + CNN + SVM [18]; the right confusion matrix is the test result of the proposed method, EfficientNet-b5 using network-based deep transfer learning. Only 11 of the 750 samples in the VAIS dataset's test set are misclassified using the proposed method. The precision, recall, and F-1 measure of each class are calculated from the confusion matrix, and Table 6 reports the precision, recall, F-1 measure, and execution time for Gabor + MS-CLBP + CNN + SVM and the proposed method. The execution time [18] indicates the training and test time on the VAIS dataset needed to achieve the best performance, which evaluates the performance of a model mixed with other algorithms, unlike FLOPs, which only reflect the performance of the CNN. Compared with Gabor + MS-CLBP + CNN + SVM, the precision, recall, and F-1 measure of the proposed method are improved, achieving better performance. The execution time of the proposed method is about 892 seconds, which is similar to the previous method. The backbone network EfficientNet-b5 using network-based deep transfer learning is thus proved to achieve state-of-the-art accuracy, precision, recall, and F-1 measure on the EO images of the VAIS dataset.

VI. CONCLUSION AND DISCUSSION
We proposed the double transfer method to obtain a new dataset, the Visible Maritime Image dataset, consisting of ships, vessels, boats, and other objects of maritime environments, to expand the scope of maritime image classification and conform to the actual situation. The accuracy on the test set reached 91.39% using the backbone network EfficientNet-b5 with deep transfer learning, verifying that small fine-grained datasets can achieve high accuracy. Simultaneously, the backbone network EfficientNet-b5, using the same parameters, was trained on the VAIS dataset to achieve a best accuracy of 98.53%, an improvement of about 6% to 10% over other algorithms, which proves that the backbone network is robust.
The VMI dataset might be regrouped into five coarse-grained categories: passenger (passenger ship, sailing vessel, sailing boat, speed boat, jetski, cabin boat, windsurfer, yacht), passenger/freight (barge, passenger and ro-ro cargo ship), merchant (bulk carriers, container ship, dredging ship, fishing vessel and boat, open hatch cargo ship, platform supply ship, ro-ro cargo ship, tanker, tugboat), navy (aircraft carrier, combat vessel, destroyer, replenishment vessel), and other (buoy, platform). However, class imbalance appears in the classification of these five coarse-grained categories.
The experimental results show that the backbone network EfficientNet-b5 cannot classify ships with the same ship style but different cargo loads, such as bulk carriers and tankers, which dominate the sea transportation of bulk cargo [12], [27]. Besides, the maritime dataset may include label noise [37], [39]. The EO camera can only capture images within a few kilometers and needs sufficient light sources to collect data [40]. Also, in the case of fog, snow, and rain, the EO image quality is blurred and not easily identified due to the limits of camera parameters [7]. Fortunately, methods to reduce label noise [41], [42], a small object detection method [43], brightness enhancement of night images [44], dehazing [45], and rain and snow removal [46] have been proposed, which strengthen the application of EO cameras for image processing and computer vision. So, the new ship classifier will be integrated with not only the images captured by the EO camera but also the information of multi-sensor data fusion to improve image classification and recognition.
JIANHUA WANG received the B.E. degree in fluid control and manipulation from Beihang University, Beijing, China, in 1987, the M.S. degree in pattern recognition and intelligent system from Chongqing University, Chongqing, China, in 2004, and the Ph.D. degree in pattern recognition and intelligent system from Shanghai Jiao Tong University, Shanghai, China, in 2007.
Since 2007, he has been an Associate Professor with the Institute of Logistics Science and Control Engineering, Shanghai Maritime University, Shanghai. He is the author of more than 30 articles and holds more than 15 patents. His research interests include navigation, guidance, and control for unmanned surface vehicles, consisting of object detection and tracking, SLAM, path planning and tracking, and embedded systems.