Ground-Based Cloud Image Recognition System Based on Multi-CNN and Feature Screening and Fusion

The recognition of ground-based cloud images has rich application prospects in weather prediction, astronomical site selection, and meteorological observation. Affected by factors such as rotation and illumination, traditional feature extraction methods struggle to describe cloud image features accurately, resulting in low recognition accuracy that cannot meet the requirements of practical applications. With the popularity of convolutional neural networks in image processing, ground-based cloud image recognition algorithms based on convolutional neural networks have become a research focus. However, the features of ground-based cloud images are relatively shallow, and cloud texture and other features are severely lost during convolution, making a good recognition effect difficult to achieve. This paper proposes a ground-based cloud image recognition system based on a multi-scale convolutional neural network (Multi-CNN) and a multilayer perceptron (MLP). Multi-level, multi-scale convolution features are extracted through the convolutional layers of Multi-CNN, and local features with strong resolving power are selected through a feature screening algorithm based on DP clustering. Finally, the local features are encoded and fused for cloud image classification with an MLP. Field test results show that our method is superior to the other tested network models, with a recognition accuracy of 94.8% on the nine-class task. In addition, ablation experiments show that the multi-scale feature extraction, feature screening, and local feature coding in this paper significantly improve the algorithm's ability to distinguish different cloud images.


I. INTRODUCTION
Clouds play an important role in conserving the earth's energy and are an important factor affecting global climate change. Cloud detection is an important part of meteorological observation. Accurately obtaining cloud information is of great significance to many fields such as flight support, weather forecasting, climate research, and astronomical site selection [1], [2].
In the past, the detection of clouds was generally based on manual observation. With the development of the manufacturing industry and the increasingly widespread application of automated observation technology in fields such as medicine and agriculture, cloud observation has also shifted from manual to automated observation [3]. A variety of sensor devices are increasingly used in cloud detection. However, the main disadvantages of such instrumental measurement methods based on physical characteristics are that the observation results have low resolution and are easily interfered with by factors such as air impurities and moisture, resulting in large observation errors [4], [5].
Researchers began to explore cloud image observation methods based on machine vision. Cloud images can be obtained through satellite and ground observation. Large-scale cloud information can be obtained from satellites, but its low resolution can easily cause misjudgments [6]. For example, limited by the viewing angle of the radiometer, small clouds are often ignored, and low clouds, thin clouds, and the surface may be confused. The data obtained by ground-based observation equipment can reflect the microstructure of clouds, which makes up for the shortcomings of satellite observation and is an important cloud observation method [7]. At present, meteorological departments have achieved the estimation of cloud height and cloud volume through instrumental measurements, but the classification of ground-based cloud images cannot be achieved through instrumentation. Vision-based cloud image recognition algorithms have therefore gradually become a research hot spot [1], [2]. However, the shape and angle of clouds vary greatly, and it is difficult to distinguish highly similar clouds with traditional feature extraction algorithms. In addition, the annotation of cloud images is difficult to scale to the training needs of deep convolutional networks, and most current network models can only distinguish cloud images coarsely and cannot achieve detailed classification. In order to meet practical application needs, this paper proposes a ground-based cloud image recognition system based on a convolutional neural network with feature selection and fusion. The main contributions of this paper are as follows.
1) A new ground-based cloud image feature extraction algorithm is proposed. Instead of directly using convolutional neural networks and fully connected layers for image recognition, the Multi-CNN in this article is only used to extract multi-scale features without participating in the image recognition process.
2) The Multi-CNN in this paper uses a pre-trained model for feature extraction and does not need a large number of ground-based cloud images to train the network, which enables the system to achieve good performance even when samples are scarce.
3) Instead of directly using convolutional features for image classification, this paper designs a feature selection and fusion network. Redundant local features are filtered out, and highly discriminative local features are encoded and fused, which improves the network's ability to distinguish different cloud images.

II. RELATED WORK
Since observation equipment based on physical characteristics can usually only obtain low-resolution numerical results, it is difficult to compare with manual observations and can only serve as auxiliary observation data. Therefore, meteorological observation stations often adopt automatic observation equipment based on visible-light images to obtain ground-based cloud images, and then analyze sky conditions through image processing or manual image reading. Commonly used vision-based ground-based cloud observation equipment falls into two types: scanning-based all-sky imaging systems and fisheye-based all-sky imaging systems. Scanning-based all-sky imaging systems usually use low-cost cameras and lenses mounted on a rotating pan/tilt that can rotate 360° and adjust the pitch angle within a range of 0° to 90°. This type of imaging system is mostly used for observation tasks that require high accuracy for a fixed azimuth and detailed cloud observation, but have low requirements for sky panoramas and cloud amount estimation [8]. The all-sky imaging system based on a fisheye lens mainly relies on the lens's large field of view (usually close to or even more than 180°) to image the sky panorama at one time. This type of imaging system is usually suitable for observation tasks that require panoramic observation and emphasize cloud cover estimation [9].
The classification and recognition of cloud shapes is an important and difficult task in ground-based cloud observation. The international cloud classification system developed by the WMO (World Meteorological Organization) divides clouds into ten categories and 28 sub-categories [10]. In practice, eight types of clouds mainly need to be observed and recorded: cumulus, stratus, stratocumulus, altocumulus, altostratus, cirrostratus, cirrocumulus, and cirrus. These eight cloud types, together with the clear sky, constitute the nine-category standard for manual observation. The classification and recognition of clouds is much more difficult than the recognition of other common targets. Therefore, many recognition methods for ground-based cloud images simplify the task: only some types of clouds are identified, or clouds are simply reclassified according to their thickness [5], [11], [12]. For example, Soumyabrata et al. [13] distinguished only sky, thin cloud, and thick cloud on the basis of ground-based cloud detection. These methods mainly apply a simplified classification standard, that is, similar cloud shapes are merged into the same category for identification. Their performance on the actual nine-class cloud shape recognition task is not satisfactory and difficult to meet application requirements.
In ground-based cloud recognition methods, one of the most important factors is feature extraction from the ground-based cloud image: the stronger the expressive ability of the extracted features, the better the recognition effect. Liu et al. [14] proposed a method that integrates high-level and low-level fusion outputs with deep visual features and deep multimodal features. Shi et al. [15] argue that locally rich texture information may be more important than global layout information, and proposed the DCPS network, based on shallow convolutional layer features, to classify cloud images. Ye et al. [16] used deep convolutional features to describe ground-based cloud images. With the application of convolutional neural networks in image processing, the number of cloud image recognition algorithms based on them keeps increasing. However, the background, texture, and other features of ground-based cloud images are of low complexity, and feature loss easily occurs during deep convolution. Therefore, how to appropriately extract cloud image features through a convolutional neural network has become an urgent problem.
After obtaining the feature description of the ground-based cloud image, selecting a suitable classifier for the extracted features also affects the final recognition result to a certain extent. Chethan et al. [17] proposed a texture-feature cloud classification method based on the Gabor transform and used an improved Support Vector Machine (SVM) classifier. Alireza et al. [18] used multilayer perceptron (MLP) neural networks and SVMs for automatic cloud detection in whole-sky images. Although the two approaches generally yield similar accuracies, the MLP gave better performance in some specific cases where the SVM produced poor accuracy.

III. PROPOSED METHOD
A cloud recognition system based on multi-scale convolution feature extraction, screening, and fusion is proposed in this paper. The system consists of four parts. The first part is a CNN model for image feature extraction at different scales. The second part is the feature screening module, which selects the more discriminative features from the extracted local features. The third part is feature coding and fusion, and the last part is cloud classification. The system architecture diagram is shown in Fig. 2.

A. MULTI-SCALE LOCAL FEATURE EXTRACTION NETWORK
The main challenges of refined cloud classification of ground-based cloud images are as follows: (1). Different cloud shapes cannot be characterized by the same type of features. For example, the shapes of the cumulus and altocumulus in Fig. 3 are quite distinctive, while the shapes of stratocumulus, cirrus, and stratus are not obvious and can only be characterized by texture or color.
(2). Some different clouds sometimes have similar parts. For example, in the green boxes of the cumulus and altocumulus in Fig. 3, there are half cloud and half sky with obvious edges; at the same time, the yellow boxes in the cirrus and stratus images contain similar textures. If sampling happens to land on such regions, confusion easily arises.
(3). Even for the same type of cloud, the scales often differ. For example, the red and yellow boxes of cumulus in Fig. 3 are both typical cumulus clouds, but at different scales. Therefore, in order to obtain the best ground-based cloud image features, it is necessary to capture image features at multiple levels and multiple scales, and the features should be insensitive to spatial position and angle. Convolutional neural networks have good multi-level feature extraction capabilities, which can meet these needs.
The network model designed in this paper is shown in Fig. 4. Because the ground-based cloud image scene is relatively simple, it is not suitable for deep convolutional networks. Therefore, this paper constructs a neural network (Multi-CNN) with five convolutional layers and two fully connected layers. Due to the limited collection of ground-based cloud images and the difficulty of labeling, there is currently no training set that can meet the needs of this article. To solve this problem, this paper uses a pre-trained model to extract features: Multi-CNN is first trained on a general large image data set such as ImageNet, and the trained model is then used to extract multi-layer convolution features from ground-based cloud images. It should be pointed out that this article only uses the convolutional layers of Multi-CNN for feature extraction; the fully connected layers used for classification during pre-training are not used.
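As a concrete illustration, collecting the feature maps of every convolutional layer (rather than only the last) can be sketched in a few lines of numpy. This is only a toy sketch: random kernels stand in for the pre-trained ImageNet weights, and the layer widths and image size are illustrative values, not the actual Multi-CNN configuration.

```python
import numpy as np

def conv2d(x, w):
    # valid cross-correlation + ReLU; x: (C, H, W), w: (K, C, 3, 3)
    K, C, kh, kw = w.shape
    h, wd = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((K, h, wd))
    for i in range(kh):
        for j in range(kw):
            # accumulate shifted products, equivalent to a sliding window
            out += np.tensordot(w[:, :, i, j], x[:, i:i + h, j:j + wd], axes=1)
    return np.maximum(out, 0.0)

def maxpool2(x):
    # 2x2 max pooling, truncating odd borders
    C, H, W = x.shape
    h2, w2 = H // 2, W // 2
    return x[:, :h2 * 2, :w2 * 2].reshape(C, h2, 2, w2, 2).max(axis=(2, 4))

def extract_multiscale(img, weights):
    # weights: one kernel bank per layer (random stand-ins for a
    # pre-trained model); every layer's output is kept, not just the last
    feats, x = [], img
    for w in weights:
        x = maxpool2(conv2d(x, w))
        feats.append(x)
    return feats

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 96, 96))           # toy 3-channel input
weights = [rng.standard_normal((8, c, 3, 3)) * 0.1 for c in (3, 8, 8, 8, 8)]
feats = extract_multiscale(img, weights)
print([f.shape for f in feats])                  # 5 maps, coarser and coarser
```

Each element of `feats` corresponds to one level of the multi-scale description; the subsequent screening and encoding stages consume these maps.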
Part of the features of the five convolutional layers are randomly selected and visualized by deconvolution, as shown in Fig. 5. It can be seen that the first convolutional layer mostly responds to simple single textures or local gradients in different directions, the second layer characterizes some combined textures, and the third layer can characterize structures and contours generated by texture. By analogy, the deeper the convolutional layer, the higher-level and more abstract the semantic features it represents. In order to analyze the effect of the Multi-CNN model on ground-based cloud images, we first extract the features of each convolutional layer through Multi-CNN, and then reconstruct the input image from the features of each layer, as shown in Fig. 6. Shallow convolution features retain finer local details; for deep convolution features, some low-level detail is filtered out or transformed into more abstract high-level semantic information. It can be seen that the convolutional layers of Multi-CNN can extract and retain rich multi-scale, multi-level features, which meets the needs of subsequent classification.

B. FEATURE SCREENING NETWORK
As mentioned earlier, one of the difficulties in cloud-shape classification of ground-based cloud images is that different types of cloud images may share similar parts. Therefore, finding the key differences between cloud shapes is crucial. It can be seen intuitively from Fig. 7 that the local areas in the green and yellow boxes of the cumulus, cirrocumulus, and cirrus images are very similar and hard to distinguish, while the local area in the red box is the key to distinguishing these three ground-based cloud images. Therefore, when recognizing different types of cloud images, it is not necessary to extract and retain all local information; some local information may even weaken the differences between cloud shapes and harm the final result. Extracting key, discriminative local information matters more than extracting all local information as completely as possible. To further verify this point, we randomly selected 40 ground-based cloud images for each cloud shape, for a total of 360 images over nine cloud shapes, and used Multi-CNN for feature extraction. Taking the conv5 layer as an example, it contains 13 × 13 = 169 feature blocks, and each feature block corresponds to an image block in the ground-based cloud image. With 360 images in total, there are 169 × 360 = 60840 local image blocks. We cluster the features corresponding to these local image blocks, and then count the distribution of each type of local feature over the nine cloud types.
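The bookkeeping behind these counts can be sketched as follows: each spatial cell of a conv feature map is treated as one local feature vector, and the cells of all images are stacked into a single pool for clustering. Random arrays stand in for real conv5 maps here, and the channel count is reduced from 512 to 64 to keep the example small; the 169 × 360 = 60840 total is unchanged.

```python
import numpy as np

def local_features(feature_map):
    # feature_map: (C, H, W) conv output; each of the H*W spatial cells
    # becomes one C-dimensional local feature vector
    C, H, W = feature_map.shape
    return feature_map.reshape(C, H * W).T        # (H*W, C)

rng = np.random.default_rng(0)
# stand-in for the conv5 maps of 360 images (64 channels instead of 512)
maps = rng.standard_normal((360, 64, 13, 13))
all_feats = np.concatenate([local_features(m) for m in maps])
print(all_feats.shape)                            # 169 blocks x 360 images
```

The pooled matrix `all_feats` is what the DP clustering of the next subsection operates on.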
Table 1 shows the results of clustering the local image blocks into five categories. It can be seen that the fifth type of local image block is an untextured area close to clear sky, and such local features appear in almost all ground-based cloud images. Obviously, features of this type of local area are of little help for distinguishing cloud categories; on the contrary, too much such local information causes confusion and interference in classification. This type of local feature can be considered redundant, and it is necessary to remove such redundant local features before local feature encoding.
In order to get a better clustering effect, this paper adopts a density-peak-based clustering method (DP clustering). Compared with the common K-means method, the results of DP clustering are more deterministic.
In DP clustering, the selection of cluster center points is mainly based on two considerations: 1) the density at the sample point is greater than at surrounding points; 2) the distance between the sample point and any point of greater density is as large as possible. Based on this goal, DP clustering first uses the cut-off kernel to define the local density of sample point i:

ρ_i = Σ_{j≠i} χ(d_ij − d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, (1)

where d_c is a preset cut-off distance within which the density around a sample point is counted, and d_ij is the distance between sample points i and j. After obtaining the local density, the distance offset of each sample point is defined as:

δ_i = min_{j: ρ_j > ρ_i} d_ij, if point i is not the point with the highest local density (condition 1);
δ_i = max_j d_ij, if point i is the point with the highest local density (condition 2). (2)

That is, the distance offset is the minimum distance from point i to any sample point with a local density greater than that of point i, while the point with the highest local density is treated separately, taking its maximum distance to all points. After the local density (ρ) and distance offset (δ) are defined, cluster center points can be determined as the sample points with large ρ and large δ. As shown in Fig. 8, for the sample distribution in the left picture, plotting the ρ-δ coordinate map shows that points No. 1 and No. 10 stand out from the other points and are most suitable as cluster centers.

When statistically screening the local convolution features at each level of Multi-CNN, all local features of that layer in the sample images are first clustered into M categories using DP clustering, and each local feature obtains a cluster label c ∈ {1, 2, ..., M}. According to the cloud type of the original image where each local feature is located, each local feature is also given a cloud label l ∈ {1, 2, ..., L}, where L is the total number of cloud types. Then the distribution of each type of local feature over each cloud type, denoted D(c_i, l_j), is counted as:

D(c_i, l_j) = (c_i ∼ l_j) / c_iALL, (3)

where c_i ∼ l_j is the number of local features with cluster label i whose images carry cloud label j, and c_iALL is the total number of local features with cluster label i. This gives the distribution of the local features of each cluster label over the various ground-based cloud images. If the local features of a cluster label are evenly distributed over the images of all cloud types, these local features are not very helpful for cloud recognition. On the contrary, if they are unevenly distributed, the discriminative ability of this type of local feature is strong. We therefore define the discriminative ability of the i-th cluster label, disc(c_i), as a measure of how far the distribution D(c_i, ·) deviates from uniform over the L cloud types. The cluster labels are sorted by disc(c_i); the local features corresponding to the cluster label with the smallest discriminative ability are eliminated, and the other local features are retained. Fig. 9 and Fig. 10 respectively show examples of local image blocks eliminated and retained after screening, and their distribution over the various ground-based cloud images.
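A minimal numpy sketch of the two DP quantities, local density ρ and distance offset δ, follows. Synthetic 2-D blobs (not cloud features) keep it self-contained; density ties are broken by index so that exactly one point counts as the global peak.

```python
import numpy as np

def dp_rho_delta(X, d_c):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # cut-off kernel: rho_i = number of samples within distance d_c of i
    rho = np.sum(d < d_c, axis=1) - 1            # exclude the point itself
    idx = np.arange(n)
    delta = np.empty(n)
    for i in range(n):
        # points of strictly higher density (ties broken by index)
        higher = (rho > rho[i]) | ((rho == rho[i]) & (idx < i))
        if higher.any():
            delta[i] = d[i, higher].min()        # nearest denser point
        else:
            delta[i] = d[i].max()                # the single densest point
    return rho, delta

rng = np.random.default_rng(1)
# two well-separated blobs; each blob's density peak should combine a
# large rho with a large delta, marking it as a cluster centre
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
rho, delta = dp_rho_delta(X, d_c=0.5)
centers = np.argsort(rho * delta)[-2:]           # top-2 rho*delta points
print(np.sort(X[centers, 0]).round(1))
```

Sorting by the product ρ·δ is one simple way to read the ρ-δ plot of Fig. 8 programmatically; the original method inspects the plot for outstanding points.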
It can be seen from Fig. 9 that the image blocks corresponding to the eliminated local features are very flat, almost without texture, and similar in color. They are widely and relatively evenly distributed over the various ground-based cloud images, with the highest distribution ratio not exceeding 14%. In contrast, more than 30% of the image blocks corresponding to the retained features shown in Fig. 10 are concentrated in cumulus, while they hardly appear in clear sky, stratus, or altostratus images, and their distribution over the other cloud types also differs. However, in actual online testing or observation, images are often input one by one, so clustering for every input is time-consuming, and clustering on a single image is not representative. Therefore, the cluster centers of all cluster labels obtained during training are recorded. During testing, the local features extracted from an image are compared directly against these cluster centers, and the category of the closest cluster center is taken as the cluster label of each local feature. Local features belonging to the cluster labels with the weakest discriminative ability are eliminated directly and do not participate in subsequent feature encoding.
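The screening and the online assignment step can be sketched together as follows. All data are synthetic, and, since the paper's exact disc(·) formula is not reproduced here, the per-row variance of D serves as a stand-in score: a near-uniform distribution over cloud types scores low and is eliminated.

```python
import numpy as np

def distribution(cluster_labels, cloud_labels, M, L):
    # D[i, j]: fraction of cluster-i local features coming from cloud type j
    D = np.zeros((M, L))
    for c, l in zip(cluster_labels, cloud_labels):
        D[c, l] += 1
    return D / D.sum(axis=1, keepdims=True)

def least_discriminative(cluster_labels, cloud_labels, M, L):
    D = distribution(cluster_labels, cloud_labels, M, L)
    disc = D.var(axis=1)          # stand-in for disc(c_i)
    return int(np.argmin(disc))   # cluster to eliminate

def assign_and_screen(features, centers, drop):
    # online use: nearest recorded cluster centre instead of re-clustering;
    # features assigned to the dropped cluster are discarded
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return features[d.argmin(axis=1) != drop]

rng = np.random.default_rng(0)
M, L, n = 5, 9, 6000
clusters = rng.integers(0, M, n)
clouds = clusters.copy() % L            # clusters 0,1,3,4: tied to one cloud type
mask = clusters == 2                    # cluster 2: spread evenly over clouds
clouds[mask] = rng.integers(0, L, mask.sum())
drop = least_discriminative(clusters, clouds, M, L)
print(drop)                             # the near-uniform cluster is eliminated

centers = np.array([[0.0, 0.0], [5.0, 5.0]])
feats = np.array([[0.1, 0.0], [4.9, 5.2], [0.2, -0.1]])
print(assign_and_screen(feats, centers, drop=0))
```

The recorded `centers` replace repeated clustering at test time, matching the online procedure described above.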

C. FEATURE CODING AND FUSION NETWORK
After obtaining the filtered local features, the next step is to use them to form global features of the ground-based cloud image for classification. When encoding global features from convolutional local features, note that, due to the fixed structure of the convolutional neural network, the size of the input image determines the size of the feature map output by each convolutional layer; the encoding process therefore must not depend on the number of local feature points. In addition, the semantics of ground-based cloud images (cloud types and characteristics) are unrelated to geometric position and spatial distribution, so the result of feature encoding should be as insensitive as possible to image rotation and geometric transformation.
The function of the fully connected layers in Multi-CNN is actually similar to a nonlinear feature encoding based on neural networks. Fig. 11 shows a schematic diagram of using a fully connected layer to re-encode the local feature map of a convolutional layer. Fig. 11(b) shows the local feature map output by a convolutional layer; each point in the figure corresponds to a local area in the original input image, as marked in red, green, and yellow. The simplest local feature encoding method is to concatenate the local features in a fixed order, but this is simple and inefficient. The fully connected network shown in Fig. 11(c) applies a nonlinear mapping on top of this, avoiding the disadvantages caused by changes in the number of local features under simple concatenation. The nonlinearity of the fully connected network weakens the geometric space constraints to a certain extent, but it still requires the local features to be arranged in a fixed order, which implicitly encodes spatial location information.
This paper adopts another local feature encoding method based on the Fisher Vector. Fisher Vector coding first assumes that all image information can be characterized by several feature points or feature descriptors. These feature points are assumed to be independently and identically distributed and can be modeled by a Gaussian Mixture Model (GMM):

p(x) = Σ_{k=1}^{K} ω_k N(x | µ_k, δ_k). (7)
where p(x) is the probability of feature point x, K is the number of Gaussian components, ω_k is the weight of the k-th component, and N(x|µ_k, δ_k) is the k-th Gaussian distribution with mean µ_k and standard deviation δ_k. Suppose the feature points or local features x_i of an image X appear in the mixed Gaussian space with probability p(x_i|λ), where λ = {ω_k, µ_k, δ_k, k = 1, ..., K} is the parameter set of the GMM. The probability of image X can then be expressed as:

p(X|λ) = Π_{i=1}^{N} p(x_i|λ), (8)

where N is the number of feature points or local features contained in image X. For convenience of calculation, the logarithm is usually taken before proceeding:

L(X|λ) = log p(X|λ) = Σ_{i=1}^{N} log p(x_i|λ). (9)

Since p(x_i|λ) is a weighted mixture of Gaussian distributions, this becomes:

L(X|λ) = Σ_{i=1}^{N} log Σ_{k=1}^{K} ω_k N(x_i | µ_k, δ_k). (10)

Writing the posterior of component k given x_i as:

γ_i(k) = ω_k N(x_i|µ_k, δ_k) / Σ_{j=1}^{K} ω_j N(x_i|µ_j, δ_j), (11)

the partial derivatives of L with respect to the weight, mean, and standard deviation are:

∂L/∂ω_k = Σ_i γ_i(k)/ω_k,  ∂L/∂µ_k = Σ_i γ_i(k)(x_i − µ_k)/δ_k²,  ∂L/∂δ_k = Σ_i γ_i(k)[(x_i − µ_k)²/δ_k³ − 1/δ_k]. (12)
where K is the number of Gaussian components in the GMM, from which we obtain K groups of such partial derivatives. Assuming the feature point or local feature x_i is D-dimensional, the parameters µ_k and δ_k and their corresponding partial derivatives are also D-dimensional vectors. These partial derivatives can therefore be concatenated into a K × (2D + 1)-dimensional feature vector, called the Fisher Vector. Feature encoding based on the Fisher Vector not only forms a global feature independent of the number of local feature points, but also increases the feature dimension and makes the features more linearly separable. To compare the sensitivity of the two feature encoding methods to image rotation and similar changes, the same ground-based cloud image is rotated and flipped, as shown in Fig. 12, and the global features produced by the different encoding methods are compared. Both encoding methods operate on the feature map output by conv5 of Multi-CNN. The result of fully connected layer encoding is taken directly from the output of the FC7 layer. Since the feature map output by the conv5 layer has 512 channels, the number of Gaussian components of the Fisher Vector is set to 4, so that the dimension of the resulting Fisher Vector is 512 × 2 × 4. Taking the six images shown in Fig. 12 as input, the differences in feature expression caused by image flipping and rotation under the two encoding methods can be measured. As shown in Table 2 and Table 3, when the image changes, the feature change under Fisher Vector encoding is much smaller than under fully connected layer encoding. Therefore, this article chooses the Fisher Vector method for feature encoding and fusion.
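A minimal numpy sketch of Fisher Vector encoding follows. The GMM parameters are fixed random values here; in practice they would be fitted (e.g. by EM) on training features. The weight-gradient term follows one common form, and the dimensions are toy-sized (D = 8, K = 4) rather than the paper's 512-channel conv5 features.

```python
import numpy as np

def gaussian(X, mu, sigma):
    # diagonal Gaussian density; X: (N, D), mu/sigma: (D,)
    z = (X - mu) / sigma
    return np.exp(-0.5 * (z * z).sum(axis=1)) / np.prod(np.sqrt(2 * np.pi) * sigma)

def fisher_vector(X, w, mu, sigma):
    # X: (N, D) local features; w: (K,), mu/sigma: (K, D) fixed GMM params
    N, D = X.shape
    K = len(w)
    dens = np.stack([w[k] * gaussian(X, mu[k], sigma[k]) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)          # posteriors gamma_i(k)
    fv = []
    for k in range(K):
        g = gamma[:, k:k + 1]
        fv.append([gamma[:, k].sum() / w[k]])                              # dL/dw_k
        fv.append((g * (X - mu[k]) / sigma[k] ** 2).sum(axis=0))           # dL/dmu_k
        fv.append((g * ((X - mu[k]) ** 2 / sigma[k] ** 3
                        - 1.0 / sigma[k])).sum(axis=0))                    # dL/dsigma_k
    return np.concatenate([np.ravel(v) for v in fv])        # K*(2D+1) dims

rng = np.random.default_rng(0)
X = rng.standard_normal((169, 8))    # 169 local features, toy dimension D=8
w = np.full(4, 0.25)                 # K=4 components, as in the paper
mu = rng.standard_normal((4, 8))
sigma = np.ones((4, 8))
fv = fisher_vector(X, w, mu, sigma)
print(fv.shape)
```

Because the Fisher Vector is a sum over local features, permuting or reordering them leaves it unchanged, which is the order-independence underlying the rotation comparison above.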
After the global features are obtained by feature encoding and fusion, a multilayer perceptron (MLP) [18] is selected for image classification.

IV. EXPERIMENTS AND ANALYSIS

A. DATASET
The data set in this article is mainly derived from the scanning all-sky imaging system developed by the Observation Center of the National Meteorological Administration of China. These images were taken at the Beijing Observatory. More than 1,300 images were selected by experienced meteorological observation experts and labeled into nine categories: clear sky, cumulus, stratocumulus, stratus, altostratus, altocumulus, cirrocumulus, cirrostratus, and cirrus, with 60 to 200 images per category. Due to the similarity of some clouds, it is difficult for non-professional meteorological observers to see the differences, such as altocumulus and cirrocumulus, cirrus and cirrostratus, and stratus and altostratus. Therefore, this paper uses two classification standards to test the algorithms separately, as shown in Fig. 13 and Fig. 14. The low-level difficulty uses six categories, merging each pair of very similar ground-based cloud types into one category. The high-level difficulty uses nine categories, with each cloud type as its own category.

B. GROUND-BASED CLOUD IMAGE RECOGNITION EXPERIMENT
The algorithm in this paper extracts multi-layer, multi-scale convolution features with the Multi-CNN convolutional neural network, encodes and fuses the screened local features, and then classifies them through the MLP. The ground-based cloud recognition effect is tested under both the six-class and nine-class standards. To reflect the robustness of the classification method, ten rounds of random experiments are carried out for each group: in each round, some images are randomly selected as the training set and the remaining images are used as the test set, and the average of the ten rounds is taken as the final result. In addition, other advanced methods are tested on the same data set. Table 4 and Table 5 give the results under the six-class and nine-class standards respectively. The method in this paper has three variants: 'Ours(conv5)' uses only conv5 convolution features, 'Ours(conv345)' uses the multi-scale features of conv3, conv4, and conv5, and 'Ours(conv-all)' uses all five convolutional layers.
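The ten-round evaluation protocol can be sketched as follows. `accuracy_fn` is a placeholder standing in for training and testing the full pipeline on a given split; here it simply reports the test fraction so the sketch stays self-contained.

```python
import numpy as np

def random_rounds(labels, n_train_per_class, rounds, accuracy_fn, seed=0):
    # repeat: draw a fixed number of training images per class at random,
    # evaluate on the rest, and average the accuracies over all rounds
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(rounds):
        train = np.zeros(len(labels), dtype=bool)
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            train[rng.choice(idx, n_train_per_class, replace=False)] = True
        accs.append(accuracy_fn(train))
    return float(np.mean(accs))

labels = np.repeat(np.arange(9), 60)      # 9 classes, 60 images each (toy sizes)
mean_acc = random_rounds(labels, 40, 10, lambda train: (~train).mean())
print(round(mean_acc, 3))
```

Averaging over several random splits, as in the paper, reduces the variance that a single lucky or unlucky split would introduce with so few images per class.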
From Table 4 and Table 5, we can see that the method in this paper outperforms the other methods under both classification standards. Under the nine-class standard, the training samples of each category are fixed at 40 images. Comparing the penultimate column of Table 4 with the last column of Table 5, the nine-class accuracy of our method exceeds its six-class accuracy. There are two reasons. On the one hand, under the six-class standard, only 40 samples of each merged cloud pair are selected as training samples, 240 images in total over six classes, whereas under the nine-class standard a total of 360 images participate in training; the total number of training samples affects the coverage of the training set and the generalization of the trained model. On the other hand, for the merged cloud pairs, randomly selected training samples may be unevenly distributed between the two original types; when the features of the two types are well separated, this unbalanced selection can interfere with the classifier. The individual recognition of cirrus and cirrostratus, cirrocumulus and altocumulus, and stratus and altostratus is very good. These phenomena indicate that the ground-based cloud image features obtained by our method distinguish the various cloud shapes well and have strong characterization and discrimination capabilities. Figure 15 shows the ROC curves of each network under the nine-class standard. The network in this paper has the highest recognition accuracy and the largest AUC. The number of training samples is small and the samples include rotated images, while the test demands finer classification, so the other networks in the test are unstable and their accuracy decreases.
However, the model presented in this paper maintains good accuracy and robustness even with a small number of training samples.
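For reference, the AUC reported for each ROC curve in Fig. 15 can be computed per class in one-vs-rest fashion. The following is a minimal numpy sketch of the binary case using the rank-sum formulation, assuming no tied scores (with ties, average ranks would be needed):

```python
import numpy as np

def binary_auc(scores, y):
    """AUC for one class (one-vs-rest), via the rank-sum formulation.
    scores: classifier confidence for the positive class; y: 0/1 labels."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    # Mann-Whitney U statistic normalized to [0, 1]
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A multi-class AUC is then obtained by averaging the per-class one-vs-rest values.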

C. EFFECT EXPERIMENT OF LOCAL FEATURE CODING
Earlier, this paper qualitatively analyzed the sensitivity of the Fisher Vector encoding to image rotation by comparing a set of sample images. To further verify that the Fisher Vector encoder outperforms a fully connected (FC) encoder, a set of comparative experiments is designed on the local features output by conv5. Three configurations are compared: (1) encoding the conv5 local features with two fully connected layers (FC); (2) concatenating the conv5 local features directly in sequence (Conv5); (3) encoding the conv5 local features with a Fisher Vector built on a four-component Gaussian mixture model (FV). Table 6 and Table 7 show the performance of the three configurations on ground-based cloud image recognition. Fisher Vector encoding performs best, while the fully connected features perform almost the same as the directly concatenated conv5 local features. This suggests that when local convolutional features are fed into a fully connected layer, their spatial arrangement is implicitly encoded in the layer's output. Because ground-based cloud images exhibit only weak spatial constraints, Fisher Vector-based local feature encoding is better suited to them.
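A minimal sketch of the Fisher Vector encoding in configuration (3), assuming the four-component diagonal GMM has already been fitted. We include the standard mean and variance gradients with power and L2 normalization; the paper's exact implementation details may differ:

```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    """Fisher Vector of local descriptors X (T x D) under a diagonal GMM
    with weights w (K,), means mu (K x D), and variances sigma2 (K x D).
    Returns the concatenated mean and variance gradients (length 2*K*D)."""
    T, D = X.shape
    # Posterior responsibilities gamma_t(k), computed in log space
    diff = X[:, None, :] - mu[None, :, :]                      # (T, K, D)
    log_p = -0.5 * (np.sum(diff**2 / sigma2, axis=2)
                    + np.sum(np.log(2 * np.pi * sigma2), axis=1))
    log_w = np.log(w) + log_p
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # (T, K)
    # Gradients w.r.t. means and (diagonal) variances
    u = diff / np.sqrt(sigma2)                                 # (T, K, D)
    g_mu = (gamma[:, :, None] * u).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (u**2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

Because the descriptors are pooled through the GMM posteriors rather than by position, the encoding discards the spatial arrangement of the local features, which matches the weak spatial constraints of cloud images noted above.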

D. MULTI-LEVEL AND MULTI-SCALE FEATURES EFFECT VERIFICATION EXPERIMENT
The algorithm in this paper uses multi-scale, multi-level feature information from all five convolutional layers, each of which expresses different semantic information. To explore the contribution of each level, the features of each convolutional layer are used independently and their performance is compared. The results are shown in Fig. 16 and Fig. 17. When a single convolutional layer is used alone, the accuracy of every layer except the second is consistently lower under the nine-class standard than under the six-class standard. The features of the second convolutional layer in fact achieve a fairly good discrimination effect, and from the third layer onward the benefit of deeper layers is no longer obvious. This is mainly because the pre-trained model encodes feature descriptions learned from a large number of ordinary scene images and has not been retrained or fine-tuned for the cloud recognition task. It also suggests that the high-level semantics of ground-based cloud images are not especially complicated and do not require a very deep convolutional network. To verify the effect of multi-scale, multi-level convolutional features on recognition, the number of convolutional layers used is increased step by step. As shown in Fig. 18 and Fig. 19, from using only the fifth convolutional layer (conv5) to using all five layers (conv-all), each additional scale or level improves the classification performance, with the largest gain coming from adding conv2. The number of training samples per category in this experiment is 80. When all convolutional layer features are used, the recognition accuracy under the nine-class standard reaches 94.8%.
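Schematically, treating each convolutional feature map as a grid of local descriptors and fusing the per-layer encodings might look as follows. This is a simplified numpy sketch; the `encode` callback stands in for the paper's screening-plus-Fisher-Vector step, and mean pooling is used only as a trivial placeholder:

```python
import numpy as np

def layer_descriptors(fmap):
    """View a conv feature map (C, H, W) as H*W local descriptors of dim C."""
    C, H, W = fmap.shape
    return fmap.reshape(C, H * W).T

def multi_scale_features(fmaps, encode):
    """Encode each layer's descriptors separately, then concatenate, so that
    shallow texture cues and deep semantic cues are both preserved."""
    return np.concatenate([encode(layer_descriptors(f)) for f in fmaps])
```

Here `encode` maps an (N, C) descriptor set to a fixed-length vector, e.g. `lambda d: d.mean(axis=0)`; encoding each layer before fusion avoids mixing descriptors of different dimensionality and semantic level.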
In addition, Fig. 18 and Fig. 19 show that with sufficient training samples, as the number of convolutional layers used increases, the accuracy under the nine-class standard eventually exceeds that under the six-class standard. This indicates that the network has a strong ability to extract and discriminate the features of different cloud images.

E. FEATURE SCREENING EFFECT VERIFICATION EXPERIMENT
An important part of this paper is the screening of local features, which is also the key to fine-grained cloud recognition and classification. Therefore, a set of comparative experiments is set up to test the effect of local feature screening on cloud image recognition.
In this experiment, feature screening is tested on the conv3, conv4, and conv5 features both independently and after fusing the three levels. The experimental results are shown in Table 8 and Table 9. Whether single-level or multi-level convolutional features are used, feature screening significantly improves cloud classification and recognition. During feature screening, the clustering result can also strongly influence the screening effect. This paper uses DP clustering with the number of clusters set to 10, and the previous sections have shown that DP clustering outperforms the commonly used K-means clustering. To further investigate how the clustering method and the number of clusters affect the recognition results, another set of comparative experiments is set up: the local features of conv3, conv4, and conv5 are used independently as the basis, and cluster counts of 3, 5, 8, and 10 are tested for both K-means and DP clustering. The results are shown in Fig. 20. The improvement from DP clustering is more pronounced, and in some cases K-means clustering fails to improve the final classification at all. The screening effect is best with 10 clusters, and too few clusters (e.g., 3) weakens the improvement.
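The density-peaks (DP) clustering used for screening follows Rodriguez and Laio's formulation: a point's local density rho and its distance delta to the nearest higher-density point jointly identify cluster centers. A minimal numpy sketch follows; the Gaussian-kernel density and the percentile heuristic for the cutoff distance `dc` are our assumptions, and the paper's parameterization may differ:

```python
import numpy as np

def dp_cluster(X, n_clusters, dc=None):
    """Minimal density-peaks clustering: rho is a Gaussian-kernel local
    density, delta is the distance to the nearest higher-density point,
    and centers are the points with the largest rho * delta."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if dc is None:
        dc = np.percentile(d[d > 0], 2.0)          # heuristic cutoff distance
    rho = np.exp(-((d / dc) ** 2)).sum(axis=1) - 1.0
    order = np.argsort(-rho)                       # points by decreasing density
    n = len(X)
    delta = np.zeros(n)
    nearest_higher = np.zeros(n, dtype=int)
    delta[order[0]] = d.max()                      # densest point: global max
    nearest_higher[order[0]] = order[0]
    for i in range(1, n):
        j, higher = order[i], order[:i]
        k = higher[np.argmin(d[j, higher])]
        delta[j], nearest_higher[j] = d[j, k], k
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for j in order:                                # assign in density order
        if labels[j] < 0:
            labels[j] = labels[nearest_higher[j]]
    return labels, centers
```

Unlike K-means, no iterative refinement is needed: once the centers are picked, each remaining point inherits the label of its nearest higher-density neighbor, which may explain the more stable screening behavior observed in Fig. 20.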

V. CONCLUSION
This paper proposes a ground-based cloud image recognition system based on convolutional neural networks and feature screening and fusion. Multi-level, multi-scale convolution features are extracted through the convolutional layers of Multi-CNN, and local features with strong resolving power are selected through a feature screening algorithm based on DP clustering. The local features are then encoded and fused into global features using the Fisher Vector, and an MLP performs the classification prediction. Comparative experiments show that the algorithm has a strong ability to extract and discriminate the features of different cloud images.
MA JINGYI received the master's degree in information and communication from the University of Science and Technology of China, in 2008. He is currently working with the Gansu Branch, China Meteorological Administration Training Centre. He has more than ten years of research experience in the field of information and communication.
He has published more than 20 academic articles in this field in peer-reviewed journals at home and abroad.
TIEJUN ZHANG, photograph and biography not available at the time of publication.
JING GUODONG, photograph and biography not available at the time of publication.
YAN WENJUN, photograph and biography not available at the time of publication.
YANG BIN, photograph and biography not available at the time of publication.