Remote Sensing Scene Classification Via Multigranularity Alternating Feature Mining

Models based on convolutional neural networks (CNNs) have achieved remarkable advances in high-resolution remote sensing (HRRS) image scene classification, but challenges remain due to the high similarity among different categories and the loss of local information. To address this issue, a multigranularity alternating feature mining (MGA-FM) framework is proposed in this article to learn and fuse both global and local information for HRRS scene classification. First, a region confusion mechanism is adopted to guide the network's shallow layers to adaptively learn the salient features of distinguishing regions. Second, an alternating comprehensive training strategy is designed to capture and fuse shallow local feature information and deep semantic information to enhance feature representation capabilities. In particular, the MGA-FM framework can be flexibly embedded in various CNN backbone networks as a training mechanism. Extensive experimental results and visualization analysis on three remote sensing scene datasets indicate that the proposed method achieves competitive classification performance.


I. INTRODUCTION
Scene classification from high-resolution remote sensing (HRRS) images is a key task of intelligent remote sensing information processing. Scene land-use information and its dynamic updates are very helpful for characterizing surface conditions and their evolution, supporting resource management and optimization in conjunction with urban and rural planning, assessing climate and ecosystem changes, and promoting sustainable development [1], [2], [3], [4], [5]. HRRS images contain rich and detailed structural features and abundant spatial patterns. The purpose of HRRS scene classification is not to classify pixels or objects, but to identify and label the whole scene image. However, due to the diversity of resolutions and the complexity of ground objects covered by HRRS, that is, high interclass similarity and large intraclass diversity, HRRS scene classification is still a challenging task [6].
Recently, many deep learning (DL) models (such as AlexNet [7], VGG [8], GoogleNet [9], and ResNet [10]) have achieved significant success in HRRS classification [11], as they can extract more class-specific features with deep neural networks in an end-to-end learning manner. Improving the ability of feature representation is the key to classification performance, and it is also the research goal of this article. Many convolutional neural network (CNN) models pretrained on ImageNet have been transferred to HRRS scene classification [12], [13], [14], [15], [16], [17], [18], [19], [20] to extract more abstract and semantic features. In the supervised setting, researchers have also done much work on data augmentation [21], [22] and on improving feature representation performance. These works have achieved promising performance for HRRS scene classification, but some limitations remain. First, the high similarity among different categories leads to a lack of alignment between the learned features of an image and its corresponding semantic labels, which limits the performance of semantic features [23]. As shown in Fig. 1(a), the global features learned from the two images are different; however, they belong to the same category "airport," since the category-specific recognition objects they share, the aircraft and the runway, provide key local features for classification and recognition. On the contrary, in Fig. 1(b), "dense residential" is very similar to "commercial." To distinguish them, we cannot rely only on local features, since they contain the same object categories; we need more global features, such as a more unified building style, to judge. Second, the downsampling operations and feed-forward mechanism of the CNN make it depend heavily on global feature information to obtain its output, which may cause the loss of local feature information [24], [25] and, as a result, misclassification.
To address the aforementioned problems, we put forward the hypothesis that the discriminative local detailed information of HRRS images naturally lies in image patches of different granularities. We adopt a region confusion mechanism (RCM) [26] to partition the input image into local patches, which are then spliced together into an image with smaller granularity levels of information. This mechanism shuffles the local regions to destroy the global structure, in order to better force the network to learn from the distinguishable area details for classification. However, when disrupting the original image, the RCM may introduce noise, which makes the network learn some unnecessary features, such as edge noise and uncorrelated regional connectivity features [26]. Therefore, we designed an alternating comprehensive training strategy: first, the output images generated by the RCM are used to train the shallow convolution blocks of the network to pay attention to local details; then, the whole network is trained on the original images; and finally, cross-granularity feature fusion is carried out through training. The whole strategy aims to ensure the semantic relevance of the two types of features through the alternating learning of local and global features.

Fig. 1. Images selected from very HRRS in AID to clearly show the relationship between scene categories. (a) Both images are selected from the category "airport." (b) The left and right images are selected from the "dense residential" and "commercial" categories, respectively.
Combining the RCM and the alternating comprehensive training strategy, we propose a multigranularity alternating feature mining (MGA-FM) framework to improve HRRS scene classification performance. It promotes the network to obtain a more comprehensive feature representation, including local detailed information learning and cross-granularity feature fusion during training; the complementary relationship between the two is fully explored in the alternating comprehensive training. Since the framework has two purposes, learning local detailed features and learning and fusing global features, we call it feature mining. The main contributions of this article can be summarized as follows.
1) To guide the shallow convolution blocks of the network to pay more attention to discriminative regions and learn local features without additional annotation information, we adopt the RCM, partitioning the input image into local patches and then splicing them together into an image with smaller granularity levels of information.
2) An alternating comprehensive training strategy is put forward to promote the network to extract and fuse useful local and global features more efficiently. Three training steps (local feature learning, global feature learning, and cross-granularity feature fusion) are carried out alternately. On the one hand, this avoids the loss of important local details in deep network training; on the other hand, it prevents the noise patterns caused by overfitting to the RCM and seeks the semantic correlation between different regions of the whole image.
3) A framework combining the RCM with the alternating comprehensive training strategy, namely MGA-FM, is proposed and applied to HRRS image scene classification. The framework requires no prior knowledge during training and, at inference, only the computational cost of the standard classification network feed-forward plus the feature fusion branch. Furthermore, the framework is effective on various CNN backbone networks.
4) Experimental results and visualization analysis on three datasets (UCM, AID, NWPU-45) show that our proposed MGA-FM framework achieves compelling results on the HRRS scene classification task.
The rest of this article is arranged as follows. Some related works are briefly reviewed in Section II. Section III describes the proposed method in detail. Experimental results and discussion are carried out in Section IV, while Section V concludes this article.

II. RELATED WORK

A. Feature Representation for HRRS
Feature representation plays a key role in HRRS scene classification. According to their generation method and time of introduction, image features can be divided into three categories: hand-crafted features, unsupervised features, and supervised features. The performance of feature representation has improved gradually, and among these categories, features obtained by DL in supervised learning have achieved the best performance and are the most applicable to HRRS scene classification. Several visualization experiments have explored the relationship between feature layers and semantics; that is, the semantics covered by different CNN feature layers differ [24], [25], [27], and local features carry different semantics than global features. This means that local information is inevitably lost after passing through the pooling layers, and how to address this loss of local feature information becomes a great challenge [28], [29], [30], [31], [32].

B. Attention Mechanism
In recent years, the attention mechanism has become an important method for reducing local information loss. In [28], [33], [34], [35], [36], [37], attention mechanisms were introduced to obtain discriminative regional features, which improved classification performance. The authors in [30] and [38] combined discriminative local feature extraction with multibranch feature fusion and attention algorithms to further gain competitive performance. Wang [38] and Zeng [39] cropped the input image to learn multigrained regions, which helps guide the network to learn more discriminative local detailed features, but the semantic correlation between different regions of the whole image is ignored. In order to address the loss of local features while avoiding additional network structural complexity, we adopt an RCM [26] to partition the input image into local patches and then splice them together into an image with smaller granularity levels of information. This mechanism shuffles the local regions to destroy the global structure, in order to better force the network to learn from the distinguishable area details for classification. Traditional attention mechanisms need supervised or weakly supervised information to train the network to obtain attention locations, so inference is slow; in contrast, the RCM only needs to introduce a data augmentation method during training to make the network learn the discriminative regions, so no attention locations need to be inferred at test time.

C. Feature Fusion
Fusing features of different layers is another main attempt to address the loss of local feature information [28], [29], [30], [31], [32]. Some works regarded deep CNN models as feature extractors and combined the extracted features with feature coding techniques. For instance, the method in [12] selected multiscale dense CNN output vectors of the last convolutional layer as local detailed features, and then encoded them into a global feature via the bag of visual words, locally aggregated descriptor vectors [40], and the improved Fisher kernel [41] to generate the final image representation. Liu [42] found that the feature maps of lower CNN layers can provide rich and powerful information; therefore, a two-phase feature fusion architecture was presented to improve the expression ability of features. In [29], [43], [44], and [45], researchers designed multibranch networks to boost classification accuracy. The authors in [46] utilized convolution filters in a sliding-window manner to obtain the spatial distribution features of local blocks of deep features, improving the classification performance of deep features. Notwithstanding, as training proceeds, the simple application of a feature fusion strategy will still lose class-specific discriminative local feature information, which may lead to unsatisfactory performance. Furthermore, when disrupting the original image, the RCM may introduce noise, which makes the network learn some unnecessary features, such as edge noise and uncorrelated regional connectivity features [26].

III. PROPOSED METHOD
This section describes the implementation of the proposed MGA-FM framework for HRRS image scene classification in detail. Feature mining means that the network can learn local distinguishing information and efficiently fuse this local information with complementary global information during training. We adopt the RCM to guide the network's shallow layers to focus on multigranularity distinguishing regions, and then design an alternating comprehensive training strategy to obtain and fuse the complementary global features.

A. Network Architecture
The proposed MGA-FM framework can come into effect on any state-of-the-art backbone feature extractor, such as ResNet and VGG. Suppose F is our backbone feature extractor, which can be divided into L stages. The output feature map from the lth stage is expressed as F_l, where l = {1, 2, . . . , L}, as shown in Fig. 2. In addition to the per-stage output feature maps, the outputs of the last stage L and the middle stage L/2 are fused to form a new concatenated feature representation V_concat, which can be seen in Fig. 3. V_concat contains both local detailed information and global information.
The lth stage output vector V^l can be followed by a corresponding classification module H^l_class, which consists of two fully connected layers with batch normalization and an ELU nonlinearity, to obtain the intermediate stage's classification prediction p^l = H^l_class(V^l). The prediction from the concatenation feature stage is p^concat = H^concat_class(V^concat).
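As a rough illustration of the classification module described above, the sketch below implements a two-layer head with an ELU nonlinearity in plain NumPy and a softmax to produce the stage prediction p^l. Batch normalization is omitted for brevity, and all weight and function names are our own; the article's actual module is a trained network layer:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU nonlinearity: x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * np.expm1(x))

def classification_head(v, w1, b1, w2, b2):
    """Sketch of H^l_class: two fully connected layers with an ELU in
    between (batch norm omitted), followed by a softmax that yields
    the stage's class-probability prediction p^l."""
    h = elu(v @ w1 + b1)
    logits = h @ w2 + b2
    # numerically stable softmax over the class dimension
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

The same head shape can be applied to V^l from any stage or to the concatenated vector V_concat; only the input dimension of the first layer changes.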

B. Region Confusion Mechanism (RCM)
Humans can understand the meaning of a sentence even when its words are shuffled, because the human brain focuses on the discriminating words and ignores the unimportant ones. Inspired by natural language processing, we divide the image into several subregions and then randomly disrupt the spatial order of the subregions, forcing the network model to recognize the image based only on local image features, so as to improve the network's ability to learn local image details.
An RCM is a kind of image shuffling technique suited to self-supervised tasks in representation learning: the RCM cuts an image into an n × n grid, and the disordered patches are then used as training data, either with the goal of restoring their correct spatial configuration or as samples for adversarial learning [47]. We neither restore the correct spatial structure of the original image nor conduct adversarial learning, but borrow the concept of the RCM to generate the input images for training the shallow CNN layers. The purpose is to retain the local details of images and promote the network to pay attention to the local structure of objects, reducing the network's excessive attention to semantic information during training, improving the model's generalization ability in scene classification, and alleviating the large intraclass variance caused by large resolution variance.
As shown in Algorithm 1, given an input image X ∈ R^{H×W×3}, where 3 is the number of channels and W and H are the width and height, together with its label y, the RCM equally splits the image into n × n three-channel patches. To guarantee the integrity of each patch, n should divide both W and H. The original image X is randomly cut into n × n patches and then randomly spliced into a new image X_n; the process can be expressed as

X_n = RCM(X, n).    (1)

Taking a pair X and X_n generated by the RCM as an example, with x_{i,j} denoting the three-channel patch at coordinates (i, j), X can be written as

X = [ x_{0,0}     ···  x_{0,n−1}
        ⋮         ⋱        ⋮
      x_{n−1,0}   ···  x_{n−1,n−1} ]

and X_n is obtained by randomly permuting the patches x_{i,j} and splicing them back together.
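The patch-splitting and shuffling step of the RCM can be sketched as follows. This is a minimal NumPy version assuming H and W are divisible by n; the function name `region_confusion` is our own:

```python
import numpy as np

def region_confusion(image, n, rng=None):
    """Split an H x W x 3 image into an n x n grid of patches, shuffle
    the patches randomly, and splice them back into a new image X_n.
    H and W must be divisible by n so that every patch stays intact."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // n, image.shape[1] // n
    # row-major list of the n*n three-channel patches x_{i,j}
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(n) for j in range(n)]
    order = rng.permutation(len(patches))
    # re-splice the permuted patches row by row
    rows = [np.concatenate([patches[order[r * n + c]] for c in range(n)], axis=1)
            for r in range(n)]
    return np.concatenate(rows, axis=0)
```

With n = 1 the permutation is trivial and the original image is returned unchanged, matching the n = 1 baseline used later in the ablation study.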
Algorithm 2: Alternating Comprehensive Training of MGA-FM
1: while not converged do
2:   if step = 1 then
3:     Generate shuffled image X_n by RCM.
4:     Train the first L/2 convolution layers of the model with X_n.
5:     Calculate the loss l_1 by cross-entropy loss.
6:     Update parameters θ by back-propagating the loss l_1.
7:   end if
8:   if step = 2 then
9:     Train all convolution layers of the model with X.
10:    Calculate the loss l_2 by cross-entropy loss.
11:    Update parameters θ by back-propagating the loss l_2.
12:   end if
13:   if step = 3 then
14:    Fuse the features derived from the L/2th convolution layer and the last convolution layer.
15:    Train all convolution layers of the model with X.
16:    Calculate the loss l_3 by cross-entropy loss.
17:    Update parameters θ by back-propagating the loss l_3.
18:   end if
19:   l_total ← l_1 + l_2 + l_3
20:   if l_total does not decrease for 10 epochs then
21:     Break.
22:   end if
23: end while
24: return predicted probability p

The RCM cannot always guarantee that every independent object falls within a single patch; some objects may be larger than the split patch size. However, this may not be bad news for MGA-FM. Conversely, MGA-FM adopts random cropping before splitting, and the puzzled image X_n generated by the RCM at each epoch may differ from that of the previous epoch, so the discriminative small patches split at one epoch are not always split at other epochs. Thus, the model is forced to find more discriminative parts, bringing additional benefits.

C. Alternating Training Strategy
We adopt the alternating comprehensive training strategy to extract and fuse multigranularity features; each batch of data is trained in three steps. The output of each step is trained independently, and the parameters used in the current step are also optimized, which helps the three steps train the network jointly.
As shown in Fig. 3, each epoch consists of three steps in the alternating comprehensive training procedure. In the first step, the input HRRS original images are processed into confused images by the RCM and sent through the first L/2 convolution blocks of the backbone feature extractor F. Gradient propagation is carried out in the shallow convolution blocks of the network, training the model to focus on the local detailed information of images. Introducing shuffled images also introduces some uncertain visual noise patterns; to counteract these negative effects, in the second step the HRRS original images are sent through all convolution blocks of F to train the complete network, expanding the network's attention to the whole image and drawing it toward global semantic information. This minimizes the impact of the noise patterns, retaining only the local details beneficial to classification while filtering out irrelevant factors. In the third step, the feature fusion branch is added to the network trained in the second step to obtain a feature representation with stronger semantic information and local detailed information. Details of training MGA-FM can be seen in Algorithm 2.
The network is trained end-to-end. This framework promotes the network to obtain feature descriptions with stronger global semantic information and local detailed information for HRRS image scene classification. Specifically, for the outputs from the middle stage, the last stage, and the concatenation feature stage in the training part, we adopt the cross-entropy loss L_CE to compute the loss between the ground truth y and the predicted probabilities p^l as

L_CE = −(1/m) Σ_{i=1}^{m} y_i log p^l_i    (2)

where m is the number of samples within one batch, y_i denotes the ground truth of the ith sample, and p^l_i is the predicted probability of the ith sample at stage l.
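For one-hot ground-truth labels, the cross-entropy loss above reduces to the following NumPy computation (a sketch; the small epsilon guarding log(0) is our own addition, not part of the article's formulation):

```python
import numpy as np

def cross_entropy(y_onehot, p):
    """L_CE = -(1/m) * sum_i y_i . log(p_i): the batch-averaged
    cross-entropy between one-hot ground truth y and predicted class
    probabilities p at a given stage."""
    m = y_onehot.shape[0]
    # epsilon avoids log(0) for probabilities that underflow to zero
    return -np.sum(y_onehot * np.log(p + 1e-12)) / m
```

The same loss is computed at the middle stage, the last stage, and the concatenation stage, and the three losses l_1, l_2, l_3 are summed into l_total as in Algorithm 2.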
Furthermore, no additional computational overhead is incurred in the inference part except the standard classification network feed-forward and the feature fusion branch. We only need to put test images into the trained network, passing only through the third step of the training procedure, as shown in Fig. 3. The prediction can be written as

prediction = arg max(p^concat).    (3)
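Equation (3) is simply an argmax over the concatenation-stage class probabilities; as a trivial sketch:

```python
import numpy as np

def predict(p_concat):
    """Eq. (3): the predicted class is the index of the largest
    concatenation-stage probability."""
    return int(np.argmax(p_concat))
```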

IV. EXPERIMENTAL RESULTS AND DISCUSSION

A. Experimental Dataset
In this work, we use the UC Merced Land-Use (UCM), Aerial Image Dataset (AID), and NWPU-RESISC45 (NWPU-45) datasets to demonstrate MGA-FM's performance. Sample images from these datasets are shown in Figs. 4-6 and detailed in Table I.
1) UCM [48], the most classical scene classification benchmark dataset, is manually collected from the USGS. For each experiment on this dataset, we set the training ratio at 50% and 80%.
2) AID [49] is collected from Asia, Europe, North America, and other regions at different times and under different imaging conditions. For each experiment on this dataset, we set the training ratio at 20% and 50%.
3) NWPU-45 [50] is composed of images obtained from Google Earth through satellite imagery, aerial photography, and geographic information systems. For each experiment on this dataset, we set the training ratio at 10% and 20%.
Traditional data augmentation strategies are applied to the training set, including rotating images by 90° and flipping them horizontally and vertically.
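The augmentation described above (a 90° rotation plus horizontal and vertical flips) can be sketched in NumPy as follows; the function name is our own, and in practice such transforms would be applied randomly per batch rather than all at once:

```python
import numpy as np

def augment(image):
    """Traditional augmentation applied to the training set: the
    original image, a 90-degree rotation, a horizontal flip, and a
    vertical flip of an H x W x C array."""
    return [image,
            np.rot90(image),   # rotate by 90 degrees in the H-W plane
            image[:, ::-1],    # horizontal flip (reverse columns)
            image[::-1, :]]    # vertical flip (reverse rows)
```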

B. Implementation Details
Experiments are implemented in PyTorch on the Ubuntu 18.04 operating system with two NVIDIA GeForce RTX 2080 Ti GPUs, an i7-8700K CPU, and 16 GB of RAM.
The settings for each stage's convolution block H_conv of the MGA-FM network are shown in Table II. In the alternating comprehensive training, we set the initial learning rate to 0.002, and the cosine annealing principle is used to update the learning rate as training goes on. The batch size of MGA-FM is set to 16. Training is stopped after 200 epochs or terminated early if the test loss does not decrease for ten epochs. Only stochastic gradient descent (SGD) is used in all experiments to optimize the network. We report the average and standard deviation (Std.) over five runs of each experiment as the final performance. The training and test sets are reselected for each run.
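The cosine annealing schedule mentioned above can be sketched as follows. The minimum learning rate of 0 and the per-epoch update granularity are our assumptions, since the article does not specify them:

```python
import math

def cosine_annealing_lr(epoch, total_epochs=200, lr_init=0.002, lr_min=0.0):
    """Cosine annealing: decay the learning rate from lr_init to lr_min
    over total_epochs, following half a cosine period."""
    return lr_min + 0.5 * (lr_init - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))
```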

C. Performance of the MGA-FM Method
To verify the effectiveness of improving on baseline models, we choose VGG16 [8] and ResNet50 [10] as the backbone networks of MGA-FM on the UCM, AID, and NWPU-45 datasets, respectively, and compare against the baseline models. Overall accuracy (OA), Std., inference time, and the confusion matrix are used for evaluation in the following experiments. The inference time is the average processing time of a single image at inference. The confusion matrix is used to evaluate the degree of confusion among categories in a dataset; each matrix element represents the ratio between the predictions and the ground truth for the corresponding category. As shown in Table III, the OA of the MGA-FM(VGG16) and MGA-FM(ResNet50) methods improves on the baselines by 3.56% and 3.67% on UCM, 6.08% and 6.24% on AID, and 9.97% and 9.00% on NWPU-45, respectively, while their inference times increase by only 0.121 and 0.072 s over the baselines. This proves that the MGA-FM method can come into effect on various CNN backbone feature extractors and effectively improve the feature representation at a low inference-time cost.
Besides the high OA values on the three datasets, the corresponding confusion matrices, exhibited in Figs. 7-9, confirm its superior performance. From Fig. 7, we can find that confusion appears only between "forest" and "agriculture," and between "medium residential" and "dense residential," and the accuracy of each category is 95% or more. Fig. 8 shows that the classification accuracy of most categories reaches more than 90%. From Fig. 9, we can find that among the 45 categories, 41 have an accuracy of more than 90%, of which six reach 99%. We can conclude from the experimental results that the MGA-FM method obtains better comprehensive feature representations than its backbone networks.
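The OA and row-normalized confusion matrix used in these evaluations can be computed as in the sketch below (function names are our own):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of
    ground-truth class i predicted as class j, so the diagonal holds
    each class's accuracy."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)

def overall_accuracy(y_true, y_pred):
    """OA: fraction of test samples whose predicted label matches the
    ground truth."""
    return float(np.mean(np.array(y_true) == np.array(y_pred)))
```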

D. Performance Comparison With the State-of-the-Art Methods
In this section, the proposed MGA-FM(ResNet50) method is compared with state-of-the-art methods used in HRRS image scene classification, including GoogLeNet [51], DenseNet [52], Triplet Network [43], Multi Branch Neural Network [44], CaffeNet with DCF [46], VGG-VD16 with DCF [46], ACNet [34], MSA-Network [51], and other methods applied to HRRS image scene classification. Experiments are performed on the UCM, AID, and NWPU-45 datasets, with OA, Std., and inference time used for evaluation. Table IV shows the performance obtained by the different methods. As shown in Table IV, MGA-FM(ResNet50) performs better than the other methods on the AID and NWPU-45 datasets, and on UCM its OA exceeds 98% with 50% training data and 99% with 80% training data. Compared with methods based on multibranch networks, such as LCNN-BFF [29], Triplet Network [43], and DenseNet [52], MGA-FM(ResNet50) achieves better OA at similar inference times, which shows that the proposed method learns more distinguishing features. Compared with methods that use a sliding window or attention mechanism to obtain local features, such as CaffeNet with DCF [46], VGG-VD16 with DCF [46], ACNet [34], and AMB-CNN [28], our proposed method is superior and has less inference time, indicating that the alternating comprehensive training strategy learns powerful features more efficiently. DDRL-AM [30] and MG-CAP [38] combine discriminative local feature extraction with multibranch feature fusion and attention algorithms to further gain competitive performance, but their inference times exceed 0.5 s. It is worth noting that, compared with them, the OA of our MGA-FM(ResNet50) method achieves a lead of 2.51% with 20% training data and 1.54% with 50% training data on AID, and 0.18% with 10% training data and 0.71% with 20% training data on NWPU-45, while its inference time is only 0.319 s.

E. Ablation Study
In our MGA-FM framework, we adopt the alternating training strategy to extract and fuse multistage features, and the RCM to guide the MGA-FM to focus on discriminating regions. To further discuss the effectiveness of MGA-FM, we conduct ablation studies.

1) Effectiveness of Feature Fusion:
In the alternating strategy, we aim to select the distinguishable shallow features of the shallow convolution layers to fuse with the deep features. We attempt to fuse the output features from each of the first through fourth convolution stages with those of the last stage of the backbone feature extractor F, respectively, to verify the distinguishability of the features. ResNet50 and VGG16 are selected as backbone networks and evaluated on the AID dataset with a training ratio of 50%. The experimental results are given in Table V. According to Table V, whether ResNet50 or VGG16 is used as the backbone network, the OA obtained by fusing the features of the third convolution stage with those of the last stage is the highest. By further analysis, we also find that classification using only a single deep feature is weaker than classification using the fused features. This proves the effectiveness of feature fusion and also verifies our hypothesis that local detailed information, beyond global features, is helpful for HRRS scene classification.
To verify the effectiveness of the outputs from the last stage L and the middle stage L/2 (the third convolution block in our experiments), the per-class accuracies of MGA-FM(ResNet50) on AID with a training ratio of 50%, based on the shallow-feature prediction of the third convolution stage and the deep-feature prediction of the last convolution stage, are given in Fig. 10. Comparing the histograms, the classification accuracy of "center" increases from 91% to 97%, that of "desert" increases from 92% to 99%, and those of "park" and "school" increase significantly (82% to 92% and 79% to 90%, respectively). These categories always contain similar information (for example, the "school" category is easily confused with "resort" and "square," as they all contain greenery, roads, and buildings, while their spatial distribution differs). Furthermore, as can be observed in Fig. 10, the shallow-feature prediction accuracies for "beach," "commercial," "dense residential," and "port" are higher than those of the deep-feature prediction, which proves that local detailed information can be helpful for HRRS scene classification.
2) Effectiveness of RCM: We use Stage3 and Stage5 of the backbone feature extractor F for feature fusion and vary the partition granularity n of the puzzle to verify the effect of adding the RCM. This experiment selects ResNet50 as the backbone feature extractor F and is performed on the AID dataset with a training ratio of 50%. The comparison results for five different n values are shown in Table VI. When n = 1, the RCM does not take effect and the input is the original image. We find that n = 8 achieves the highest accuracy: a certain amount of image region confusion helps the network learn local detailed information better and avoids the low generalization caused by excessive attention to semantic information during training. These results suggest that the RCM can improve classification accuracy, but excessive region confusion may damage performance.

F. Visualization and Analysis
To show the advantages of the MGA-FM framework more intuitively, we use Grad-CAM [53] to compare the convolution layers of the baseline ResNet50 and MGA-FM(ResNet50). Grad-CAM uses the gradient of any target with respect to the last convolutional layer to generate a coarse attention map that displays the important regions in the model's prediction. Columns (c) and (d) in Fig. 11 visualize the third convolution stage and the last convolution stage of our MGA-FM(ResNet50), respectively. In column (c), the third convolution stage of MGA-FM identifies parts at a smaller grain size, that is, the local detailed patterns or textures of the image. In column (d), we can see that in the last convolution stage the model pays more attention to the global semantic information of the image. In particular, for the "port" category in the fourth row from AID, the regions of interest in column (c) cover rivers and ships, while column (d) mainly focuses on ships, which may be why the classification accuracy of the shallow convolution stage for "port" in Fig. 10 is higher than that of the deep convolution stage; this further visually demonstrates the value of local detailed features for HRRS scene classification. Given that columns (c) and (d) focus on different parts, fusing the features of the third convolution stage and the last convolution stage yields a more discriminative feature description. The visualization results show that our proposed MGA-FM method can promote the network to obtain a more comprehensive feature representation.
Compared with the Grad-CAM of the baseline ResNet50, the proposed training strategy produces more evident and discriminative attention on the target object. In contrast, ResNet50 only shows its attention at the last prediction stage, perhaps because the alternating comprehensive training strategy helps the network locate useful information in the shallow layers. Furthermore, ResNet50 focuses on only a few parts of the object in the last prediction stage, while the attention region of MGA-FM(ResNet50) almost covers the whole object at both the shallow and deep convolution blocks. This indicates that the images generated by the RCM can force the network to learn features with both local detailed and global semantic information.

V. CONCLUSION
In this article, we propose an MGA-FM framework to effectively learn more discriminative feature representations and achieve competitive scene classification performance compared with other methods. Our framework adopts an RCM to guide the network to learn local distinguishing features in a self-supervised manner. In addition, to address the noise introduced by the RCM, we designed an alternating comprehensive training strategy in which local feature learning, global feature learning, and complementary feature fusion are carried out alternately to ensure the correlation between local and global features. More importantly, the framework can be applied to different CNN backbone networks and trained end-to-end without manually annotated local regions beyond category labels. Experiments on several public HRRS datasets show that the MGA-FM framework extracts more practical features and is more conducive to scene classification. In the future, we will study methods for compressing model parameters while maintaining classification performance.