Data Mining Algorithm for Cloud Network Information Based on Artificial Intelligence Decision Mechanism

With the rapid development of information and network technology, data are abundant, yet the shortage of knowledge extracted from them is becoming increasingly serious. Data mining technology has developed vigorously in this environment. Based on the Spark programming model, this paper designs a parallel extension of fuzzy c-means. To enhance the performance of this parallel extension, the improvement strategy that k-means applies during its initialization phase is borrowed, and k-means// is extended to fuzzy c-means to obtain better clustering performance. Combined with Spark's programming model, this yields a scalable parallel fuzzy c-means algorithm. Experiments on several data sets show that the proposed algorithm has good scalability and parallelism, effectively extends fuzzy c-means clustering to distributed applications, and greatly increases the scale of data the algorithm can process. The approach also improves the robustness of the algorithm and its adaptability to the shape and structure of the data, so that the parallel and scalable clustering algorithm can perform cluster analysis on big data more effectively. Three algorithms were simulated on the MATLAB platform. Using simple data sets and complex two-dimensional data sets, the proposed algorithm is compared with the traditional fuzzy c-means algorithm and the fuzzy c-means algorithm based on fuzzy entropy. Experiments show that the scalable parallel fuzzy c-means algorithm not only greatly improves anti-noise performance but also improves convergence speed, and that it can automatically determine the optimal number of clusters.


I. INTRODUCTION
With the rapid development and increasing popularity of the Internet, modern society is generating data at unimaginable speeds. Mobile communication, website access, logistics and transportation, scientific experiments, and ubiquitous social and commercial activities are constantly generating data of all kinds, marking the arrival of a brand new era of explosive data growth: the era of big data. Taken literally, big data seems to emphasize only the size of the data, but big data is not just ''big''; unpredictable data content and diverse data structures will be difficult problems that data analysis technology needs to solve [1]- [3]. This requires analytical technology to filter out low-value or low-density data and then mine the gold of knowledge from high-value or high-density data [4], [5]. In recent years, the prosperity of the information industry has spawned a number of new concepts, technologies, and applications such as the Internet, massive data, massive storage, and analysis, all of which have contributed to the prosperity of big data.
The associate editor coordinating the review of this manuscript and approving it for publication was Ying Li.
After decades of development, data mining has become an interdisciplinary field that integrates knowledge from statistics, databases, machine learning, pattern recognition, artificial intelligence, and parallel computing [6]- [8]. Over this period, the data objects studied have evolved from regular, well-structured data to today's messy and massive data [9]. The scope of research is therefore becoming wider and the technical requirements higher. At present, research on large-scale data mining is mainly based on cloud computing platforms and on distributed and parallel processing of mining tasks [10]- [12]. Because Hadoop is one of the most widely used cloud computing platforms, much research builds on its MapReduce programming framework [13]- [15]. PDMiner (Parallel Distributed Miner), a Hadoop-based parallel data mining system, is a data mining cloud platform developed by the Institute of Computing Technology of the Chinese Academy of Sciences [16], [17]. PDMiner integrates many data mining algorithms and provides a customized user interface through which users can submit tasks and complete their goals. Fuzzy cluster analysis, one of the main techniques of unsupervised machine learning, analyzes and models data using fuzzy theory [18]- [20]. It describes the uncertainty of sample categories and can therefore reflect the real world more objectively. It is used effectively in large-scale data analysis, data mining, vector quantization, image segmentation, pattern recognition, and other fields, and it has important theoretical and practical value [21]- [23]. The association rule method can be used to find relationships or rules between the values of two or more variables. Its ultimate goal is to find the association network in the entire data set.
Associations can be divided into simple associations, temporal associations, and causal associations. With the further development of applications, research on fuzzy clustering algorithms has been steadily enriched [24]- [26]. Relevant scholars proposed a clustering method based on the composition of fuzzy relations, but because it is unsuitable for large data sets, it has rarely been studied further [27]- [29]. Gradually, researchers began to study fuzzy clustering using graph theory [30], [31]. To improve the ability of classic fuzzy c-means to cluster non-linear data, related scholars introduced a kernel function that maps the data to a high-dimensional feature space, where inner products are then computed [32], [33]. Kernelization transforms the expressions for membership, distance, and the objective function to yield a kernelized version of fuzzy c-means [34]- [36]. Relevant scholars have proposed three different sampling techniques based on non-iterative extended sampling [37]- [39]. Relevant scholars have also parallelized the k-Medoids clustering algorithm on the MapReduce model, allowing the algorithm to use a Hadoop cluster effectively and greatly increasing the scale of data it can process [40], [41].
This paper presents the design and experimental analysis of a parallel extension of the fuzzy c-means algorithm. By building the parallel extension on Spark's resilient distributed dataset operations, the algorithm can use a cluster for data analysis tasks that scale with both data size and cluster size. At the same time, a new initialization method is used to reduce the sensitivity of the extended algorithm to initialization and thus improve its robustness. Because cluster computing is costly in a distributed environment, the robustness of the algorithm is very important: an unstable algorithm is not suitable for parallel scaling in big data processing scenarios. Experimental comparison with existing Spark-based clustering algorithms shows that the scalable parallel algorithm can correctly and effectively perform cluster analysis of large-scale data.
Specifically, the technical contributions of this paper can be summarized as follows: First, following k-means//, an improved initialization process is designed, and both the initialization part and the iteration part are parallelized using the distributed data set operations provided by Spark, so that the algorithm is parallel and highly scalable.
Second, the traditional fuzzy c-means clustering algorithm, fuzzy c-means clustering algorithm with fuzzy entropy, and scalable parallel fuzzy c-means clustering algorithm are applied to simple data sets and complex two-dimensional data sets. The simulation was performed on the MATLAB platform, and the clustering results were evaluated using performance indicators.
The rest of this paper is organized as follows. Section 2 analyzes cloud network information clustering in data mining technology. Section 3 studies the scalable parallel fuzzy c-means algorithm. In Section 4, the cloud network information data mining algorithm is simulated, and the simulation results are discussed. Section 5 concludes the paper.

II. CLUSTERING ANALYSIS OF CLOUD NETWORK INFORMATION IN DATA MINING TECHNOLOGY
A. DATA MINING TECHNOLOGY
Data mining is the process of extracting hidden and potentially useful knowledge and information from a large amount of incomplete, noisy, fuzzy, and random data [42]- [44]. This definition has two levels of meaning: (1) the data source must be real, large, and noisy; (2) the knowledge is implicit, previously unknown, and potentially useful, and the extracted knowledge is expressed as concepts and rules.
Data mining is a decision support process that looks for patterns in a collection of facts or observations. A pattern is an expression E in a language L that describes the characteristics of data in the data set F. The data described by E form a subset $F_e$ of F. For E to qualify as a pattern, it must be simpler than enumerating all elements of $F_e$. The object of data mining is not only a database but may also be a file system or any other organized data collection, such as cloud network information resources.
The inputs of a data mining system are the data in the database, the guidance of the information analyst, and the knowledge and rules stored in the knowledge base of the mining system. The selected data are processed by the various mining modules to generate auxiliary patterns and relationships, which are then evaluated in interaction with analysts to find interesting patterns. Some of these also need to be added to the knowledge base for subsequent extraction and evaluation [45], [46]. The topological structure of the data mining grid is shown in Figure 1.
Machine learning and data mining are closely related. The main difference between the two is that the task of data mining is to discover understandable knowledge, while machine learning is concerned with improving the performance of a system. Training a neural network to balance an inverted pendulum is therefore a machine learning process but not data mining. The main objects of data mining are large data sets, such as data warehouses, whereas the data sets processed by machine learning are generally much smaller, so efficiency is crucial to data mining.

1) THE PROCESS OF DATA MINING
The data mining process generally consists of three main stages: data preparation, mining operations, result expression and interpretation.
(1) Data preparation stage: This stage can be further divided into 3 sub-steps: data integration, data selection, and data preprocessing. Data integration combines the data in multiple files or multiple database operating environments to resolve semantic ambiguities, handle omissions in the data, and clean dirty data. The purpose of data selection is to identify the data set to be analyzed, reduce the processing scope, and improve the quality of data mining. Preprocessing is to overcome the limitations of current data mining tools.
(2) Data mining stage: This stage performs actual mining operations.
(3) Results presentation and interpretation phase: This phase analyzes the extracted information according to the decision purpose of the end user, identifies the most valuable information, and submits it to the decision maker through a decision support tool. The task of this step is therefore not only to express the results (for example, using information visualization methods) but also to filter the information. If the decision maker is not satisfied, the data mining process above needs to be repeated.

2) THE MAIN PROBLEMS OF DATA MINING
The main problems of data mining lie in the following areas:
(1) Mining methods and user interaction issues: These concern the type of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge presentation.
(2) Performance issues: These include the effectiveness, scalability, and parallel processing of data mining algorithms. To extract information effectively from a large amount of data in a database, the data mining algorithm must be efficient and scalable. In other words, for large databases, the running time of the data mining algorithm must be predictable and acceptable. In addition, the large capacity of many databases, the wide distribution of data, and the computational complexity of some data mining algorithms motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into parts that can be processed in parallel and then combine the results of each part. Furthermore, the high cost of some algorithms in the data mining process has led to the need for incremental data mining algorithms. Incremental algorithms work together with database updates so that not all data have to be mined again; they incrementally update knowledge, modifying and strengthening previously discovered knowledge.
(3) Diversity of database types: There are various data storage methods, including relational databases, data warehouses, transaction databases, advanced database systems, spatial databases, text databases, multimedia databases, heterogeneous databases, and legacy databases. It is important to develop data mining systems on this basis. Because of the diversity of data types and the different goals of mining, it is unrealistic to expect one system to mine all types of data. To mine a specific type of data, a specific data mining system should be constructed, so that different types of data may have different data mining systems. Moreover, discovering knowledge from structured, semistructured, and unstructured data sources with different data semantics poses a huge challenge to data mining.

B. CLOUD COMPUTING ARCHITECTURE
The cloud computing platform is called a ''cloud'' because it has a huge ''cloud'' network and a powerful computer cluster that provide network computing and services. Using virtualization technology, it concentrates massive resources and lets various terminals obtain services anytime and anywhere. The cloud computing architecture is shown in Figure 2.
1) Cloud client: Users log in to the service request portal and send requests to the server through the cloud client. This realizes the interaction between client and server and supports user account registration, resource configuration, custom services, and other functions.
2) Service directory: After obtaining login rights, users can customize the required services through this service directory or cancel existing service items. The cloud display interface generates corresponding service icons for users to browse based on the existing services.
3) Management system and deployment tools: These manage and deploy the entire cloud, optimize the allocation and utilization of resources, and schedule resources effectively.

4) Monitoring: Real-time monitoring of the data usage status of the cloud system ensures the reasonable allocation of resources.
5) Server cluster: Multiple servers are used to implement parallel computing. The main duties of the cluster are to handle requests from multiple users, process big data in parallel, and back up and store data.
Users can log in to the server through the cloud client, select the cloud service they need from the service catalog, and send the request to the server cluster. The server schedules the calculation of the response through the management system and deployment tools, and returns it to the cloud client. Deployment tools allocate resources, and configure web applications.

C. FORMAL DESCRIPTION OF CLUSTERING
Clustering is a major technique in data mining. It groups a set of individuals into several categories according to similarity, that is, ''things are clustered in categories.'' Its main purpose is to make the distance between individuals belonging to the same category as small as possible, while making the distance between individuals in different categories as large as possible. The fundamental difference between clustering and classification is that in a classification problem the class attribute of each training example is known, whereas in clustering this class attribute value has to be found in the training examples. Clustering methods include statistical methods, machine learning methods, neural network methods, and database-oriented methods.
VOLUME 8, 2020
In statistical methods, clustering is called cluster analysis, and it is one of the three major methods of multivariate data analysis (the other two are regression analysis and discriminant analysis). It mainly studies clustering based on geometric distances, such as Euclidean distance and Minkowski distance. Traditional statistical clustering analysis methods include systematic clustering, decomposition, joining, dynamic clustering, and ordered sample clustering, overlapping clustering and fuzzy clustering.
Clustering in machine learning is called unsupervised induction, or induction without a teacher. Whereas the examples or data objects in classification learning carry class labels, the examples to be clustered are unlabeled, and their classes must be determined automatically by the clustering algorithm. Conceptual clustering algorithms in the field of machine learning perform clustering on symbolic attributes and derive a conceptual description of each cluster. When the set of objects to be clustered can grow dynamically, conceptual clustering is called concept formation.
In neural networks, there is a class of unsupervised learning methods: self-organizing neural network methods, such as Kohonen self-organizing feature mapping networks, competitive learning networks, and so on. The SOM method in neural networks clusters data through repeated learning. It consists of an input layer and a competition layer. The input layer consists of N input neurons, and the competition layer consists of m × n = M output neurons arranged in a two-dimensional planar array. Each neuron in the input layer is fully connected to the neurons in the competition layer. The LBG method in vector quantization (VQ) can only cluster numerical attributes. The usual approach is to divide the set of all vectors to be identified into several subsets such that the vectors in each subset have similar characteristics and can therefore be represented by a single representative vector. This representative vector is called a codeword, and the set of all codewords is called a codebook.
The cluster analysis problem can be described as follows: given n vectors in the m-dimensional space $R^m$, assign each vector to one of S clusters so that the ''distance'' between each vector and its cluster center is the smallest. In essence, the cluster analysis problem is a global optimization problem. Here, m can be regarded as the number of attributes of a sample that participate in clustering, n is the number of samples, and S is the number of clusters set by the user in advance.
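The vectors $X_i$ and $X_j$ in the m-dimensional space $R^m$ are:
$$X_i = (x_{i1}, x_{i2}, \ldots, x_{im}), \qquad X_j = (x_{j1}, x_{j2}, \ldots, x_{jm})$$
Then the distance between the vectors $X_i$ and $X_j$ can be defined as:
$$d(X_i, X_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$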

1) INTERVAL SCALE VARIABLES
The unit of measurement chosen directly affects the results of the cluster analysis. In general, the smaller the unit selected, the larger the possible range of the variable and the greater its impact on the clustering result. Therefore, to avoid the dependence of clustering results on unit selection, the data should be standardized. After standardization, the dissimilarity between objects is calculated based on distance. The most commonly used distance metric is the Euclidean distance, which is defined as:
$$d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
Here $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects. When using the Euclidean distance, special attention should be paid to the selection of the measured values of the samples, which should effectively reflect the characteristics of the category attributes. Two other well-known metrics are the Manhattan distance and the Minkowski distance, which are, respectively:
$$d(i, j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$$
$$d(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q}$$
It can be seen from the Minkowski distance that q = 1 gives the Manhattan distance and q = 2 gives the Euclidean distance.
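The relationship between the three metrics can be checked with a short sketch. The function names below are illustrative, and the z-score used in `standardize` is only one common way to standardize a variable; the text does not say which standardization is intended, so that choice is an assumption.

```python
import math


def standardize(column):
    """Rescale one variable to zero mean and unit (population) standard deviation.

    Assumption: z-score standardization; other schemes are possible.
    """
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std > 0 else 0.0 for v in column]


def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def manhattan(x, y):
    """Manhattan distance: the Minkowski distance with q = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Euclidean distance: the Minkowski distance with q = 2."""
    return minkowski(x, y, 2)


if __name__ == "__main__":
    a, b = (0.0, 0.0), (3.0, 4.0)
    print(manhattan(a, b))  # 7.0
    print(euclidean(a, b))  # 5.0
```

Standardizing each variable before computing distances keeps a variable measured in small units from dominating the sum.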

2) BINARY VARIABLES
A binary variable has only two states, 0 and 1: 0 indicates that the attribute is absent, and 1 indicates that it is present. For example, given a variable smoker that describes a patient, 1 means that the patient smokes, and 0 means that the patient does not smoke. If all binary variables have the same weight, a contingency table with two rows and two columns can be constructed, as shown in Table 1. Table 1 summarizes the possible combinations of variable values for the two objects. In the table, q is the number of variables for which object i and object j both have the value 1, r is the number of variables for which object i has the value 1 and object j has the value 0, s is the number of variables for which object i has the value 0 and object j has the value 1, and t is the number of variables for which objects i and j both have the value 0. The total number of variables is p, p = q + r + s + t.
A binary variable is symmetric if its two states are of equal value and have the same weight. In this case, the dissimilarity between two objects i and j is evaluated with the well-known simple matching coefficient, which is defined as follows:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
That is, the number of variables on which the two objects differ is divided by the total number of variables, matching and mismatching.
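As an illustration, the counts q, r, s, t and the simple matching dissimilarity can be computed as in the sketch below. The function names are illustrative only. The same counts also yield the Jaccard dissimilarity, which ignores joint absences (t) and is typically used for asymmetric binary variables.

```python
def contingency(obj_i, obj_j):
    """Return (q, r, s, t) for two equal-length vectors of 0/1 values."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in pairs if a == 1 and b == 0)  # i is 1, j is 0
    s = sum(1 for a, b in pairs if a == 0 and b == 1)  # i is 0, j is 1
    t = sum(1 for a, b in pairs if a == 0 and b == 0)  # both 0
    return q, r, s, t


def simple_matching_dissimilarity(obj_i, obj_j):
    """Symmetric case: mismatches divided by all variables."""
    q, r, s, t = contingency(obj_i, obj_j)
    return (r + s) / (q + r + s + t)


def jaccard_dissimilarity(obj_i, obj_j):
    """Asymmetric case: joint absences (t) are left out of the denominator."""
    q, r, s, _ = contingency(obj_i, obj_j)
    denominator = q + r + s
    return (r + s) / denominator if denominator else 0.0


if __name__ == "__main__":
    patient_a = [1, 0, 1, 0, 0, 0]
    patient_b = [1, 0, 0, 0, 1, 0]
    print(contingency(patient_a, patient_b))  # (1, 1, 1, 3)
    print(simple_matching_dissimilarity(patient_a, patient_b))
    print(jaccard_dissimilarity(patient_a, patient_b))
```

In the example, the two objects differ on two of six variables, so the simple matching dissimilarity is 2/6, while the Jaccard dissimilarity is 2/3 because the three joint absences are not counted.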
If the two states of a binary variable are not equally important, the binary variable is asymmetric. By convention, the more important outcome, usually the one with the smaller chance of occurrence, is encoded as 1 and the other as 0. Given two asymmetric binary variables, the case where both take the value 1 is considered more meaningful than the case where both take the value 0. The dissimilarity in this case is evaluated with the Jaccard coefficient, which is defined as:
$$d(i, j) = \frac{r + s}{q + r + s}$$

1) PARTITION-BASED CLUSTERING ALGORITHM
Partition-based clustering is the most widely used form of clustering. Its purpose is to divide the data set into several subsets: given a data set with n tuples or records, k groups are constructed, each group representing a cluster (k < n). For a given k, an initial grouping is produced first and is then refined by repeated iterations, so that each improved grouping is better than the previous one. Common partitioning algorithms include K-means, K-medoids, CLARA (Clustering LARge Applications), CLARANS (Clustering LARge Applications based upon RAndomized Search), and so on. Partitioning clustering algorithms generally require all data to be loaded into memory, which limits their application to large-scale data. They also require users to specify the number of clusters in advance, but in most practical applications the final number of clusters is unknown. In addition, a partitioning clustering algorithm uses only one fixed criterion to determine the clusters, which makes the clustering result unsatisfactory when the clusters have irregular shapes or very different sizes.
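The iterative regrouping described above can be sketched with a plain k-means loop. This is a minimal, unoptimized illustration that assumes random initial centers and squared Euclidean distance; the function names and the fixed `seed` parameter are illustrative choices, not part of the paper.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    """Partition `points` (a list of tuples) into k groups by repeated refinement."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial grouping from k random points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New center is the coordinate-wise mean of the group.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty group's center
        if new_centers == centers:
            break  # grouping no longer improves
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

Note that k must be supplied by the caller and all points are held in memory, which mirrors the two limitations mentioned in the text.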

2) DENSITY-BASED CLUSTERING ALGORITHM
The density-based clustering method groups points of similar density into clusters according to differences in spatial density, and a cluster can extend in any direction as the density changes. The main idea is that clustering continues as long as the number of objects or data points in a neighborhood exceeds a certain threshold.
The density-based clustering method treats clusters as high-density regions of objects that are separated by low-density regions in the data space. Its advantage is that it needs only one scan and can find clusters of arbitrary shape and number in a spatial database containing ''noise''.

3) GRID-BASED CLUSTERING ALGORITHM
The grid-based clustering method uses a multi-resolution grid data structure, which transforms the processing of points into the processing of space and clusters the data by dividing the space. It divides the data space into a grid structure with a finite number of units, and all processing is performed on individual units. However, all grid clustering algorithms face the problem of quantization scale. If the division is too coarse, objects of different clusters are more likely to fall into the same unit. Conversely, too fine a division results in many small clusters. The usual method is to start looking for clusters with small cells, then gradually increase the volume of the cells, and repeat this process until satisfactory clusters are found.
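The effect of the quantization scale can be seen in a small sketch. It assumes a simple scheme that is not taken from the paper: points are binned into square cells, cells holding at least `min_points` points are treated as dense, and adjacent dense cells are merged into one cluster. All names and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import product


def grid_cells(points, cell_size):
    """Map each cell index (a tuple of integers) to the points that fall in it."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)
        cells[key].append(p)
    return cells


def neighbors(cell):
    """All cells adjacent to `cell`, including diagonal neighbors."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [tuple(c + o for c, o in zip(cell, off))
            for off in offsets if any(off)]


def grid_clusters(points, cell_size, min_points):
    """Merge adjacent dense cells into clusters; sparse cells are ignored."""
    cells = grid_cells(points, cell_size)
    dense = {key for key, members in cells.items() if len(members) >= min_points}
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            cell = stack.pop()
            component.extend(cells[cell])
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    data = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.1, 0.2), (1.3, 0.4),
            (5.1, 5.2), (5.3, 5.4), (5.2, 5.1)]
    print(len(grid_clusters(data, cell_size=1.0, min_points=2)))   # 2
    print(len(grid_clusters(data, cell_size=10.0, min_points=2)))  # 1
```

With a cell size of 1.0 the two groups are separated, while a cell size of 10.0 places every point in the same cell and merges them into a single cluster, which is the too-coarse case described above.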

III. SCALABLE PARALLEL FUZZY C-MEANS ALGORITHM
A. IMPROVED FUZZY C-MEANS ALGORITHM
Since the fuzzy entropy H (x) is a strictly convex function, the fuzzy entropy can be added as an adjustment function to the objective function of fuzzy c-means.
The objective function with added fuzzy entropy, the derived membership, and the cluster center are expressed in terms of the following quantities: n is the number of data objects, c is the number of clusters, m is the weighting exponent, w is the adjustment factor, $d_{ij}$ is the distance between the j-th data object and the i-th cluster center, and $u_{ij}$ is the degree to which the j-th data object belongs to the i-th cluster, which is a probability value. The traditional fuzzy c-means clustering algorithm considers only the Euclidean distance between the data object and the cluster center and ignores the interaction among the memberships of the same data object in different clusters. Fuzzy entropy compensates for this shortcoming. At the same time, the adjustment factor w reflects the distribution characteristics of the data set. The membership formula with fuzzy entropy has a Gaussian form, so data points near a cluster center have a high probability of belonging to that cluster, while data points farther from the center have a relatively small probability, which effectively suppresses noisy data. The algorithm flow is shown in Figure 3.
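To illustrate the Gaussian-shaped membership mentioned above, the sketch below uses one common entropy-regularized form from the fuzzy clustering literature, $u_{ij} \propto \exp(-d_{ij}^2 / w)$ normalized over the clusters. This form is an assumption made for illustration: it does not involve the weighting exponent m and may differ from the exact formula used in this paper. The function name is hypothetical.

```python
import math


def entropy_memberships(squared_distances, w):
    """Memberships of one data point, given its squared distances to each center.

    Assumed form: u_i is proportional to exp(-d_i^2 / w), normalized to sum to 1.
    The smallest distance is subtracted first so that exp() cannot underflow
    for every center at once; this does not change the normalized result.
    """
    nearest = min(squared_distances)
    weights = [math.exp(-(d - nearest) / w) for d in squared_distances]
    total = sum(weights)
    return [v / total for v in weights]


if __name__ == "__main__":
    # A point close to the first center and far from the third.
    print(entropy_memberships([1.0, 4.0, 9.0], w=2.0))
    # A larger adjustment factor w gives softer (more uniform) memberships.
    print(entropy_memberships([1.0, 4.0, 9.0], w=50.0))
```

Under this assumed form, a distant noise point receives an exponentially small membership in every cluster it is far from, which is one way to read the noise suppression described in the text.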

B. IMPROVED K-MEANS INITIALIZATION ALGORITHM
k-means is a widely used clustering technique designed to minimize the average squared Euclidean distance between objects in the same cluster. Its simplicity and speed are very attractive in practice. k-means usually requires few iterations, which makes it much faster than similar algorithms, but this speed and simplicity come at the cost of accuracy. k-means is sensitive to initialization, yet the algorithm uses completely random initialization. Although this makes the algorithm simple and efficient to execute, the clustering result varies greatly with the initial cluster centers, so in practice the algorithm is often run multiple times and the results averaged, which greatly reduces its practicality.
The k-means++ seeding technique opened a new way to improve the clustering quality of k-means from the initialization stage. By choosing the initial cluster centers with a probability-based procedure, it significantly improves the speed of the k-means algorithm and provides a guaranteed bound on accuracy. However, the steps of k-means++ are inherently sequential: the entire sample set must be scanned multiple times, and each step depends on the results of the previous ones. As a result, the algorithm cannot be scaled out and is not suitable for large-scale data sets.
A parallel version of k-means++, which itself has an inherently sequential execution order, is called k-means//. The algorithm is simple and highly parallel, and it is easy to implement on any parallel computing model. Theoretically, it can be proved that k-means// approximates the optimal solution within a constant factor. While maintaining accuracy, it effectively reduces the number of passes over the samples and the number of algorithm iterations.

1) K-MEANS ++
k-means starts by randomly selecting a set of cluster centers, while k-means++ proposes a special method for selecting these centers. Let $X = \{x_1, \ldots, x_n\}$ be a sample set in a d-dimensional Euclidean space, and let k be the number of clusters. For any sample $x \in X$ and sample subset $Y \subseteq X$, define the distance:
$$d(x, Y) = \min_{y \in Y} \lVert x - y \rVert$$
Let the cluster center set be $C = \{c_1, \ldots, c_k\}$, and define the cost of the sample set Y relative to the cluster center set C as:
$$\phi_Y(C) = \sum_{y \in Y} d^2(y, C) = \sum_{y \in Y} \min_{i = 1, \ldots, k} \lVert y - c_i \rVert^2 \tag{13}$$
k-means++ is a fast and simple clustering initialization technique, and the cost of the center set it produces is, in expectation, within a factor of O(log k) of the cost of the optimal center set. Let $\phi$ be the sum of the costs of all sample points under the center set constructed by k-means++, and let $\phi_{OPT}$ be the sum of the costs of all points under the optimal cluster center set. It can be proved theoretically that $E[\phi] \le 8(\ln k + 2)\,\phi_{OPT}$. Whereas the original k-means selects its initial cluster centers uniformly at random, the set obtained by k-means++ is used as the initial cluster center set for the Lloyd iterations that form the main body of k-means. This reduces the number of iterations and the clustering uncertainty caused by initialization sensitivity, which greatly enhances the robustness of k-means.
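The seeding procedure and the cost $\phi_X(C)$ can be sketched as follows. This is a minimal single-machine illustration of the standard k-means++ rule, in which each new center is drawn with probability proportional to the squared distance to the nearest center already chosen; the function names and the `seed` parameter are illustrative. It assumes the data contain at least k distinct points.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cost(points, centers):
    """phi_X(C): sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers with k-means++ seeding.

    Assumes `points` contains at least k distinct points.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # One full pass over the data per new center (the sequential bottleneck).
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    seeds = kmeans_pp_init(data, k=2)
    print(seeds, cost(data, seeds))
```

Because an already chosen point has zero weight, it cannot be selected twice, and points far from all current centers are favored. The loop also shows why the method needs k passes over the data.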

2) K-MEANS//
The main disadvantage of k-means++ is its inherently sequential execution. To obtain k cluster centers, the data set must be traversed k times, and the calculation of each new cluster center depends on all the cluster centers obtained previously. This prevents the algorithm from being parallelized and greatly limits its application to large-scale data sets. The main idea of k-means// is to change the sampling strategy used during each traversal. After repeated sampling, a set of O(k log n) sample points is obtained. This set approximates the optimal solution within a constant factor, and the O(k log n) points are then clustered into k points. These k points are passed to the Lloyd iterations as the initial cluster centers. In general, about five rounds of sampling are enough to obtain a good set of initial cluster centers.
k-means// is largely inspired by k-means++. The algorithm first selects a single point uniformly at random as the initial cluster center and computes ψ, the sum of the costs of all sample points with respect to this center. It then repeats a sampling step for about log ψ rounds: each point is sampled independently with probability proportional to its contribution to the current cost $\phi_X(C)$, the sampled points are added to the cluster center set C, and the value of $\phi_X(C)$ is updated. With an oversampling factor l, about l points are expected to be sampled in each round, and about l log ψ points are expected to be in C after the last round, so the number of points in C typically exceeds k. Finally, each point in C is weighted by the number of samples assigned to it, and the weighted points are reclustered into k points. Because the number of points in C is generally much smaller than the number of all samples, this reclustering can be done quickly.
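The sampling, weighting, and reclustering steps can be sketched on a single machine as follows. The names `oversampling` and `rounds`, the default of five rounds, and the use of a weighted k-means++ pass for the final reclustering are assumptions made for illustration; in a distributed setting the per-point sampling test would run independently on each data partition.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest_cost(point, centers):
    """Squared distance from `point` to its nearest center."""
    return min(squared_distance(point, c) for c in centers)


def sample_candidates(points, oversampling, rounds=5, seed=0):
    """Oversample candidate centers, several points per pass over the data."""
    rng = random.Random(seed)
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        costs = [nearest_cost(p, candidates) for p in points]
        total = sum(costs)
        if total == 0:
            break  # every point already coincides with a candidate
        # Each point is tested independently of the others, so this step
        # has no sequential dependency within a round.
        picked = [p for p, c in zip(points, costs)
                  if rng.random() < min(1.0, oversampling * c / total)]
        candidates.extend(picked)
    return candidates


def candidate_weights(points, candidates):
    """Weight each candidate by the number of points closest to it."""
    weights = [0] * len(candidates)
    for p in points:
        nearest = min(range(len(candidates)),
                      key=lambda i: squared_distance(p, candidates[i]))
        weights[nearest] += 1
    return weights


def recluster(candidates, weights, k, seed=0):
    """Reduce the weighted candidates to at most k centers (weighted k-means++)."""
    rng = random.Random(seed)
    centers = [rng.choices(candidates, weights=weights, k=1)[0]]
    while len(centers) < k:
        scores = [w * nearest_cost(c, centers) for c, w in zip(candidates, weights)]
        if sum(scores) == 0:
            break  # fewer than k distinct weighted candidates
        centers.append(rng.choices(candidates, weights=scores, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    cands = sample_candidates(data, oversampling=4)
    print(recluster(cands, candidate_weights(data, cands), k=2))
```

The candidate set is small compared with the data, so the final reclustering can run on one machine, as the text notes.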

C. SCALABLE PARALLEL FUZZY C-MEANS
This section presents a parallel scalable fuzzy c-means algorithm that combines the Spark programming model with a special initialization method. Although Hadoop is now the most popular distributed processing framework, Spark provides a richer programming model and more effective support for iterative and interactive tasks. To obtain a better initialization process and better algorithm performance, this paper introduces a specific initialization method developed from k-means++ and k-means//. A series of resilient distributed dataset (RDD) operations provided by Spark, that is, a series of transformation and action functions, is used to implement the parallel scalable fuzzy c-means algorithm.
The scalable parallel fuzzy c-means algorithm combines a special initialization process with the main iteration of fuzzy c-means. Both parts are designed on the Spark programming model. Spark's distributed programming method builds distributed applications from multiple RDD operations. Map-type operations such as flatMap and mapPartitions distribute parallel tasks from the driver to each worker node, and action operations collect the partial results from each worker node and return them to the driver. Because the algorithm pseudocode involves many RDD operation functions, the important RDD operations are listed first, as shown in Table 2.
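The map-then-collect pattern described above can be illustrated for one center update of standard fuzzy c-means. The sketch below does not use Spark: plain Python lists stand in for data partitions, a per-partition function plays the role of a `mapPartitions` task, and `functools.reduce` plays the role of the action that combines partial results on the driver. All function names are hypothetical, and this is not the paper's pseudocode.

```python
from functools import reduce


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def memberships(point, centers, m):
    """Standard fuzzy c-means memberships of one point to every center."""
    d2 = [squared_distance(point, c) for c in centers]
    zeros = [i for i, d in enumerate(d2) if d == 0.0]
    if zeros:  # the point coincides with one or more centers
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centers))]
    inverse = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in d2]
    total = sum(inverse)
    return [v / total for v in inverse]


def partial_sums(partition, centers, m):
    """Worker side: accumulate sum(u^m * x) and sum(u^m) for each center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    weights = [0.0] * len(centers)
    for point in partition:
        for i, u in enumerate(memberships(point, centers, m)):
            um = u ** m
            weights[i] += um
            for j, value in enumerate(point):
                sums[i][j] += um * value
    return sums, weights


def merge(left, right):
    """Driver side: add two partial results element by element."""
    sums = [[a + b for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(left[0], right[0])]
    weights = [a + b for a, b in zip(left[1], right[1])]
    return sums, weights


def update_centers(partitions, centers, m=2.0):
    """One fuzzy c-means center update computed from per-partition partial sums."""
    partials = [partial_sums(part, centers, m) for part in partitions]
    sums, weights = reduce(merge, partials)
    return [tuple(s / w for s in row) if w > 0 else tuple(c)
            for row, w, c in zip(sums, weights, centers)]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    start = [(0.0, 0.0), (10.0, 10.0)]
    print(update_centers([data[:2], data[2:]], start))
```

Because only the per-center sums and weights travel back to the driver, the amount of data exchanged per iteration depends on the number of clusters and dimensions, not on the number of samples. In PySpark the same structure would typically be expressed with `RDD.mapPartitions` followed by `RDD.reduce`.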
Fuzzy c-means is sensitive to initialization, which affects both the iterative process and the result. Poor initialization can lead to too many iterations and to convergence to a local optimum, and random initialization cannot guarantee a stable iterative process or clustering quality. In big data analysis, where computation is distributed across a cluster, each analysis task is relatively expensive. Algorithms used for big data analysis are therefore generally expected to offer a relatively stable process and quality, and the practice of running an algorithm several times and averaging the results is usually not used for large-scale computation in a distributed environment. Introducing the generalized k-means// yields approximately optimal initial cluster centers, which speeds up convergence and reduces the number of iterations. It stabilizes the iterative process while preserving the quality of the algorithm, and it avoids repeated runs of the algorithm in a distributed environment.
The initialization part of the algorithm generalizes two ideas from k-means//, sampling by probability and sub-sampling, to obtain faster convergence and a higher-quality initialization result. Sampling by probability means that each sample is drawn with a probability equal to its share of the clustering objective function value of the entire sample set. Sub-sampling means that, during initialization, a number of samples are drawn probabilistically from each data block in a distributed manner, and the samples obtained are then processed locally to yield c samples as the initial cluster centers. The algorithm can be roughly divided into three stages. Firstly, samples are drawn with a probability equal to their share of the total cost. Secondly, the sampled points are weighted by the number of samples assigned to each of them. Finally, a fuzzy c-means clustering is performed locally on the driver side, and the few samples obtained are clustered into c cluster centers as the output of the initialization process.
Because the algorithm is designed on the Spark programming model and uses the data operations and intermediate result reuse provided by the model, the algorithm is expressed directly in terms of Spark function primitives. The meaning of these functions is given in Table 2. The algorithm input includes the object file submitted to the shared file system, a parameter init for the number of iterations of the initialization process, and the sampling factor from k-means//. The resulting center set is passed to the main iteration as the initial cluster centers.
The algorithm steps are expressed using resilient distributed dataset (RDD) operations. The target file is loaded as an RDD, and an arbitrary point in the data set is taken as the first initial cluster center. The core iteration of the initialization process is developed from the k-means// algorithm. The number of iterations is fixed to a constant value; according to k-means//, five or more iterations are enough to achieve good results. In the iterative loop, when the current cluster center set is C, the sum of the sample costs over all samples X is denoted φX(C).
Having explained the meaning of each step in the initialization process, we now describe the idea of the entire initialization process:
(1) The first stage is distributed probability sampling over the entire sample set. The data set is first flattened into a one-dimensional collection of samples, the cost of each sample is calculated one by one, and the total cost of the sample set is then obtained by a reduce (sum) operation. The total cost is sent to each block, and samples are drawn according to probability independently in each block. The samples obtained from each block are collected, yielding a small number of samples that account for a large proportion of the total cost.
(2) In the second stage, the small number of samples obtained are weighted. Every other sample is assigned to the nearest of these sampled points, and the number of samples assigned to each sampled point is counted in a distributed manner. This count is used as the weight of the sampled point when it is treated as a cluster center.
(3) The third stage performs a fast local fuzzy c-means clustering. The small number of weighted samples are quickly clustered into c centers, and these c centers are used as the initial cluster center set.
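The third stage can be illustrated with a small weighted fuzzy c-means routine that runs on the driver over the few candidate points. This is a sketch under stated assumptions rather than the paper's procedure: the names `memberships` and `weighted_fcm` are hypothetical, the weights are assumed to enter the center update as multiplicative factors on the usual membership terms, and the c heaviest candidates are assumed as starting centers.

```python
def memberships(x, centers, m=2.0):
    """Standard fuzzy c-means memberships of point x for each center."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    if min(d2) == 0.0:  # x coincides with one or more centers
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def weighted_fcm(candidates, weights, c, m=2.0, iters=50):
    """Cluster weighted candidate points into c centers on a single machine."""
    # Assumption: start from the c candidates with the largest weights.
    order = sorted(range(len(candidates)), key=lambda j: -weights[j])
    centers = [candidates[j] for j in order[:c]]
    dim = len(candidates[0])
    for _ in range(iters):
        num = [[0.0] * dim for _ in range(c)]
        den = [0.0] * c
        for x, w in zip(candidates, weights):
            for i, u in enumerate(memberships(x, centers, m)):
                s = w * u ** m  # weight scales the usual u^m term
                den[i] += s
                for k in range(dim):
                    num[i][k] += s * x[k]
        # A center with no membership mass keeps its previous position.
        centers = [tuple(v / d for v in row) if d > 0 else old
                   for row, d, old in zip(num, den, centers)]
    return centers
```

Since the candidate set is far smaller than the full data set, a fixed number of local iterations such as the 50 assumed here costs little compared with one distributed pass over the data.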
The algorithm is designed for parallel execution and implemented on a distributed model, which makes it highly scalable. This scalability is reflected in three aspects: processing scale, vertical expansion, and horizontal expansion. Processing scale scalability refers to the size of the data that the algorithm can process: the algorithm can support data analysis tasks as the data volume grows, without any modification. Both vertical and horizontal scalability are related to the distributed architecture selected for the algorithm. For vertical scalability, if the cluster is upgraded by adding computing resources such as processors, the algorithm can exploit the resulting performance improvement without modification. For horizontal scalability, if more servers are added to the cluster to increase its computing power, the algorithm again needs no modification and can support large-scale cluster analysis.

IV. SIMULATION AND RESULTS ANALYSIS OF DATA MINING ALGORITHMS
A. EXPERIMENTS ON A SIMPLE DATA SET
A simple artificial data set is shown in Figure 4.
This data set includes valid data and noise data. We randomly select 18 points in this data set and number them from 1 to 18. The noisy data points are numbered 8 and 9. We then apply the combined algorithm to this data set and compare the simulation results of the three algorithms. Table 3 lists the membership values of each data point under the traditional fuzzy c-means clustering algorithm, the fuzzy c-means clustering algorithm with fuzzy entropy, and the scalable parallel fuzzy c-means clustering algorithm. Figure 5 shows the membership trend for each algorithm. It can be seen that the traditional fuzzy c-means clustering algorithm has poor anti-noise performance: the noise data is treated as valid data during clustering. When the clustering validity function is added, the two classes of data are grouped into three clusters, with the noise data forming a cluster of its own as if it were valid data, which leads to incorrect clustering results. With the fuzzy c-means clustering algorithm with fuzzy entropy and the scalable parallel fuzzy c-means clustering algorithm, the memberships of noisy data points 8 and 9 are close to 0. When the clustering validity function is added to optimize the number of clusters, these two algorithms are not affected by the noisy data, so the correct clustering results are obtained and the anti-noise performance is good. Table 4 shows the values of the clustering validity function, accuracy, precision, sensitivity, and specificity for the three algorithms. The clustering validity function of the scalable parallel algorithm has the smallest value; it performs better than the fuzzy c-means clustering algorithm based on fuzzy entropy and greatly improves on the traditional fuzzy c-means algorithm. The traditional fuzzy c-means algorithm shows deviations in accuracy, precision, sensitivity, and specificity because the noise points are treated as valid data during clustering, whereas the fuzzy c-means algorithm with fuzzy entropy and the scalable parallel fuzzy c-means algorithm have better anti-noise performance.
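For reference, the four classification measures reported in Table 4 can be computed from a confusion matrix as in the sketch below. It assumes that cluster labels have already been matched to ground-truth classes and that one class is treated as positive; how the paper performs this matching is not stated, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(true_labels, pred_labels, positive=1):
    """Accuracy, precision, sensitivity and specificity for one positive class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(a, b):
        # Undefined ratios (empty denominator) are reported as NaN.
        return a / b if b else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }
```

A noise point that is clustered as valid data shows up here as a false positive or false negative, which is why an algorithm with poor anti-noise performance deviates on all four measures.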

B. EXPERIMENTS ON COMPLEX 2D DATASETS
A complex artificial two-dimensional data set is shown in Figure 6.
Gaussian noise with a mean of 0.51 and a variance of 0.216 was artificially added to this data set, and the combined algorithm was applied to it. Figure 7 shows the convergence trajectory of the cluster centers for the fuzzy c-means algorithm without the opponent suppression method and for the fuzzy c-means algorithm with the opponent suppression method. The comparison shows that the cluster centers of the fuzzy c-means clustering algorithm with the opponent suppression method converge significantly faster, and that the optimal number of clusters c is automatically determined to be 4 because of the added clustering validity function. Figure 8 shows the convergence trend of the objective function for the same two algorithms. The objective function of the fuzzy c-means clustering algorithm with opponent suppression converges quickly to its minimum value, while the objective function of the traditional fuzzy c-means clustering algorithm converges more slowly. Figure 9 shows the clustering results of the combined fuzzy c-means clustering algorithm. Because of the added clustering validity function, each algorithm can automatically and accurately determine the optimal number of clusters, c = 4. The traditional fuzzy c-means algorithm has poor anti-noise performance and clusters the noisy data as valid data. The fuzzy c-means clustering algorithm with fuzzy entropy and the scalable parallel fuzzy c-means clustering algorithm can eliminate the effect of some of the noise data on the valid data. Of these, the scalable parallel fuzzy c-means clustering algorithm has the best anti-noise performance, and the noise data has the least impact on its clustering results. At the same time, the algorithm converges quickly because it integrates the opponent suppression method.
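The noise injection used in this experiment can be sketched as follows. This is one possible reading, assumed for illustration: every coordinate of every point is perturbed by an independent Gaussian draw with the stated mean and variance. The paper does not say whether existing points are perturbed or separate noise points are added, and the function name `add_gaussian_noise` is hypothetical. Note that Python's `random.gauss` takes a standard deviation, so the variance is converted with a square root.

```python
import math
import random


def add_gaussian_noise(points, mean, variance, seed=0):
    """Perturb each coordinate with Gaussian noise of the given mean and variance."""
    rng = random.Random(seed)
    sigma = math.sqrt(variance)  # gauss() expects a standard deviation
    return [tuple(v + rng.gauss(mean, sigma) for v in p) for p in points]


# Usage with the parameters quoted in the text (mean 0.51, variance 0.216).
clean = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
noisy = add_gaussian_noise(clean, mean=0.51, variance=0.216, seed=42)
```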
Table 5 lists the performance index values of the three algorithms. The clustering validity function of the scalable parallel algorithm has the smallest value, and its performance is better than that of the fuzzy c-means algorithm with fuzzy entropy. The fuzzy c-means algorithm based on scalable parallel constraints takes into account the differences between classes, that is, it maximizes the dissimilarity between different classes, and it is able to assign low membership values to noisy data points, so it has good anti-noise performance.
To further verify that the anti-noise performance of the scalable parallel fuzzy c-means algorithm is better than that of both the traditional fuzzy c-means algorithm and the fuzzy c-means algorithm with fuzzy entropy, noise is artificially added to the IRIS data set. The experimental results are shown in Figure 10.
From Figure 10, we can conclude the following: (1) The fuzzy c-means algorithm with fuzzy entropy and the scalable parallel fuzzy c-means algorithm have better anti-noise performance, because after the introduction of information entropy, the iterative process of the algorithm changes from the original uniform contraction to a non-uniform, direction-dependent contraction, making the final clustering result more consistent with the actual distribution; (2) The error rate of the fuzzy c-means algorithm with information entropy is low, because this algorithm considers not only the information of the data set but also the influence of membership, and it introduces an adjustment factor w into the fuzzy entropy-based clustering algorithm. The factor w, used within the scalable parallel clustering algorithm, gives the membership formula the characteristics of a Gaussian distribution. The probability of noise data points belonging to a cluster center is therefore relatively small, which effectively suppresses the impact of noise data on the cluster centers; (3) The scalable parallel fuzzy c-means clustering algorithm has the lowest error rate and the best anti-noise performance. This is because it considers the relationship between the same data object and different clusters, and also takes into account the influence of other classes on a given class, making the clustering results more accurate.

V. CONCLUSION
As an important part of data mining, cluster analysis has been widely used in various fields. Although various clustering algorithms have been proposed, different algorithms have their own characteristics and are used in different environments and fields. This paper proposes a scalable parallel fuzzy c-means clustering algorithm that combines the Spark programming model with a special initialization method for fuzzy c-means. While preserving the clustering ability of the algorithm, fuzzy c-means can be applied to distributed scenarios in a parallel and highly scalable manner, so that after parallelization and improved initialization the algorithm can effectively perform cluster analysis of large-scale data. To replace the random initialization of fuzzy c-means, the improvement strategy of k-means is adopted, which improves the time performance and accuracy of the overall algorithm while producing better initial cluster prototypes. The improved methods are integrated, and the combined algorithm is applied to a simple data set and to a complex two-dimensional data set. The clustering validity function, accuracy, precision, sensitivity, and specificity were calculated to evaluate the clustering results, and a good clustering effect was obtained. However, the methods proposed in this paper have all been studied and run on numerical data. How to apply the existing clustering methods effectively to non-numerical attributes is a problem to be studied in the next step.

YUXING XIANG was born in Chongqing, China, in 1997. He received the bachelor's degree from Chongqing Technology and Business University, in 2019. He is currently pursuing the degree with the School of Information and Electrical Engineering, Hebei University of Engineering. His research interests include data mining and machine learning.
RUIXIAO ZHAO was born in Henan, China, in 1996. She received the bachelor's degree from the Xinxiang University of Science and Technology, in 2018. She is currently pursuing the degree with the School of Information and Electrical Engineering, Hebei University of Engineering. Her research interests include data mining and machine learning.