Random Partition Based Adaptive Distributed Kernelized SVM for Big Data

In this paper, we present a distributed classification technique for big data that efficiently uses the distributed storage architecture and data processing units of a cluster. While handling such large data, the existing approaches consider specific data partitioning techniques which demand that the complete data be processed before partitioning. This leads to an excessive overhead of high computation and data communication. The proposed method does not require any pre-structured data partitioning technique and is also adaptive to big data mining tools. We hypothesize that an effective aggregation of the information generated from data partitions by subprocesses of the complete learning process can lead to accurate prediction results while reducing the overall time complexity. We build three SVM based classifiers, namely one phase voting SVM (1PVSVM), two phase voting SVM (2PVSVM), and similarity based SVM (SIMSVM). Each of these classifiers utilizes the support vectors as the local information to construct the synthesized learner, efficiently reducing the training time and ensuring minimal communication between processing units. An extensive empirical analysis demonstrates the effectiveness of our classifiers compared to other existing approaches on several benchmark datasets. Among the existing methods and our three proposed methods (1PVSVM, 2PVSVM, and SIMSVM), SIMSVM is the most efficient. On the MNIST dataset, SIMSVM achieves an average speedup ratio of 0.78 and a minimum scalability of 73% when the data size is scaled up to 10 times. It also retains high accuracy (99%), similar to centralized approaches.

learners are used to generate the final learning results. The advantage achieved by this type of solution is a reduction in time complexity because of the distribution of data processing over a cluster. Here, a cluster is a set of nodes where each node contains a processing unit and a distributed storage space.

Revisiting all the above mentioned challenges, in this paper, we propose three different solutions for distributed learning using SVM: i) one phase voting SVM (1PVSVM), ii) two phase voting SVM (2PVSVM), and iii) similarity based SVM (SIMSVM) for performing classification over large datasets. Each of the proposed schemes considers that the data is distributed over a distributed space in the form of blocks or chunks. We design our approaches to accommodate the cluster computing scenario where some blocks are locally available to the processing units. The complete proposed model consists of 4 phases; each phase maintains total independence among the parallel executing subproblems by ensuring no data or information exchange among them. An effective intermediate result aggregation scheme is also proposed to aggregate the results of the subproblems. This aggregation further guarantees that there are no anomalies in the final aggregated result, using a deduplication method. Further, our approaches do not require any specific data partitioning scheme and are adaptive to commonly used distributed storage solutions like the Hadoop Distributed File System (HDFS).
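The four-phase flow described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; `train_local_svm` is a hypothetical stand-in for any per-partition SVM solver that returns the support vectors of its block.

```python
# Illustrative sketch of the proposed pipeline: random block
# partitioning (phase 1), independent per-block training (phase 2),
# and deduplicating aggregation of support vectors (phase 3).

def random_blocks(data, m):
    """Split `data` into m chunks in arrival order -- no special
    partitioning scheme, mirroring HDFS-style block storage."""
    size = max(1, len(data) // m)
    return [data[i:i + size] for i in range(0, len(data), size)]

def distributed_train(data, m, train_local_svm):
    # Phase 2: each block is processed independently; no data or
    # information is exchanged between the subproblems.
    local_svs = [train_local_svm(block) for block in random_blocks(data, m)]
    # Phase 3: aggregate the local support vectors into one pool,
    # removing duplicates so the final result has no anomalies.
    pool, seen = [], set()
    for svs in local_svs:
        for point in svs:
            key = tuple(point)
            if key not in seen:
                seen.add(key)
                pool.append(point)
    return pool
```

Phase 4 then builds one of the three classifiers (1PVSVM, 2PVSVM, or SIMSVM) from the pooled reduced points.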

Contributions: Some of the key features of the proposed distributed and scalable approaches are as follows:
• The approaches allow random data partitioning, i.e., they do not require a specific partitioning technique to be applied before distributing data over a distributed storage.

• The approaches can handle a large amount of data by efficiently dividing the learning process over the disjoint subsets of the data.
• Three distributed solutions, namely 1PVSVM, 2PVSVM, and SIMSVM, are proposed to train the classifier using linear and non-linear kernels. The techniques have the same accuracy as the native SVM. Moreover, they show a good amount of speedup and are highly scalable with low training time.

• The methods are inherently adaptive to clustered or cloud solutions for large data analysis.

• No exchange of data is required between two distributed components, resulting in low communication overhead.

Roadmap:
The immediate section II demonstrates the need for support vector machines to handle large datasets. The overall description of the proposed approaches is given in section III, and section IV explains how to handle the multiclass classification problem through the proposed approaches. The ensuing section V comprises the experimental results and analysis on several benchmark datasets to validate the efficacy of the proposed approaches. Section VI indicates some key observations from this research work along with some important research implications. Finally, section VII concludes the paper with the final remarks.

This process automatically distributes the computation overhead of the kernel function over different submodules.

To further optimize the distance margin maximization process, framed inputs are provided to the SVM. This results in the third type of techniques. In this type of approach, clustering of the complete dataset is performed first, and then SVM is applied on the clustered data. The training time can be diminished by dividing the data such that pre-clustered data is present in each partition and is used as an input to the classifier. In a distributed manner, different clustering algorithms can be applied, such as k-means, which divides the data into k clusters [27] before feeding the data into the SVM. CSVM [28] uses the k-means algorithm to divide the big data into k independent clusters. Then learning is performed using weighted local SVMs trained in each cluster. To reduce the communication cost, communication avoiding support vector machine (CASVM) [29] uses a balanced k-means clustering. This technique performs better when the data is non-overlapping, and its performance degrades for overlapping datasets. Another approach, DTSVM [30], applies decision tree algorithms to divide the large data into k disjoint regions (tree leaves). A local SVM is used to classify the regions having heterogeneous points (in the context of class labels). So, for any test point, first the decision tree is used to select the appropriate region, and then the SVM is used for the final class prediction. Using a decision tree algorithm, [31] suggested a classification method that identifies low entropy data regions. Later, Fisher's linear discriminant is applied to reduce the overall data by selecting data only from the decision tree boundaries. This reduced data is further used by the SVM for the generation of the separating planes. [32], [33] proposed a distributed SVM approach which is a combination of k local SVMs.
The complete data is partitioned into k clusters and local SVMs are applied over each cluster of the data. A difference in the approach of [33] is that it follows a balanced clustering method for load balancing among the subproblems. One vital issue with this third category is that these methods require pre-clustering of the complete data, which means the complete data must be present at one place; this in turn increases the complexity of the overall process. However, our proposed methods do not have such a centralized data requirement and can thus save this data preparation time.
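For contrast with our partitioning-free schemes, the pre-clustering idea behind this third category can be sketched as follows. The plain k-means routine and the routing helper are illustrative assumptions, not code from the cited papers; a local SVM trained per cluster would supply the final prediction.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means over tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # recompute centroids (keep the old one if a cluster empties)
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

def route(test_point, centroids):
    """A test point is first routed to its nearest cluster; the local
    SVM of that cluster would then predict the label."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(test_point, centroids[c]))
```

Note that this family of methods needs the complete data in one place for the clustering pass, which is exactly the requirement our approaches remove.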

Problem Statement: Large datasets are often stored in small-sized partitions over a distributed storage. Removing the overhead of any special partitioning, which preserves the centrality of data over a distributed storage, is a challenge for large data mining. This work addresses the challenges in extracting the relevant information from randomly partitioned dataset subsets and aggregating this information without any duplication to generate the complete information for classification, while maintaining the accuracy and minimizing the training time.

Of the four phases shown in Fig. 2, the first phase is data partitioning, in which the complete data is divided into m subsets. The second phase is executed separately over each partition or collection of partitions of the data. The partitioned data is used to generate intermediate results in the form of support vectors, which are compressed boundary points for class labeling. In the third phase, the results generated in the second phase are aggregated using a deduplication process which removes duplicate results. Finally, the fourth phase is executed to develop the three types of classifier, either by applying a voting scheme (one phase or two phase) or by applying the similarity based approach. Table 1 depicts some commonly used notations in the paper.

One of the core challenges with large datasets is that it is hard to process the complete data in a single iteration because of the limited storage and processing power of a system. Therefore, data partitioning is an effective way to deal with such massive datasets. In our approach, the complete dataset D is divided into m small blocks/data chunks as {D_1, D_2, ..., D_m} and stored over a distributed storage. The analysis over this distributed data mimics a non-uniform sampling process.
The entire dataset D is given by:

D = {(x_i, y_i) | i = 1, 2, ..., N}

where x_i is the i-th feature vector and y_i is its class label. The division of D into m non-overlapping subsets is expressed as:

D = D_1 ∪ D_2 ∪ ... ∪ D_m,  with D_i ∩ D_j = ∅ for i ≠ j

where N_j is the sample size of the j-th partition, so that N = N_1 + N_2 + ... + N_m. For each partition D_i, a separating hyperplane W^T x + b = 0 is learned, where x_j is a vector which represents a data point in the D_i input space, y_j ∈ {+1, −1} is the target class label for the j-th data point, b is a scalar bias term, and W is the coefficient weight vector corresponding to the hyperplane separating the two classes.

The exploration of all the hyperplanes along with the corresponding support vectors from each partition is done using the constrained optimization originally suggested by Vapnik [1]. Suppose there exists a hyperplane W^T x + b = 0; then the following inequality always remains true:

y_j (W^T x_j + b) ≥ 1

For each point which lies exactly on the margin hyperplane of either of the classes, the following equality holds:

y_j (W^T x_j + b) = 1

For each independent learner executed over its set of data partitions (D_i), the same margin is considered. Margin maximization is the maximization of the distance 2/||W|| between the two classes, which is the same as minimizing ||W||/2. The margin minimization function for each partition is given in Equation 6 in the case of a hard margin and Equation 7 in the case of a soft margin:

min_{W, b} (1/2) ||W||^2  subject to  y_j (W^T x_j + b) ≥ 1    (6)
where ζ_j ≥ 0 and a regularization parameter r is added to regularize ζ. This leads to the soft-margin formulation as follows:

min_{W, b, ζ} (1/2) ||W||^2 + r Σ_j ζ_j  subject to  y_j (W^T x_j + b) ≥ 1 − ζ_j,  ζ_j ≥ 0    (7)

This optimization problem is solved using its Lagrangian dual form. The same problem can be written in matrix form as:

min_{W, b, ζ} (1/2) W^T W + r e^T ζ

where e is a k × 1 vector of ones. Applying the Lagrangian formulation, we get the dual optimization problem as:

max_α e^T α − (1/2) α^T Q α  subject to  y^T α = 0,  0 ≤ α_j ≤ r

where Q is the k × k matrix with Q_{uv} = y_u y_v K(x_u, x_v). This optimization problem can be solved using standard quadratic programming solvers; the points with non-zero multipliers α_j form sv_i(j, k), which is the set of support vectors for class j and class k while considering the hyperplane between these two classes.
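A minimal sketch of how one independent learner might minimize the per-partition soft-margin objective and extract its support vectors. We use a simple hinge-loss subgradient descent as an illustrative stand-in for the constrained solver, not the paper's actual optimizer; the function name and hyperparameters are our assumptions.

```python
def train_soft_margin(points, labels, r=1.0, lr=0.01, epochs=200):
    """Stochastic subgradient descent on (1/2)||W||^2 + r * sum(zeta_j);
    the regularizer gradient is split across the n samples."""
    n = len(points)
    W = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(w * xi for w, xi in zip(W, x)) + b)
            if margin < 1:  # hinge active: step towards the constraint
                W = [w - lr * (w / n - r * y * xi) for w, xi in zip(W, x)]
                b += lr * r * y
            else:           # constraint satisfied: only shrink W
                W = [w * (1 - lr / n) for w in W]
    # Points on or inside the margin are the support vectors that
    # phase 2 forwards to aggregation (0.1 tolerance for SGD noise).
    support = [x for x, y in zip(points, labels)
               if y * (sum(w * xi for w, xi in zip(W, x)) + b) <= 1.1]
    return W, b, support
```

In the distributed setting this routine would run once per partition D_i, and only the returned support vectors leave the node.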
Thus, all the generated sv_i are added to the distributed pool.

The final phase of our proposed method is performed in three different ways, namely i) one phase voting SVM (1PVSVM), ii) two phase voting SVM (2PVSVM), and iii) similarity based SVM (SIMSVM). 1PVSVM and 2PVSVM include the generation of final cumulative hyperplanes by using a reduced set of points obtained after the deduplication process. The primary differences between these two approaches are the input selection process and the counting process. SIMSVM is a similarity based approach which uses a similarity measure to decide the class label for a test point.

Any test point X_test is tested by using a two phase voting scheme as follows: 2) Second, the voting is performed within the separator functions for the final classification.
Fig. 7 shows the overall process of 2PVSVM, which clearly illustrates the differences from 1PVSVM in the context of the input selection process and the counting process for class label generation.
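The voting step shared by 1PVSVM and 2PVSVM can be sketched as follows; the `separators` mapping and its pair-keyed layout are illustrative assumptions, not the paper's API.

```python
from collections import Counter

def vote(x, separators):
    """Each pairwise separator casts one vote for a class; the
    most-voted class becomes the label of test point x.

    separators: {(j, k): f} where f(x) > 0 votes for class j,
    otherwise for class k."""
    tally = Counter()
    for (j, k), f in separators.items():
        tally[j if f(x) > 0 else k] += 1
    return tally.most_common(1)[0][0]
```

In 2PVSVM, a scheme of this shape would be applied twice: once within groups of separator functions and once across them.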

1PVSVM and 2PVSVM carry out several executions of the complete SVM, as both approaches require constructing the decisive separating planes from the reduced points, which increases the computation requirement in this phase. To overcome this computational overhead, we propose another approach, similarity based SVM (SIMSVM), which takes the reduced points and their class labels from the distributed pool (<x_1, y_1>, <x_2, y_2>, ..., <x_z, y_z>) as input and predicts the class labels for the test points as shown in Fig. 8. Algorithm 2 demonstrates the complete pseudocode for predicting the class label of a test point X_test.
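One plausible reading of Algorithm 2 (our sketch, not the verbatim pseudocode) is a linear scan over the distributed pool for the closest reduced point:

```python
import math

def simsvm_predict(x_test, pool):
    """Return the label of the reduced point closest to x_test.

    pool: list of (point, label) pairs -- the deduplicated reduced
    set gathered from all partitions in phase 3."""
    min_dist, label = float("inf"), None
    for point, y in pool:
        d = math.dist(x_test, point)
        if min_dist > d:  # mirrors the MinDist update in Algorithm 2
            min_dist, label = d, y
    return label
```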

The classification of a test point X_test is done based on a similarity measure between X_test and the existing set of reduced points in the distributed pool, each of which is associated with a class label. The SIMSVM scheme uses the distance between two points as the similarity measure, and the test point X_test is classified into the class of the reduced point which has the minimum distance from it. Suppose, for a test point X_test, the class label y_test needs to be predicted. Then y_test is labeled as y_test = y_m if X_m is the closest point to X_test, where y_m is the label of the point X_m.

To handle non-linearly separable data, a transformation of the data into a higher dimension is required. The use of a kernel function for measuring distances in the higher dimensional space avoids this explicit transformation, as the distance between two vectors can be calculated as:

d(x, z)^2 = K(x, x) − 2 K(x, z) + K(z, z)
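In the higher dimensional feature space, the squared distance can be expanded through kernel evaluations alone; a short sketch follows (the RBF kernel and its gamma value are illustrative choices, not fixed by the paper):

```python
import math

def rbf(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))

def kernel_dist_sq(x, z, k=rbf):
    # ||phi(x) - phi(z)||^2 = K(x,x) - 2K(x,z) + K(z,z):
    # the high-dimensional map phi is never computed explicitly.
    return k(x, x) - 2.0 * k(x, z) + k(z, z)
```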

IV. HANDLING MULTICLASS CLASSIFICATION PROBLEM
SVMs generate binary classifiers; however, large datasets often contain more than two classes. There are several ways to handle multiclass classification using SVM. In the case of our approaches, we perform multiclass classification by dividing it into multiple binary classification problems.
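For example, in a one-against-one decomposition the C classes yield C(C − 1)/2 binary problems, one per class pair; `class_pairs` below is an illustrative helper, not the paper's code.

```python
from itertools import combinations

def class_pairs(C):
    """The C(C-1)/2 class pairs trained one-against-one; pair (j, k)
    indexes the lower-triangle entry sv_i(j, k) of the per-partition
    support-vector matrix."""
    return list(combinations(range(C), 2))
```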

A. ONE-AGAINST-ONE APPROACH

To accomplish multiclass classification, instead of trying to discriminate one class from all the others, this approach distinguishes one class from another single class. As a result, it trains the model iteratively for different pairs of classes and generates the required hyperplanes and support vectors.

(Algorithm 3: pseudocode of the one-against-one classification process.)

Algorithm 3 demonstrates the complete one-against-one classification process. The loops in the 4th and 5th lines lead to C(C − 1)/2 calculations of support vectors for each dataset subset D_i, taking the input data points of two classes at a time as D_i(j, k). The sv_i is a matrix whose individual elements, written as sv_i(j, k), denote the support vectors of class j against class k in the i-th subset of the dataset. Similarly, we can find m such matrices (where m is the total number of partitions of the data). The dimension of each matrix is C × C with elements only in the lower triangle, and these matrices are combined as ∪_{i=1}^{m} sv_i(j, k). This approach is used by each separator function while dealing with multiclass data.

We have conducted a rigorous experimental analysis in order to validate the efficacy of our proposed classifiers. The essential objectives of this analysis are to answer explicit queries about handling a large dataset in a distributed environment. The proposed approaches are implemented on a 10-node cluster, where each node has 2 CPU cores, 6 GB of RAM, and a frequency of 2.6 GHz. Each cluster node is capable of running two tasks in parallel. Our proposed approaches have been analyzed using different real-world datasets as shown in Table 2.
The datasets are from different domains and include both linearly separable and non-linearly separable data.

We have measured the various schemes based on a few important aspects: the training time and the accuracy. To inspect the speedup of the proposed approaches, we calculate the change in training time when new resources are added, and to analyze the scalability of the proposed approaches, we calculate the speedup when resources as well as data size increase. To observe the incremental adaptation, we estimate the accuracy with the addition of each new data chunk.

The analysis shown in Fig. 9 has been done by creating several partitions of the data to analyze the amount of information generated by the separator function in phase 2 for the Two Moon dataset. The complete dataset has been divided into 10 partitions, numbered 1 to 10, and stored over a distributed storage in two ways, using uniform sampling and non-uniform sampling of the data. Each subprocess (mapper) accesses these data blocks and uses Algorithm 1 to process them. After processing, it writes all the results back to the distributed storage in a clustered environment. In the proposed clustered environment, instead of processing the complete dataset, each slave node can work on its locally available partition of the data. These slave nodes independently process these partitions, and the analysis shows the amount of information generated by the slave nodes.

2) TRAINING TIME ANALYSIS

Our proposed classifiers have been trained using different datasets. The complete training time consists of the time required to execute the second, third, and fourth phases of the proposed model. Fig. 10 (which shows the training time on a log scale due to the wide range of training times for different datasets) and Table 3 (which shows the actual training time in seconds) present the comparison of the training time taken by the proposed approaches against the existing approaches when creating four partitions of each dataset.
The results from the plots and the table clearly mark the significant improvements over the existing approaches. 1PVSVM performs better compared to 2PVSVM, except for the MNIST8m dataset, due to its efficient sampling of only two classes at once. SIMSVM exhibits the best performance compared to LIBSVM, CASVM, and BCSVM because it efficiently distributes the learning process among several slave processes. SIMSVM gains over BCSVM because clustering of the data consumes a significant amount of time; the difference can be observed in Fig. 10. SIMSVM also shows improvements over 1PVSVM and 2PVSVM as it does not require the separator function to be executed over the reduced data points.

An analysis has also been conducted for comparing the accuracy of the approaches. The training data is first distributed and used as an input to phase 2 of the proposed model; the test data partition is used for testing in phase 4, and the accuracy is measured. Fig. 13 and Table 4 evidently state that there is no significant loss in accuracy while using the proposed SIMSVM distributed approach as compared to LIBSVM. The proposed approach shows either similar performance or an improvement over the BCSVM approach. This consistency in accuracy is achieved because the information about each partition of the data is effectively compressed in phase 2 of the proposed model. 1PVSVM shows a slightly higher accuracy compared to 2PVSVM, as 2PVSVM loses a small amount of accuracy due to the random sampling from the reduced points in phase 4. This selection results in less accurate planes between two classes because all the points of these classes are not considered at once for generating the final hyperplanes. In contrast, SIMSVM shows better performance than 1PVSVM as it uses a direct similarity measure (along with the kernel trick) between a test point and a reduced point.
The proposed approaches are also compared with other machine learning techniques, and a comparative analysis is presented in Fig. 14. All three proposed approaches maintain similar accuracy with respect to the other machine learning techniques. Each proposed approach is also tested for its implementation as an incremental model of classification, to serve the purpose of continuous data stream mining. In this analysis, the data partitions are incrementally added to the existing pool of information, which is used for testing of the test data. Fig. 15 depicts the basic model used for this testing, where the information represents the support vectors extracted after completing phase 2 of the proposed model. Each new data partition generates new data points (D_i), which are combined with the existing reduced points (info_{i−1}). Thus, a new reduced information set is produced and considered as the information (info_i) for the next iteration. Phases 3 and 4 then use this info_i for further processing and testing, which results in a better prediction of the class label for the test data. Fig. 16a and 16b plot the classification accuracy of the proposed approaches using the MNIST and Two Moon datasets, dividing each dataset into 10 disjoint subsets and then incrementally adding these subsets for incremental training. As new data points are added, SIMSVM takes advantage of the incrementally growing information pool.

If the number of partitions is four, then four parallel processing modules (mappers) are used, one for each partition (each generates SVs in phase 2). The objective of this analysis is to observe the speedup in the amount of time required to process the data as the number of processing units increases.
We have calculated a speedup ratio measure, which is the ratio between the training time TT(1) required for training over the complete dataset using a single processing module and the training time TT(p) required for training over p partitions of the dataset using p processing modules.

The approaches indicate that at least 60% scalability can be achieved even when the size of the problem is scaled up to 10 times.
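These two measures can be formalised as follows; the timing values in the usage note are illustrative, not measurements from the paper.

```python
def speedup_ratio(tt_1, tt_p):
    """TT(1) / TT(p): training time with one processing module on the
    full dataset over training time with p modules on p partitions."""
    return tt_1 / tt_p

def scalability_factor(tt_base, tt_scaled):
    """Efficiency retained when data size and module count grow by the
    same factor; 1.0 means the scaled problem costs no extra time."""
    return tt_base / tt_scaled
```

For instance, if a hypothetical run took 120 s on one module and 30 s on four, the speedup ratio would be 4.0; a scalability factor above 0.73 corresponds to the minimum SIMSVM retains in our experiments.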

As the scale of the problem increases, the minimum scalability that SIMSVM shows is 73%, due to its high independence among the processing modules as compared to 1PVSVM and 2PVSVM.

FIGURE 18. Scalability factor analysis for MNIST dataset by iteratively adding data instances as well as processing modules.

Our proposed approaches, 1PVSVM, 2PVSVM, and SIMSVM, are distributed classification techniques that utilize the power of distributed computing to predict the class labels for massive datasets. These approaches are quite appropriate for, and adapt well to, widely used data processing models like MapReduce, Spark, etc. In this section, we have highlighted some significant findings from our approach.

• It has been empirically observed that by adding more resources, the training time can be minimized during the initial additions. After a certain point, depending on the dataset distribution, the rate of decrease in training time slows down, because at this point each data partition contains the reduced set of data points that are marked as essential points for classification and passed to the next phase.

• Although the proposed approaches attain good accuracy over different datasets, similar to other centralized and distributed approaches in the literature, one important research question that might arise is the cost associated with the addition of processing units, which requires a trade-off between the training time and the overall resource cost.

• The proposed approaches efficiently retain the accuracy when the distributed data contains an imbalanced class distribution. This property makes the approaches more adaptive to real-time scenarios and removes the overhead of preprocessing the data to balance it.

• We have presented the scalability analysis of the proposed approaches considering the addition of a new processing unit for each new data partition. However, deciding the number of partitions or processing units is crucial and depends on the availability of resources.

• The three proposed approaches are tested for their incremental versions. Here again, an interesting research question might arise: when to update the classifiers, or when to add incremental information to the existing information pool.