Magnetic Force Classifier: A Novel Method for Big Data Classification

The machine learning literature offers a plethora of classifiers; however, no classifier is optimal in terms of both accuracy and the time taken to build the trained model, especially given the tremendous growth of Big data. Hence, there is still room for improvement. In this paper, we propose a new classification method based on the well-known magnetic force. Based on the number of points belonging to a specific class/magnet, the proposed magnetic force (MF) classifier calculates the magnetic force at each discrete point in the feature space. Unknown examples are classified using the magnetic forces recorded in the trained model by the various magnets/classes. According to experimental results on 28 different datasets, the proposed MF classifier achieves classification accuracy comparable to that of existing classifiers. More importantly, we found that the proposed MF classifier is significantly faster than all other classifiers tested, particularly when applied to Big datasets, and hence could be a viable option for structured Big data classification with some optimization.


I. INTRODUCTION
Artificial intelligence, and machine learning in particular, is a hot research area because of its numerous applications in various fields and contexts. Examples of applications include, but are not limited to: Natural Language Processing [1]-[6], Computer Vision [7]-[10], Game theory [11], [12], Speech Recognition [13], Security [14]-[24], Medical diagnosis [25], [26], Statistical Arbitrage [27], Network Anomaly Detection [28]-[32], Learning associations [33], [34], Prediction [35]-[39], Extraction of information [40]-[43], Biometrics [44]-[46], Regression [47], Financial Services [48]-[53] and Classification [54]-[56]. Depending on their perspective on the problem and the approach utilized, scholars have defined machine learning in different ways. However, the underlying definition is nearly identical across the board and revolves around the definition of Brett Lantz, who stated that machine learning is the process of developing computer algorithms for translating data into intelligence [57]. Supervised learning is a broad topic of machine learning that entails learning a data-to-output label mapping. The most popular subtopic in supervised learning is the classification of labeled data [58].
The associate editor coordinating the review of this manuscript and approving it for publication was Chien-Ming Chen.
During the previous few decades, classifiers have been intensively studied and analyzed. Classifiers are widely used in many modern applications as a key computing technology. Many classifiers exist, such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and many more. In terms of accuracy and the time consumed in building a trained model, each classifier has advantages and limitations; some are more effective with specific datasets than others, and hence there is no optimal classifier that can perfectly classify all types of data.
The time spent building the trained model, or the test time for classifiers that do not build a trained model, such as the KNN and its variants, is one of the primary concerns with classical classifiers when employed on Big data. Such long running times make a classifier impractical for Big data classification.
In order to hasten the learning process, particularly when training on Big data, we developed in this paper a new classifier based on the idea of Magnetic Force (MF), in which we programmatically imitate the way magnetic force works, as inspired by the work of [59]. In this approach, each class is represented by a specific magnet, and the iron filings represent the unknown data points that need to be classified. The magnetic field around each magnet/class is formed by the examples belonging to that class. To calculate the force of each magnet/class on each point in the feature space, we apply the Inverse Square Law (ISL). Because the feature space may contain continuous data, calculating the MFs over an infinite space is problematic. Therefore, we opt for discretizing the feature space by binning each value in each dimension. Thus, the MF for each class can be calculated for each bin. After building the MF model, which is based simply on the training data, any unknown data point (iron filing) will be attracted to the strongest MF and classified as belonging to that particular class.
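The binning step described above can be sketched as follows. This is only an illustration under stated assumptions: equal-width bins per feature, and the hypothetical helper names `fit_bins` and `to_bins`. The paper does not specify the binning scheme at this point.

```python
def fit_bins(rows, n_bins):
    """Return one (minimum, bin width) pair per feature (ASSUMED equal-width bins)."""
    n_features = len(rows[0])
    ranges = []
    for k in range(n_features):
        column = [row[k] for row in rows]
        lo, hi = min(column), max(column)
        # A constant feature gets width 1.0 so that every value falls in bin 0.
        width = (hi - lo) / n_bins if hi > lo else 1.0
        ranges.append((lo, width))
    return ranges


def to_bins(row, ranges, n_bins):
    """Map one real-valued example to a tuple of bin indices in [0, n_bins - 1]."""
    bins = []
    for value, (lo, width) in zip(row, ranges):
        index = int((value - lo) / width)
        # Clamp so that the maximum value and out-of-range test values stay valid.
        bins.append(min(max(index, 0), n_bins - 1))
    return tuple(bins)
```

The ranges are learned from the training data only and then reused for unknown points, so a test value outside the training range is simply clamped to the first or last bin.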
Formally, we can say: given a numeric dataset D with dimensionality d and n training examples belonging to either class A or class B, if we can discretize these examples by binning them into b discrete values, then we can generate two magnets (MA and MB) and store their forces in M, which is a b × d matrix. Since the MF weakens as the distance between the magnet and the iron filing grows, we use the ISL, so that the nearest iron filings are attracted to either MA or MB more strongly than the farthest. We can form two matrices, one for each magnet/class, or one matrix with two values in each cell. This definition applies similarly to multi-class datasets.
Obviously, the time complexity of the proposed approach is linear, O(n), since it depends mainly on the number of examples in the training set, particularly when n ≫ d, where d is the number of features of the training dataset. Equally important is the small size of the trained model, b × d: if d is relatively small, then the time and space complexity of the generated model will be O(c), where c is a constant. The ability to classify Big data at such a low level of time and space complexity is critical.
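The complexity claim can be made concrete with a rough count. The derivation below is a sketch under an assumption of ours, namely that training performs one update per (example, feature, bin) triple; the symbol q for the number of classes is also introduced here for illustration and does not appear in the paper.

```latex
% Assumption: one update per (example, feature, bin) triple during training.
T_{\text{train}} \propto n \cdot d \cdot b
\quad\Rightarrow\quad
T_{\text{train}} = O(n) \ \text{for fixed } b \text{ and } d,
\qquad
S_{\text{model}} = b \cdot d \cdot q
\quad\Rightarrow\quad
S_{\text{model}} = O(1) \ \text{for fixed } b,\ d,\ q .
% Hypothetical numbers: b = 9, d = 20, q = 2 gives 9 * 20 * 2 = 360 stored forces.
```

Under this count, the model size does not depend on n at all, which is why the stored model stays small no matter how many training examples are processed.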

II. LITERATURE REVIEW
Although the KNN classifier is an extremely slow and lazy learner [60]-[65], it is still used extensively for Big data classification because it skips the training process and thus saves the training time, which is extremely long for most traditional classifiers. However, it is used with a helper to hasten the test phase: indexing, clustering, sorting, and reducing the data, among other preprocessing steps, help speed up the classifier when dealing with Big data.
For example, Hassanat [66] created a new approach that sorts the training examples into a binary search tree to speed up Big data classification. Two methods were used. The first, referred to as the Extreme Points-based binary search tree (EPBST), identifies local points based on their similarity to global extreme points. The second, referred to as the Random-Points-based binary search tree (RPBST), selects the local points randomly. Experiments on medium and large datasets show reasonable accuracy rates compared to other related methods, including the pure KNN classifier.
Another approach by the same author [67] was proposed to classify Big data using the KNN classifier. The KNN classifier was used in conjunction with inserting the training examples into a binary search tree (BST) to speed up the search process, given that a BST can be searched in logarithmic time. Two methods for sorting the training examples were examined. The first, called NBT, calculates the minimum/maximum criterion measured and rounds it to 0 or 1 for each instance, while the second, called MNBT, inserts each instance into the BST depending on its similarity to the minimum and maximum instances. The experiments show competitive accuracy and fast classification compared to the other methods examined.
In a similar work [68], the author created a similar BST, which is used by the KNN classifier to classify Big data. The difference is that this method is based on finding the diameter of a dataset, which is then used to sort the training examples in the BST. This method has a high potential for classifying Big data and can also be generalized to other applications, particularly when speed is a key factor. The outermost pair of examples is found for each BST node and the examples in that node are sorted accordingly, after which the generated BST can be searched. The examples retrieved in this way are used to classify the test example using the KNN classifier. The results showed the method's efficiency in terms of speed and accuracy compared to the other methods examined. However, compared to the pure KNN, the accuracy rates are not ideal, and the method needs further improvement.
The diameter of a dataset used in the previous work was found using the methods proposed in [69], whose authors implemented four simple algorithms to approximate the diameter of a multi-dimensional dataset, because most algorithms do not cope well with a large number of dimensions, since their time complexity grows significantly in most cases. The results of the experiments conducted on different machine learning datasets demonstrate the efficiency of the implemented algorithms. These methods are helpful for finding the diameter of a Big dataset, which might be used by machine learning applications. For example, the authors of [70] utilize the diameter concept to solve the imbalanced data problem by synthesizing similar minority examples, using an over-sampling technique based on the furthest neighbour algorithm.
In order to improve on the previous work [66]-[68] in terms of computation time, space consumption, and accuracy, the main contribution of [71] is the conversion of the resultant BST to a decision tree. This eliminates the need for the slow KNN and results in a smaller tree, which reduces memory usage, speeds up both the training and testing phases, and improves the classification accuracy as well. The reported experimental results show that the proposed methods are more efficient in terms of space, speed, and accuracy than [66]-[68] and the other methods compared. Nevertheless, the proposed methods still have a lot of room for improvement before they can be used in practice.
The authors of [72] proposed a new approach that improves the instance-based KNN algorithm by adopting a clustering-based strategy to reduce its computational cost. The performance was improved by addressing the problem of instances, expanding its size using neural networks to obtain a suitable representation for the classification process. The reported results show better performance compared to the other methods examined.
Another work [73] proposes a new MapReduce-based approach to classify Big data. The map stage finds the k-nearest neighbors in various data partitions. Following that, the reduce stage computes the exact neighbors from the map stage's list. The proposed approach enables the KNN classifier to scale to any dataset size by adding more nodes as needed. Furthermore, this kind of parallel implementation gives the same classification rate as the original KNN classifier. The reported results, utilizing a dataset with up to 1 million instances, reveal that the proposed approach scales well with Big data.
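The map and reduce stages just described can be illustrated with a small single-process sketch. It is not the implementation of [73]: the function names are hypothetical, the partitions are plain Python lists rather than distributed data, and squared Euclidean distance is assumed.

```python
import heapq
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_partition(partition, query, k):
    """Map step: the k nearest (distance, label) pairs within one partition."""
    candidates = [(squared_distance(x, query), label) for x, label in partition]
    return heapq.nsmallest(k, candidates)


def reduce_candidates(candidate_lists, k):
    """Reduce step: merge local candidates, keep the global k nearest, and vote."""
    merged = [pair for local in candidate_lists for pair in local]
    nearest = heapq.nsmallest(k, merged)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def mapreduce_knn(partitions, query, k):
    return reduce_candidates([map_partition(p, query, k) for p in partitions], k)
```

Every one of the global k nearest neighbors is necessarily among the k nearest of its own partition, so merging the local lists loses nothing. This is why such a scheme can match the classification rate of plain KNN.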
Another MapReduce-based approach is proposed by [74]. The main focus of this proposal is on test set management, with the goal of keeping the test set in memory wherever possible. Otherwise, it is partitioned into the smallest number of chunks possible, utilizing MapReduce per chunk and Spark's caching capabilities to reuse the previously partitioned training data. The reported results, utilizing a dataset with up to 11 million examples, reveal that the proposed approach scales well with Big data.
A multivariable random decision tree is proposed by [75]. The purpose of this method is to speed up the classification process utilizing two partition methods. The first (MDT1) partitions the data randomly, while the second (MDT2) uses Principal Component Analysis (PCA) to partition the data. According to the reported results, both methods allow for short training time and competitive accuracy on Big datasets.
To speed up the KNN classifier, [76] proposed two methods. The first uses random clustering (RC-KNN), and the second uses landmark spectral clustering (LC-KNN). The clustering process is based on the well-known K-means clustering, which is mainly used to split the entire dataset into several smaller parts. This makes applying the KNN to test instances much faster than applying it to the entire dataset. The reported results of both methods, LC-KNN and RC-KNN, showed better performance when tested on Big datasets.
In [77], the authors conducted a survey-like study that included the state-of-the-art techniques and tools used for big data classification. They also analysed various approaches that have been followed for big data classification. The advantages and disadvantages of each approach were discussed for different types of machine learning techniques. Finally, they provided the readers with a survey of many open-source libraries that have been used in big data.

III. THE MAGNETIC FORCE CLASSIFIER (MF)
In this paper, we propose a supervised machine learning Magnetic Force classifier (MF) based on the concept of magnetic force attracting an iron filing. For example, if we place a piece of iron between two or more magnets, it will be attracted to the one that exerts the greatest force on it, which depends on the strength of the magnet and the distance from it.
To illustrate, assume that we have three magnets (M1, M2, and M3), an iron filing (F), and a set of arrows representing magnetic forces and directions of attraction towards each magnet, as shown in Figure 1. The blue color scale indicates the presence of varied magnetic strengths in the area; the fainter the hue, the stronger the magnetic force. The highest magnetic force found at each point determines the direction of each arrow, as seen in this diagram. Despite the fact that F is closer to M2 and M3, it is drawn to M1 in this case because M1's force on F is stronger.
We simulate this physical process by representing each class in a training dataset by a magnet with a specific force for each point in the feature space. The training process of the MF classifier ends up generating a trained model that incorporates all of the magnetic forces for each class for each point in the feature space. However, because the feature space is typically defined in terms of real numbers, the model's size would be infinite if each feature were not digitized into discrete values. Binning the training data set allows the MF at each location for each class/magnet to be determined.
We used the ISL to reflect the fact that the magnetic force weakens with distance: each training point belonging to a class adds, say, 1 Tesla to the force of its magnet/class at that particular point, and it also affects all the other locations by adding a fraction of force (FF) given by Equation 1, where d is the block distance from the current cell. The output model is a 2D matrix of vectors, where each vector in each cell contains the magnetic force of each class. Table 1 and Figure 2 show the MF trained model of a toy example dataset with two classes/magnets and two features using 10 bins.
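The training step can be sketched as below. Since the exact fraction-of-force formula is not shown in this text, the sketch ASSUMES FF = 1/d² for d ≥ 1, which is the usual reading of an inverse square law, and it may differ from the authors' Equation 1. It also assumes the examples have already been converted to bin indices and treats each feature independently. The name `train_mf` and the `model[feature][bin][class]` layout are illustrative choices.

```python
def train_mf(binned_rows, labels, n_bins):
    """Build model[k][j][c]: accumulated force of class c at bin j of feature k."""
    classes = sorted(set(labels))
    n_features = len(binned_rows[0])
    model = [[{c: 0.0 for c in classes} for _ in range(n_bins)]
             for _ in range(n_features)]
    for row, label in zip(binned_rows, labels):
        for k, home in enumerate(row):
            for j in range(n_bins):
                distance = abs(j - home)  # block distance from the example's bin
                # ASSUMPTION: 1 unit at the example's own bin, 1 / d**2 elsewhere.
                model[k][j][label] += 1.0 if distance == 0 else 1.0 / distance ** 2
    return model
```

With one example of class 'A' in bin 0 and one of class 'B' in bin 2 of a single three-bin feature, the middle bin receives a force of 1.0 from each class, while each end bin is dominated by the class that sits there.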
As demonstrated in Figure 2, larger values exhibit stronger magnetic forces.
During the testing phase, we assume that each tested example is represented by an iron filing, and we assign the test example to the class with the maximum total magnetic force by reading the MF model at the location of the iron filing for each feature, as follows:
$$C = \arg\max_{1 \le i \le n} \sum_{k=1}^{m} MF_{k}^{(i)} \qquad (2)$$
where C is the predicted class, m is the number of features, MF_k^{(i)} is the magnetic force of class i read at feature k, and n is the number of classes. For example, assume we have a point P_1 = (1, 2). According to the toy example model shown in Table 1, the force of class 1 is 4.20 for feature 1 and 6.92 for feature 2, so C_1, the total MF of class 1, is 11.12. Similarly, the force of class 2 is 0.33 for feature 1 and 0.51 for feature 2, so C_2, the total MF of class 2, is 0.84. Therefore, the predicted class is class 1 according to Equation 2. The training and testing phases are described in Algorithms (1) and (2), respectively.
VOLUME 10, 2022
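The testing rule can be checked against the worked numbers quoted for the toy example. The sketch below is a minimal illustration: `predict` is a hypothetical helper that receives, for each class, the forces already read from the model at the test point's bins (one value per feature), sums them, and returns the class with the largest total.

```python
def predict(forces_at_point):
    """Sum the per-feature forces of each class and return the strongest class."""
    totals = {c: sum(values) for c, values in forces_at_point.items()}
    return max(totals, key=totals.get), totals


# Forces quoted in the text for the toy point (1, 2): class -> [feature 1, feature 2].
label, totals = predict({1: [4.20, 6.92], 2: [0.33, 0.51]})
print(label, {c: round(t, 2) for c, t in totals.items()})
```

Classification therefore needs only one table lookup per feature and one addition per class, regardless of the size of the training set.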

IV. RESULTS AND DISCUSSION
We chose 28 datasets of various sizes to assess the proposed MF classifier: small, medium, and Big datasets. Our implementation is based on these numeric datasets, which were retrieved from the UCI website [144]. The feature values are first mapped to a discrete range, and the bins are then built from this range in order to calculate the magnetic force at each position. Therefore, selecting the right number of bins for MF is critical. We conducted multiple pilot tests utilizing the proposed MF classifier with the number of bins varying over 2, 3, 4, ..., 15, applied to four datasets (small, medium, and Big), to determine which number of bins is preferable for the MF classifier. Table 3 shows the accuracy of MF for each number of bins.
As shown in Table 3, the number of bins had no effect on the Big datasets. This is due to the class imbalance problem, as both Big datasets (SUSY and Poker) are class imbalanced; we discuss this later. However, for the small and medium-sized datasets (wine and Chinese-minst), the larger the number of bins, the higher the accuracy. We should also mention that the accuracy improves from 9 bins onward. We chose 9 bins for the MF classifier in our experiments as the best option at present in terms of accuracy and model size, because model size has a considerable impact on memory consumption and testing time.
The proposed MF classifier has been tested and compared to five of the most common classifiers, namely KNN, SVM, NB, RF, and DT (J48). The comparison focuses on the classification accuracy and the time consumed in both the training and testing phases of each classifier. We made these comparisons on the same computer, which has the following specifications:
• Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
• 16.0 GB of RAM
• 64-bit operating system
• Application used: Weka v 3.8.5 64-bit
Because the performance of the proposed MF is compared to the aforementioned classifiers using WEKA [145], which is written in Java, we implemented the MF classifier in Java too, so that the results would be directly comparable.
We used 5-fold cross-validation to evaluate the MF classifier in a range of trials, comparing the results to the baseline classifiers and assessing the accuracy and the time spent in the training and testing phases. Table 4 shows the times spent by all classifiers in both the training and testing phases for all of the 5-fold training/testing experiments. As can be noted from these results, the MF classifier required significantly less time than the other classifiers. The times were comparable on small datasets, because most classifiers are quick on small datasets. However, using the baseline classifiers on Big datasets such as Poker, SUSY, and HIGGS took over 4 weeks to build and test their models, so we halted these experiments and were unable to add their results to Table 4.
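A timing experiment of this kind can be organised as in the sketch below. It is a hedged illustration rather than the authors' Java/WEKA harness: `k_fold_indices` and `timed` are hypothetical helpers, the shuffle seed is arbitrary, and any classifier's training and testing functions could be passed to `timed`.

```python
import random
import time


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def timed(function, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start
```

Summing the elapsed training and testing times over the five folds gives one total per classifier and dataset, which is the kind of figure a table of running times reports.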
The evaluated baseline classifiers are inefficient when used on Big data for two reasons: 1) they take an unreasonable amount of time to build and test, and 2) they require a lot of memory. In contrast, the MF classifier took roughly 10 minutes to build and test its model (five times, one per fold) for the biggest dataset (HIGGS).
The speed of the MF classifier shown in Table 4 is due to its minimal complexity in time and space.
In terms of classification accuracy, Table 5 displays the accuracy of classifying 25 datasets (small and medium) using the MF and baseline classifiers. The results on the Big datasets are instead compared to current Big data classifiers, as we were unable to obtain baseline results for them owing to time and memory constraints.
In real life, large magnets have larger magnetic fields, and hence a larger attraction force on iron filings. The same physics applies to the proposed MF classifier: we expect that classes with large numbers of examples will dominate the MF learning process at the expense of examples belonging to minority classes. Because a large number of instances from the same class produces a strong magnetic force, unclassified examples may be incorrectly assigned to the majority class even though they are closer to a minority example.
Because the majority of the datasets used in this study are class imbalanced, in order to investigate the impact of class imbalance on our MF classifier, we used the MF classifier on the datasets as-is, then balanced them, and compared the results before and after the balancing process. For simplicity, we balanced all datasets by randomly dropping extra examples from the majority class [146].
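Random under-sampling of this kind can be sketched as follows. This is an illustrative version, not the authors' code: `random_undersample` is a hypothetical name, the seed is arbitrary, and for multi-class data the sketch assumes every class is reduced to the size of the smallest one.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop examples so that every class keeps as many as the smallest class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced_rows, balanced_labels = [], []
    for label, group in by_class.items():
        # sample() draws without replacement, so no example is duplicated.
        for row in rng.sample(group, target):
            balanced_rows.append(row)
            balanced_labels.append(label)
    return balanced_rows, balanced_labels
```

Because the discarded examples are chosen at random, informative examples can be lost, so a balanced dataset is not guaranteed to give higher accuracy than the original one.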
As can be observed from the classification accuracy results in Table 5, and as expected, after balancing, the accuracy increased for the majority of datasets. However, after balancing, the accuracy results of some datasets are close to or less than the accuracy before balancing. This is due to the data balancing method we utilized; because we used Random Under-sampling, where instances are discarded at random, these instances may include important information for better learning.
The impact of class imbalance on MF may be examined further by comparing the confusion matrices before and after balancing the data to better understand how the MF classifier behaves while classifying balanced and imbalanced data.
For simplicity, we calculated the confusion matrices before and after balancing for four datasets only. A closer examination of the confusion matrices in Figure 3 offers further information about the MF model's evaluation: before balancing, the majority of the examples were classified as belonging to the majority class, but after balancing, the classification results improved significantly.
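For readers who want to reproduce this kind of analysis, a confusion matrix can be computed with a few lines of code. The sketch below uses a hypothetical helper and made-up labels. Rows correspond to actual classes and columns to predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (classes, matrix) where matrix[i][j] counts actual i predicted as j."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return classes, [[counts[(a, p)] for p in classes] for a in classes]


# Made-up labels: one minority example is absorbed by the majority class.
classes, matrix = confusion_matrix(
    ['maj', 'maj', 'maj', 'min', 'min'],
    ['maj', 'maj', 'maj', 'maj', 'min'],
)
```

A minority row whose counts pile up in the majority column is the pattern to look for on imbalanced data: overall accuracy can stay high even though most minority examples are misclassified.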
Based on this finding, we conclude that the proposed MF classifier has a fundamental limitation when learning from class imbalanced data in its current form.
Previous research has employed a variety of methods to speed up big data classification. We compared the proposed MF classifier's efficiency, in terms of accuracy and time spent building and testing the model, with the findings of prior studies, which include EPBST and RPBST [66], FPBST [68], NBT and MNBT [67], the iterative MapReduce-based approach for kNN (MR-KNN) [73], the iterative Spark-based design of the kNN classifier (KNN-IS) [74], and the randomly partitioned multivariate decision tree (MDT1) and the PCA-partitioned multivariate decision tree (MDT2) [75]. Each of these methods has published results on one or more of the three Big datasets D26, D27, and D28, also known as Poker, SUSY, and HIGGS.
Using 5-fold cross-validation, we compared the classification results achieved by the MF classifier to EPBST, RPBST, FPBST, NBT, MNBT on the three Big datasets. All of these methods rely on constructing a BST model during the training phase, which makes the test example search significantly faster than sequential search. The other Big data classifiers are also included in the comparison as shown in Table 6.
It is worth noting that the time reported in Table 6 covers both the training and testing phases for some methods and training only for others, depending on the published results. Likewise, the evaluation protocol is not always 5-fold cross-validation; in some cases it is holdout with test-to-train ratios ranging from 19% to 90%.
As shown in Table 6, the MF classifier consumes significantly less time, demonstrating its high speed and efficiency when applied to Big data. The MF accuracy results are comparable to those of the other Big data classifiers, and significantly better than all of them on the SUSY Big dataset. Figures 4, 5, and 6 depict the significantly shorter time consumed by the MF classifier compared to the other Big data classifiers evaluated on the three Big datasets used.
The time comparison was based on the times reported in the literature, but we believe it is valid because most of the previous methods compared used machines with high specifications and multiple CPUs so as to be able to deal with Big data. For example, EPBST, RPBST, FPBST, NBT, and MNBT all used an Azure high-performance computing virtual machine with 16 CPUs and 32 GB RAM [66], [68], [67], while the proposed MF classifier was evaluated on a standard laptop with the modest specifications listed earlier in this section.

V. CONCLUSION
Inspired by the power of nature and the work of Hassanat et al. [59], we propose the MF classifier, which calculates the magnetic force at each discrete point in the feature space based on the number of points belonging to a certain class/magnet. The forces exerted by the various magnets/classes, which are recorded in the trained MF model, are then utilized to classify unknown samples.
We employed 28 small, medium, and big benchmark datasets to evaluate the proposed classifier, and compared the classification results and time consumed by the training and testing phases to a number of popular classifiers and a number of Big data classifiers.
The experimental findings reveal that when compared to the other classifiers, the proposed MF classifier achieves comparable classification accuracy. More importantly, we found that the proposed classifier is significantly faster than all of the other classifiers assessed, especially when applied to Big datasets and hence could be a viable choice for structured Big data classification.
The proposed classifier, however, has two major limitations, according to the results of the experiments: 1) deciding on the optimal number of bins for the MF model, which is determined by a variety of factors such as the size of the output model, classification accuracy, and data type; and 2) the sensitivity of the current version of the MF model to class-imbalanced data. Both of these issues will be addressed in future work.
MALEK ALRASHIDI received the B.Sc. degree in computer science from Taibah University, Saudi Arabia, in 2008, the M.Sc. degree from Newcastle University, U.K., in 2011, and the Ph.D. degree from the School of Intelligence Environment Research Group (IEG) and the Immersive Education Laboratory (iEL). He is currently an Assistant Professor at the University of Tabuk, Saudi Arabia. His main research interests include mixed, augmented, and virtual reality (MR/AR/VR), artificial intelligence (AI), ambient intelligence (AmI), the Internet-of-Things (IoT), cyber-physical systems (CPS), intelligent environments, computer-support collaborative work (CSCW), technology-enhanced learning (TEL), and human-computer interaction (HCI).
MANSOOR ALGHAMDI received the M.Sc. degree from the University of Otago, New Zealand, in 2011, and the Ph.D. degree from the School of Computer Science, Bangor University, U.K. He is currently an Assistant Professor with the Department of Computer Science, University of Tabuk. His main research interests include natural language processing, machine learning, text analytics in English and Arabic, virtual reality, and psychological aspects of mixed reality.
GHADA AWAD ALTARAWNEH received the Ph.D. degree in accounting from the University of Buckingham, U.K., in 2011. She is currently an Associate Professor with Mutah University. Her main interests include managerial accounting, auditing, and business intelligence.
MOHAMMAD ALI ABBADI received the Ph.D. degree in computer science from George Washington University, in 2000. He is currently a Professor with the Department of Computer Science, Faculty of Information Technology, Mutah University. His research interests include multimedia, networks, blended and e-learning, data compression, and image processing.