Multi-Disease Classification Model Using Strassen’s Half of Threshold (SHoT) Training Algorithm in Healthcare Sector

In the healthcare industry, neural networks have attained a milestone in solving many real-life classification problems, ranging from very simple to complex and from linear to non-linear. To improve the training process by reducing the training time, an adaptive skipping training algorithm named Half of Threshold (HoT) has been proposed. To perform fast classification and also to improve computational efficiency (accuracy, error rate, etc.), the highlighted characteristics of the proposed HoT algorithm have been integrated with Strassen's matrix multiplication algorithm, deriving a novel, hybrid, and computationally efficient algorithm for training and validating the neural network, named the Strassen's Half of Threshold (SHoT) training algorithm. The simulation-based experimental outcome demonstrates that the proposed SHoT algorithm outperforms both the BPN and HoT algorithms: training time is reduced by 7% to 54% and efficiency is improved by 3% to 15% on the Hepatitis, SPECT Heart, Liver Disorders, Breast Cancer Wisconsin (Diagnostic), Drug Consumption, Cardiotocography, Splice-junction Gene Sequences, and Thyroid Disease datasets extracted from the UCI Machine Learning Repository. The approach can be integrated with any type of supervised training algorithm.


I. INTRODUCTION
Every second, the amount of healthcare data generated by the healthcare industry grows exponentially (approximately 30% of the world's data volume) [1]; this growth rate is expected to reach 36% annually by 2025. At the same time, these data are mined to extract valuable information. Many of today's creative applications with large-scale datasets challenge the natural intelligence of the human brain, the most intelligent system on the planet, due to exponential growth in many scientific and medical sectors [2]. Learning new patterns in large-scale datasets manually, in a fast and intelligent manner, is beyond the capacity or patience of any human being. To address this issue, researchers developed the concept of the Neural Network (NN). Since 1943, neural networks have attained a milestone in solving many real-life classification problems, ranging from very simple to complex and from linear to non-linear [3]. Viewed from the technical and implementation aspect, however, training a neural network on very large datasets with the traditional back-propagation algorithm still faces many challenges. One of the biggest challenges faced by neural networks is the training rate. The elements that leverage a neural network's training rate are the network structure [4], the training dataset size, computational efficiency, and the problem to classify [5]; these elements are interrelated. (The associate editor coordinating the review of this manuscript and approving it for publication was Mauro Gaggero.)
Based on the problem considered for classification, the datasets are generated and utilized. Very large training datasets must be fed into the neural network to increase the efficiency of the training algorithm and to generalize the network. The network structure, however, can expand automatically for a larger training dataset, leading to increased training time as well as reduced efficiency. In practical terms, a larger training dataset usually requires a very long training time with more epochs, which leverages the training speed. The Half of Threshold (HoT) adaptive skipping training algorithm is applied to increase the training speed by minimizing neural network training time: samples in the training dataset are presented randomly to boost training performance. Also, as the size of the network structure increases, the weight matrix grows accordingly.
Among the operations that take place during training of a neural network with the back-propagation algorithm, matrix multiplication is the most computationally intensive. To make matrix multiplication faster, Strassen's algorithm [6] is prescribed for multiplying the matrices, as shown in Theorem 1. By combining the highlighted characteristics of the adaptive skipping training algorithm and Strassen's algorithm, the overall training time consumed by the neural network is reduced considerably, which leads to increased efficiency. By integrating the highlighted characteristics of the Half of Threshold (HoT) adaptive skipping training algorithm with Strassen's algorithm, a novel, hybrid, and computationally efficient algorithm called the Strassen's Half of Threshold (SHoT) adaptive skipping training algorithm has been proposed for training the neural network. With this proposed algorithm, the cumulative training time consumed by the neural network is significantly reduced, resulting in better training performance.

II. RELATED STUDY
Many researchers have contributed works toward improving the performance of training algorithms by increasing the training speed, improving the accuracy, decreasing the error rate, etc., through different enhancements: optimal estimation of initial weights, second-order algorithms for faster learning while maintaining generalization, and adaptive learning rate and momentum, which are surveyed in this section. Proper initialization of the NN's weights at the training algorithm's starting point minimizes the number of iterations in the training process, resulting in faster training. Initial weights have been demonstrated to affect the BPN technique [7]. In most cases, modest random numbers are chosen as the NN's initial weights. Nguyen and Widrow [4] assign a fraction of the intended response range to each hidden node, and Drago and Ridella [8] utilize a technique called statistically controlled activation weight initialization (SCAWI) that calculates the maximum value the weights should initially adopt to avoid neurons becoming saturated throughout the adaptation process. Some studies recommend utilizing a probability distribution of the mean squared error [9], and the Delta Pre-Training (DPT) approach [10] uses different sets of small initial weights to initialize and train the NN several times; DPT is a decent concept if the weight space is well conditioned. If the best of this group fails to meet the requirements, the process is restarted. Many people support this method, although it is essentially a trial-and-error approach with no mathematical foundation. Premature saturation, for example, can be caused by initial weight values that are excessively large. As a result, the ASCE task committee advises that random values between −0.30 and +0.30 be assigned to weights and thresholds as a starting point [11].
For Single Hidden Layer Feedforward Neural Networks (SLFNs), a new and fast learning technique called the Extreme Learning Machine (ELM) was published in 2004 [12], [13]; it selects the input weights randomly and derives the output weights analytically. Sensitivity analysis was employed in the development of a novel initialization strategy for neural networks [14], [15]. The outputs of the first layer are first assigned random values; once the original values have been modified using sensitivity formulae, the weights are determined using linear equations. The main benefit of this method is that it can obtain a good solution in only one epoch with minimal computation time. Starting with erroneous weight values, on the other hand, can trap the network in local minima or limit learning progress. To speed up the learning process, the initial weights were carefully chosen.
Previously, the momentum coefficient was usually treated as a constant between 0 and 1. However, experimental results revealed that a fixed momentum coefficient appears to speed up learning only when the recent error function's downward gradient and the latest weight change are in the same direction. When the newest negative gradient crosses the prior update, the momentum coefficient causes the weight modification to be projected up rather than down the slope of the error surface [16]. To make learning more successful, it is critical to change the momentum coefficient adaptively rather than keeping it constant throughout the training period. Even though the error function is not assumed quadratic, Zhang et al. claim that the BPN approach converges with a constant learning rate and adaptive momentum [17]. Both strong and weak convergence results are confirmed, and the method can escape local minima, thereby accelerating network training. However, the error gradient approaches zero as training enters a smooth region, causing the network to converge slowly.
The learning rate is constant and uniform across all weights in a layer in the BPN algorithm [18]. The parameter values fluctuate around the minima of the performance surface as gradient descent approaches the minima. With a constant learning rate, the network's parameters are changed in a fixed manner, resulting in sluggish convergence to the goal error [19]. Slowing down parameter updates by allowing the learning rate to fluctuate adaptively is one way to avoid this; it allows the network to make better responses after each weight update. The essential concept behind an adaptive learning rate is that if performance falls short of the error objective at an epoch, the learning rate is increased by a constant factor; another constant factor reduces the learning rate as performance improves. Several dynamic approaches for adaptively assigning the learning rate have been defined, based on the factor chosen for examination. Learning techniques based on Lyapunov stability theory have been suggested for NNs [20]. The structures of the Lyapunov-function-based learning algorithm (LF I) and its modified variant (LF II) are the same as that of the BPN method, except that the suggested algorithms substitute an adaptive learning rate for the fixed one. The error gradient approaches zero when training reaches a smooth region; the adaptive learning rate will then be high and weight adjustment delayed, resulting in slow convergence to the goal error.
Subsequently, algorithms that change the weights during the training phase by deriving second-order information from the cost function have been provided. The quasi-Newton and Levenberg-Marquardt (LM) algorithms [21], [22] and Conjugate Gradient (CG) methods [23] are the most often used second-order training algorithms. To perform fast classification and improve accuracy, a new training algorithm named GA-BEL, which combines a Genetic Algorithm (GA) with brain-inspired emotional learning (BEL), was proposed [24]. Based on the optimization technique Particle Swarm Optimization (PSO), M. H. Ali et al. proposed a learning model named PSO-FLN for the Fast Learning Network (FLN) [25], which was experimented on the intrusion detection dataset KDD99 and performs well in all aspects. Using the Extreme Learning Machine (ELM) as a baseline, a fast learning algorithm was applied to the Regular Fuzzy Neural Network (RFNN) [26], and a new fast learning method (FLM) has been presented for feedforward neural networks [27]. Next, based on the concept of adaptive skipping, a new and fast training approach for Artificial Neural Networks (ANNs) was instituted by presenting the input samples for training randomly [3], [5], and based on a fuzzy system [28]. To train a single hidden layer feedforward neural network (SLFN) and to optimize its weights, an algorithm was proposed that hybridizes the self-organizing map (SOM) algorithm with the ELM algorithm [29]. The CGP-based Artificial Neural Network (CGPANN), based on the Cartesian genetic programming (CGP) technique, is a fast-learning neuroevolutionary algorithm applicable to both feedforward and recurrent networks [30]. Even though these methods produce good outcomes, they are computationally intensive. Slow convergence has been identified as a serious issue for all BPN learning methods.
An innovative, hybrid, and computationally efficient neural network training algorithm named the Strassen's Half of Threshold (SHoT) adaptive skipping training algorithm has been presented to improve the training speed of BPN.

III. PROPOSED SHOT ALGORITHM
To conduct the research study effectively, a three-layer feedforward neural network has been used. The proposed network structure is built with N input neurons, P hidden neurons, and O output neurons. Since the prescribed neural network is fully interconnected, each neuron in a layer is linked with every neuron in the next layer. The input layer has the same number of neurons as the training dataset has attributes.

A. BASIC NOTATION
Given a training dataset with M labelled training samples, each sample is partitioned into a feature vector X and a target class Y. Let X ∈ R^(M×N) be the 2D matrix of size M × N populated with the data samples from the training dataset, containing M input samples and N attributes.
Since the training dataset used in this research is supervised, the corresponding target class label, represented as T ∈ R^M, accompanies the above M input samples.
Let V ∈ R^(N×P) be the 2D matrix of size N × P that holds the input-to-hidden synaptic weight coefficients assigned to each connection link established between the N input neurons and the P hidden neurons.
Let v_o represent the bias vector of size 1 × P that is fed into the nodes of the hidden layer.
Let W ∈ R^(P×O) be the 2D matrix of size P × O that holds the hidden-to-output synaptic weight coefficients assigned to each connection link established between the P hidden neurons and the O output neurons. Let w_o represent the bias vector of size 1 × O that is fed into the nodes of the output layer.
Let ϕ_h(x) and ϕ_o(x) represent the nonlinear sigmoid and linear activation functions adopted to compute the net output in the hidden and output layers, respectively. The symbol t denotes the iteration number. Let sf_i and sv_i be the skipping factor and skipping value of the i-th sample in the training dataset. Let d_max be the error threshold value. Let ic be the iteration/epoch count.

B. WORKING PRINCIPLE OF SHOT
Step 1 (Initialization Phase): Initialize the parameters of the constructed neural network.
Step 2 (Terminating Condition): Check whether the terminating criterion has been attained. If not, repeat steps 3-8; otherwise, go to step 9.
Step 3: Repeat steps 4-8 for each of the M training samples in the training dataset X, 1 ≤ k ≤ M.
Step 4 (Present the Training Sample): Training samples are distributed to the input layer of the network, which simply propagates them without any computation.
Step 5 (Forward Propagation): Starting from the hidden layer, the following computations are carried out layer by layer up to the output layer:
Step 5.1: For the hidden layer, compute the activation values. Apply Theorem 1 to estimate the net output value using Strassen's fast matrix multiplication algorithm, which is specified in Algorithm 1.
Estimate the actual output.
Step 5.2: For the output layer, compute the activation values. Apply Theorem 1 to estimate the net output value using Strassen's fast matrix multiplication algorithm, then estimate the actual output.
Theorem 1 (Strassen's Theorem): Two N × N matrices can be multiplied using only N^(log₂ 7) ≈ N^2.8074 scalar multiplications.
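Theorem 1 can be illustrated with a short recursive sketch. The paper's Algorithm 1 is implemented in MATLAB and is not reproduced here; the following illustrative Python version (assuming n is a power of two and falling back to ordinary multiplication at the base case) computes the seven Strassen products M1-M7 and assembles the four quadrants of the result:

```python
import numpy as np

def strassen(A, B):
    """Multiply two n x n matrices (n a power of two) using Strassen's
    seven-product recursion instead of the usual eight block products."""
    n = A.shape[0]
    if n <= 2:                      # base case: ordinary multiplication
        return A @ B
    h = n // 2
    # Split each matrix into four h x h quadrants.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's seven recursive products.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    # Recombine the products into the quadrants of C = A @ B.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

The recursion replaces the eight block multiplications of the naive scheme with seven, giving the O(N^(log₂ 7)) cost stated in Theorem 1; in practice the crossover size below which plain multiplication is faster is considerably larger than the base case used here.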
Step 6 (Error Signal Calculation): Using the squared error function, the error signal for each output neuron is calculated, and the error signals are summed to obtain the total error, E = (1/2) Σ_k (t_k − y_k)².
Step 6.1 (Error derivative for the hidden-to-output weights): Adjust the network's weights by calculating the partial derivative of the error with respect to each weight, so as to minimize the error E globally.
Expand the above error function using the chain rule, ∂E/∂w = (∂E/∂y)(∂y/∂net)(∂net/∂w): first take the derivative of the error with respect to the activation, then the derivative of the activation function with respect to the net input, and finally the derivative of the net input with respect to the synaptic weight; substituting the value of each derivative yields the weight gradient.
Step 6.2 (Error derivative for the input-to-hidden weights): Apply the same chain rule, taking the derivative of the activation function with respect to the hidden net input and the derivative of the net input with respect to the input-to-hidden synaptic weight.
Step 7 (Backward Propagation):
Step 7.1: Update the hidden-to-output weights using the Delta-Learning Rule.
Step 7.2: Update the input-to-hidden weights using the Delta-Learning Rule.
Step 8 (AST Algorithm):
Step 8.1: Calculate the difference between the target (t_k) and actual (y_k) outputs of the neural network, t_k − y_k.
Step 8.2: Compare the difference from step 8.1 with half of the error threshold value. Whenever a sample is classified correctly, the condition returns zero.
Step 8.3: Determine the computing probability of each input sample based on t_k − y_k.
Step 8.4: If prob(x_i) is 0, the corresponding input sample has been classified correctly and will be skipped from training for the next sf_i epochs.
Step 8.5: When a sample is skipped, its skipping value sv_i is increased by sf_i; once the skipping value sv_i returns to zero, the i-th input sample is presented again for training.
Step 8.6: Based on the skipping values sv_i, the modified training dataset to be presented in the next epoch is constructed.
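The bookkeeping of Step 8 can be sketched as follows. This is an illustrative Python fragment rather than the authors' implementation; the function names and the per-epoch decrement of sv_i are assumptions drawn from the description above:

```python
def select_epoch_samples(skip_values):
    """Samples with skip value 0 are trained this epoch; the rest are
    skipped and their counters decremented toward zero (assumed scheme)."""
    active = [i for i, sv in enumerate(skip_values) if sv == 0]
    for i in range(len(skip_values)):
        if skip_values[i] > 0:
            skip_values[i] -= 1
    return active

def update_skip_values(errors, threshold, skip_values, skip_factors):
    """After an epoch: samples whose |t_k - y_k| fell below half the error
    threshold (prob 0, i.e. classified correctly) are scheduled to be
    skipped for the next sf_i epochs."""
    for i, err in enumerate(errors):
        if abs(err) < threshold / 2:   # classified correctly -> skip
            skip_values[i] = skip_factors[i]
        else:
            skip_values[i] = 0         # keep training this sample
```

Each epoch only the active indices are fed to steps 4-7, which is where the training-time savings of the adaptive skipping scheme come from.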
Step 9: Stop the training process.
Once a machine learning algorithm has learned the fundamental patterns in the training data, it must be tested on test data. It is termed an efficient machine learning classifier model if it performs well on the test data and generalizes beyond the training dataset, which is measured using the classifier's performance metrics.
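Steps 5-7 above (forward propagation, error-derivative calculation, and the delta-learning-rule updates) can be condensed into a short sketch. The paper's experiments use MATLAB; the following Python fragment is an illustrative single-sample pass with variable names chosen to mirror the notation of Section III-A, a hypothetical learning rate, and ordinary matrix products in place of Strassen's routine for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, V, vb, W, wb, lr=0.1):
    """One delta-rule update for a three-layer net: sigmoid hidden layer,
    linear output layer, squared-error loss E = 0.5 * sum((t - y)**2).
    Updates V, vb, W, wb in place and returns the pre-update error."""
    # Step 5: forward propagation.
    h = sigmoid(x @ V + vb)                    # hidden activations (1 x P)
    y = h @ W + wb                             # linear output (1 x O)
    # Step 6: error derivatives via the chain rule.
    delta_o = y - t                            # dE/dnet at the output layer
    delta_h = (delta_o @ W.T) * h * (1 - h)    # propagated to the hidden layer
    # Step 7: delta-learning-rule weight updates.
    W -= lr * h.T @ delta_o
    wb -= lr * delta_o
    V -= lr * x.T @ delta_h
    vb -= lr * delta_h
    return 0.5 * np.sum((t - y) ** 2)
```

Repeated calls on the same sample should drive the squared error down, which is the behavior the terminating condition in Step 2 monitors.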

IV. SIMULATION-BASED EXPERIMENTAL RESULT AND ITS ANALYSIS
To conduct the research successfully, the proposed supervised machine learning SHoT algorithm was simulated with the following machine configuration: an Intel® Core i5 third-generation 3210M processor with a CPU speed of 2.50 GHz, running MATLAB R2010b.

A. DATASET DESCRIPTION
To assess the performance of the existing and proposed SHoT algorithms, both were tested on binary and multi-class classification datasets acquired from the UCI Machine Learning Repository [14]. The Hepatitis dataset contains 155 samples with 19 attributes and 2 classes (binary classification). The SPECT Heart dataset contains 267 samples with 22 attributes and 2 classes (binary classification). The Liver Disorders dataset contains 345 samples with 7 attributes and 2 classes (binary classification). The Breast Cancer Wisconsin (Diagnostic) dataset contains 569 samples with 32 attributes and 2 classes (binary classification). The Drug Consumption dataset contains 1885 samples with 32 attributes and 7 classes (multi-class classification). The Cardiotocography dataset contains 2126 samples with 23 attributes and 3 classes (multi-class classification). The Splice-junction Gene Sequences dataset contains 3190 samples with 19 attributes and 3 classes (multi-class classification). The Thyroid Disease dataset contains 7200 samples with 19 attributes and 3 classes (multi-class classification). The training dataset properties are shown in Table 1.

B. EXPERIMENTAL SETUP AND RESULT
To perform the experiment, the supervised machine learning algorithm is simulated using a 3-layer feedforward neural network. To enhance training performance and attain more accurate prediction, the ten-fold cross-validation technique is adopted for training the network model, so that all the training samples are eventually used for training.
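The ten-fold protocol can be sketched as follows. This is an illustrative Python fragment (the authors' MATLAB setup is not shown in the text); it generates index splits so that every sample appears in the validation fold exactly once:

```python
import numpy as np

def ten_fold_indices(n_samples, seed=0):
    """Split shuffled sample indices into 10 roughly equal folds; each
    fold serves once as the validation set, the rest as training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 10)
    for k in range(10):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, val
```

Per-fold training time and accuracy averaged over the ten splits are then the quantities reported in Tables 2-9.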
The performance of the proposed SHoT method is evaluated using eight real benchmark classification datasets taken from the UCI Machine Learning Repository: Hepatitis, SPECT Heart, Liver Disorders, Breast Cancer Wisconsin (Diagnostic), Drug Consumption, Cardiotocography, Splice-junction Gene Sequences, and Thyroid Disease. Training time and accuracy are used as performance indicators to assess the various supervised machine learning algorithms. Training time refers to the amount of time spent by the classifier throughout the training process. Accuracy is defined as the percentage of correctly classified samples:

Accuracy = (Number of samples that are classified correctly / Total number of samples) × 100
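As a concrete reading of this formula, a minimal Python helper (hypothetical, for illustration only) is:

```python
def accuracy(y_true, y_pred):
    """Accuracy = correctly classified samples / total samples * 100."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 3 of 4 correct -> 75.0
```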

1) HEPATITIS DATASET
The results of the various learning algorithms for training the Hepatitis dataset are summarized in Table 2 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 24% and 37%, respectively, while the SHoT method reduces it by 17% compared with the HoT algorithm.

2) SPECT HEART DATASET
The results of the various learning algorithms for training the SPECT Heart dataset are summarized in Table 3 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 10% and 46%, respectively, while the SHoT method reduces it by 39% compared with the HoT algorithm.

3) LIVER DISORDERS DATASET
The results of the various learning algorithms for training the Liver Disorders dataset are summarized in Table 4 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 9% and 28%, respectively, while the SHoT method reduces it by 20% compared with the HoT algorithm.

4) BREAST CANCER WISCONSIN (DIAGNOSTIC) DATASET
The results of the various learning algorithms for training the Breast Cancer Wisconsin (Diagnostic) dataset are summarized in Table 5 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 38% and 54%, respectively, while the SHoT method reduces it by 25% compared with the HoT algorithm.

5) DRUG CONSUMPTION DATASET
The results of the various learning algorithms for training the Drug Consumption dataset are summarized in Table 6 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 7% and 23%, respectively, while the SHoT method reduces it by 17% compared with the HoT algorithm.

6) CARDIOTOCOGRAPHY DATASET
The results of the various learning algorithms for training the Cardiotocography dataset are summarized in Table 7 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 7% and 20%, respectively, while the SHoT method reduces it by 13% compared with the HoT algorithm.

7) SPLICE-JUNCTION GENE SEQUENCES DATASET
The results of the various learning algorithms for training the Splice-junction Gene Sequences dataset are summarized in Table 8 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 9% and 26%, respectively, while the SHoT method reduces it by 18% compared with the HoT algorithm.

8) THYROID DISEASE DATASET
The results of the various learning algorithms for training the Thyroid Disease dataset are summarized in Table 9 for each fold of ten-fold cross-validation and compared with the proposed SHoT approach. Compared with the existing methods, the accuracy produced by the proposed SHoT method is improved. Furthermore, compared with an Artificial Neural Network using the BPN algorithm, the average time for the complete training process consumed by the HoT and SHoT methods is reduced by 39% and 54%, respectively, while the SHoT method reduces it by 24% compared with the HoT algorithm.

1) TRAINING TIME
The total training time consumed by the various training algorithms (BPN, HoT, and SHoT) at the end of each training fold, together with the average training time across all folds, is compared and represented in Figure 2.

2) ACCURACY
The comparison of the accuracy achieved by the various training algorithms (BPN, HoT, and SHoT) is illustrated in Figure 3. From Figures 3 and 4: for the Hepatitis dataset, the accuracy obtained by the SHoT training algorithm is 15% greater than that of BPN and 8% higher than that of HoT, and the accuracy obtained by HoT is 8% higher than that of BPN. For the SPECT Heart dataset, SHoT is 12% more accurate than BPN and 5% more accurate than HoT, and HoT is 7% more accurate than BPN. For the Liver Disorders dataset, SHoT is 8% more accurate than BPN and 5% more accurate than HoT, and HoT is 3% more accurate than BPN. For the Breast Cancer Wisconsin (Diagnostic) dataset, SHoT is 12% more accurate than BPN and 5% more accurate than HoT, and HoT is 7% more accurate than BPN. For the Drug Consumption dataset, SHoT is 8% more accurate than BPN and 3% more accurate than HoT, and HoT is 5% more accurate than BPN. For the Cardiotocography dataset, SHoT is 10% more accurate than BPN and 4% more accurate than HoT, and HoT is 6% more accurate than BPN.
For the Splice-junction Gene Sequences dataset, SHoT is 8% more accurate than BPN and 5% more accurate than HoT, and HoT is 3% more accurate than BPN. For the Thyroid Disease dataset, SHoT is 7% more accurate than BPN and 5% more accurate than HoT, and HoT is 2% more accurate than BPN.

V. CONCLUSION
The simulation-based experimental results demonstrate that the proposed SHoT algorithm outperforms both the HoT and BPN algorithms in terms of training time and efficiency. Regarding training time, the proposed SHoT algorithm decreases the total time it takes to train the network, which in turn increases the training speed. Compared with the current supervised algorithms, HoT and BPN, the accuracy obtained by the proposed SHoT method is improved. The proposed SHoT approach thus increases training performance for any kind of real-world supervised classification task, in both training speed and accuracy, compared with the current algorithms. The proposed SHoT algorithm also provides quicker convergence and lower RMSE values than the HoT algorithm and the standard BP algorithm. The current research can be extended in different directions to originate new learning algorithms, such as incorporating variants of the adaptive skipping training algorithm, applying optimization techniques, injecting fuzzy logic, and so on. The proposed training algorithm can be applied to any NN application.