Developing Novel Activation Functions Based Deep Learning LSTM for Classification

This study proposes novel Long Short-Term Memory (LSTM)-based classifiers by developing the internal structure of LSTM neural networks, using 26 state activation functions as alternatives to the traditional hyperbolic tangent (tanh) activation function. LSTM networks perform well at mitigating the vanishing gradient problem observed in recurrent neural networks. Performance investigations were carried out using three distinct deep learning optimization algorithms to evaluate the efficiency of the proposed state activation function-based LSTM classifiers on two different classification tasks. The simulation results demonstrate that the proposed classifiers that use the Modified Elliott, Softsign, Sech, Gaussian, Bitanh1, Bitanh2, and Wave state activation functions outperform the tanh-based LSTM classifiers in terms of classification accuracy. The proposed classifiers are recommended for use and evaluation on other classification tasks.

Hochreiter and Schmidhuber proposed the long short-term memory network (LSTM), a recurrent neural network (RNN) architecture that has been demonstrated to be successful for various learning problems, particularly those requiring sequential data [4]. The LSTM architecture consists of blocks, which are combinations of recurrently connected units [5]. The vanishing/exploding gradient problem occurs when the gradient of an RNN's error function decreases or increases exponentially over time. The development of new LSTM techniques, structures, and activation functions improves convergence to greater accuracy during deeper network training, overcoming the vanishing/exploding gradient problem [6]. LSTM has become popular in a variety of applications in recent years [7].

In an LSTM network, each neuron is replaced by a memory unit, which contains an actual neuron with a recurrent self-connection. The gate activation function (sigmoid) and the state activation function (tanh) are the two most common activation functions for the neurons in memory units [8]. The hyperbolic tangent (tanh) is the state activation function of LSTM networks; it is used to determine candidate cell state (internal state) values and to update the hidden state.

In previous research, [5] and [17] carried out comparison studies in which the performance of an LSTM network was evaluated when different activation functions were switched. Both pieces of research arrived at the same conclusion: switching the activation functions affects the way the network operates. Although the sigmoid function, the typical activation function in the sigmoidal gates, gives remarkable performance, other, less-recognized activation functions have been found to provide more accurate performance. In addition, the authors of [5] compared exactly 23 different activation functions, in which the three gates (the input, output, and forget gates) had their activation functions changed while the block input and block output activation functions were held constant at the hyperbolic tangent (tanh). They recommended altering the hyperbolic tangent function on the block input and block output as a better alternative to altering the activation functions in the three gates, and they suggested additional research on the effect of this modification on other components of an LSTM network.

Elsayed et al. [33] described how different activation functions have been applied to more complicated LSTM-based neural networks in areas other than recommendation systems in order to improve performance; the activation functions of LSTM blocks were investigated there in detail.

Song and Brogärd [9] tested the performance of four distinct activation functions (hyperbolic tangent, sigmoid, ELU, and SELU) in LSTM neural networks to see which one was the most effective. They showed that the hyperbolic tangent and sigmoid functions were much better than ELU and SELU at making predictions for movie recommendation systems.

Burhani et al. [22] reached a similar conclusion in their study on denoising autoencoders, namely that the modified Elliott activation function had better performance and smaller error than the log-sigmoid activation function. Furthermore, in our first set of experiments we found that Cloglogm provided the best activation, which is similar to the findings of Gomes et al. [17].

The remainder of this paper is organized as follows. Section II provides the LSTM architecture and the activation functions. Section III presents the methodology. Section IV presents the simulation results and discussion. Section V concludes the paper.

In the next sections, we briefly describe the LSTM architecture and the activation functions used in the network. In each block, the elements are determined by equations (1) through (6); a standard formulation is sketched at the end of this subsection.

The cell state contains information that is transferred from one LSTM block to the next, while the output of a cell is referred to, in more explicit terms, as the hidden state. The hidden state is represented in Figure 2 by the output of the cell together with the pointwise operation from the output gate. Thanks to the use of controlled structures known as gates, the LSTM has the capability of removing or adding information to the cell state and the hidden state. The gates are made up of a sigmoid neural network layer and a pointwise multiplication operation. The sigmoid layer, represented by the round circle in the illustration, generates values ranging from zero to one; these values represent the amount of information that will pass through the gate [19].
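For reference, a standard formulation of the six block equations is sketched below, assuming the conventional LSTM block with forget gate f_t, input gate i_t, output gate o_t, gate activation sigma, and state activation tanh; the symbols follow common textbook usage and may differ slightly from the notation of equations (1) through (6) in this paper:

\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}

The proposed classifiers replace the two tanh applications (the candidate cell state \tilde{c}_t and the hidden-state update h_t) with one of the 26 candidate state activation functions, while the gate activation \sigma remains the sigmoid or hard-sigmoid function.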

An activation function is a function that is introduced into an artificial neural network to assist the network in learning complex patterns in the data and to introduce non-linearity into the network. When compared to the neuron-based model found in our brains, the activation function is found at the end of the cell, deciding what is fired to the next neuron. When an activation function is used in an ANN cell, the output signal from the previous cell is received and converted into a form that can be used as an input signal for the next cell.
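As a small illustration (a NumPy sketch with made-up weights, not code from this study), the activation function is the non-linear step applied to a cell's weighted input before the result is handed on:

import numpy as np

x = np.array([0.4, -1.2, 0.7])   # output signal received from the previous cell
W = np.array([0.5, -0.3, 0.8])   # illustrative weights
b = 0.1                          # illustrative bias

z = W @ x + b                    # purely linear combination of the inputs
h = np.tanh(z)                   # the activation squashes z into (-1, 1)
print(z, h)                      # h is the signal passed on as input to the next cell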

A poor selection of activation functions can result in the loss of input data as well as vanishing or exploding gradients in the neural network. Neural networks have three key components that influence their performance: the network architecture and the pattern of connections between units, the learning algorithm, and the activation functions utilized in the network. Each of these aspects has a significant impact on network performance [13]. The majority of neural network research has concentrated on the value of the learning algorithm, whereas the importance of the activation functions has received less attention. The activation functions investigated here are summarized in Table 1, and we also compare their impact. The sigmoid function is given by [21]

\sigma(x) = \frac{1}{1 + e^{-x}}.
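To make the vanishing-gradient concern concrete, the short NumPy sketch below (illustrative only, not part of the study's experiments) evaluates the sigmoid and tanh derivatives at increasingly large inputs; both derivatives collapse toward zero as the functions saturate, which is the effect that repeated backpropagation through time amplifies into vanishing gradients:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    sig_grad = s * (1.0 - s)           # derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    tanh_grad = 1.0 - np.tanh(x) ** 2  # derivative of tanh: 1 - tanh(x)^2
    print(f"x={x:5.1f}  sigmoid'={sig_grad:.6f}  tanh'={tanh_grad:.6f}")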
In Table 1, we have produced a comprehensive list of 26 such functions, which are described further below. We observed experimentally that by modifying some functions by a factor of 0.5 they become usable as activation functions in the network; such alterations of the range of activation functions have been made in various previous studies [22]. In Table 1, the first activation function is the wave function proposed by Hara and Nakayama
[23]. The second is the Softsign function proposed by [24], and the third is the Aranda-Ordaz function introduced by Gomes et al., labeled as Aranda [16]. The fourth to seventh functions are the bimodal activation functions proposed by Singh et al. [25], labeled Bi-sig1, Bi-sig2, Bi-tanh1, and Bi-tanh2, respectively. The next functions are Cloglog and its modified version, Cloglogm [17]. Next come the Elliott, Gaussian, and logarithmic functions, and the 13th function is the complementary log-log [26].

During training, the network weights and biases are adjusted to minimize the loss function. Learning in deep neural networks can be described as an optimization problem that seeks to find a global optimum through a reliable training trajectory and fast convergence using gradient descent algorithms [19]. Choosing the optimal optimization approach for a specific scientific problem is a serious challenge: an inappropriate optimization approach may cause the network to remain in a local minimum during training, so that no advance is achieved in the learning process. Hence, an investigation is necessary to analyze the performance of different optimizers on the employed dataset in order to obtain the best LSTM-based classifiers among the proposed ones.

In this study, we employed the Japanese Vowels dataset for the first set of trials. The original Japanese Vowels dataset from the University of California, Irvine machine learning repository is multivariate time-series data in which nine male speakers pronounced two Japanese vowels (/ae/) in succession. A 12-degree linear prediction analysis (sampling rate: 10 kHz, frame length: 25.6 ms, shift length: 6.4 ms) was performed to obtain a discrete-time series with 12 LPC cepstrum coefficients. In other words, each utterance made by a speaker results in a multivariate time series of 12 LPC cepstrum coefficients.

Table 3 and Table 4 list the true classification accuracy percentages for each activation function-based LSTM classifier on the Japanese Vowels task when trained with the Adam optimizer and the sigmoid and hard-sigmoid gate activation functions, respectively. From Table 3, activation function-based LSTM classifiers achieve their highest accuracy using 100 hidden neurons rather than 20 or 50. In total, 19 LSTM-based classifiers perform accurate classification with an accuracy in the range of 90-97.5676% at 100 hidden neurons, in addition to the tanh-based LSTM classifier, which achieves an accuracy of 93.2432%. The tabulated results demonstrate that 12 of the 19 proposed LSTM-based classifiers outperform the tanh-based LSTM classifier, and the best of all is the wave-based LSTM classifier with 97.5676% accuracy. Figure 3 displays the accuracy and loss curves obtained from the learning processes of the conventional tanh-based LSTM classifier and the proposed wave-based LSTM classifier with the highest accuracy. Table 4 lists the accuracy percentages for all examined classifiers when a hard-sigmoid gate activation function is used in place of the sigmoid function; 21 LSTM-based classifiers perform accurate classification with accuracy in the range of 92-97.0270% at 100 hidden neurons. Table 5 and Table 6 report the corresponding results for the RMSprop optimizer, and Table 7 and Table 8 report those for the SGDM optimizer.
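Before turning to the per-optimizer comparisons, commonly cited forms of a few of the candidate state activation functions from Table 1 are sketched below in NumPy. These are textbook forms assumed for illustration only; the exact definitions and the 0.5-factor adjustments used in this study are those given in Table 1 and may differ:

import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))       # Softsign

def gaussian(x):
    return np.exp(-x ** 2)             # Gaussian

def sech(x):
    return 1.0 / np.cosh(x)            # hyperbolic secant (Sech)

def modified_elliott(x):
    return x / np.sqrt(1.0 + x ** 2)   # one common form of the modified Elliott function

# Any of these can stand in for tanh as the LSTM state activation, e.g. the
# candidate cell state becomes g = softsign(W_g @ x + U_g @ h_prev + b_g).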

The per-optimizer comparisons below consider the state activation functions-based LSTM classifiers that use the sigmoid and hard-sigmoid gate activation functions, respectively, and are trained by employing the Adam, RMSprop, and SGDM optimizers with 100 hidden-unit structures.
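The paper does not state which software framework was used for these experiments; the following Keras sketch is only an illustration of how such configurations can be expressed, with the state activation, gate activation, and optimizer exposed as parameters (the data shapes, epochs, and batch size are placeholder assumptions, not the study's settings):

import tensorflow as tf

def build_classifier(state_activation, gate_activation, optimizer,
                     n_features, n_classes, hidden_units=100):
    # hidden_units=100 mirrors the 100-hidden-unit structures compared in the text
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(hidden_units,
                             activation=state_activation,           # state activation (tanh by default)
                             recurrent_activation=gate_activation,  # gate activation (sigmoid or hard_sigmoid)
                             input_shape=(None, n_features)),       # variable-length multivariate sequences
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example configurations mirroring the comparisons in the text; 12 features and
# 9 classes are assumed here for a Japanese Vowels-style speaker classification task.
softsign_adam = build_classifier("softsign", "hard_sigmoid", tf.keras.optimizers.Adam(), 12, 9)
tanh_sgdm = build_classifier("tanh", "sigmoid", tf.keras.optimizers.SGD(momentum=0.9), 12, 9)  # SGDM

# Training would expose all data in mini-batches at each epoch, e.g.:
# softsign_adam.fit(X_train, y_train, epochs=50, batch_size=27)  # placeholder values

Custom functions such as the wave or modified Elliott activations would be passed as Python callables in place of the built-in activation names.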

By employing the Adam optimizer, the wave-based LSTM classifier outperforms the tanh-based LSTM classifier, achieving a correct classification accuracy of 97.5676%, whereas the latter achieved 93.4054%; the wave-based LSTM classifier is also the best among the proposed classifiers. Using the RMSprop optimizer, the wave-based LSTM classifier again outperforms the tanh-based LSTM classifier, reaching 96.4865% classification accuracy versus 93.4054% for the latter, and it is again the best among the suggested classifiers.

By using the SGDM optimizer, the Modified Elliott-based LSTM classifier outperforms the tanh-based LSTM classifier, attaining 95.9459% classification accuracy versus 94.3649%. Fig. 10 shows that the Modified Elliott-based LSTM classifier is the best.

Table 10 and Table 11 list the true classification accuracy percentages for each activation function-based LSTM classifier for Weather Reports Classification using the Adam optimization algorithm with the sigmoid and hard-sigmoid gate activation functions, respectively. All the training data is exposed to the classifier in mini-batches at each epoch. Since tanh is the default state activation function in the LSTM structure, the accuracies achieved by the tanh-based LSTM classifiers are taken as the reference for comparison.

From Table 10, activation function-based LSTM classifiers achieve their highest accuracy using 100 hidden neurons rather than 20 or 50. Nineteen LSTM-based classifiers perform accurate classification with accuracy in the range of 84-88.04% at 100 hidden neurons, in addition to the tanh-based LSTM classifier, which achieves an accuracy of 86.1%.

Fig. 12 shows the accuracy and loss curves obtained from the learning processes of the conventional tanh-based LSTM classifier and the proposed Gaussian-based LSTM classifier with the highest accuracy. The overall performance of the proposed state activation functions-based LSTM classifiers with a hard-sigmoid gate activation function is better than that of those using the sigmoid gate activation function.

Table 12 and Table 13 list the true classification accuracy percentages for each activation function-based LSTM classifier for Weather Reports Classification using the RMSprop optimization algorithm with the sigmoid and hard-sigmoid gate activation functions, respectively. All the training data is exposed to the classifier in mini-batches at each epoch. Table 13 lists the accuracy percentages for all examined classifiers under the hard-sigmoid gate activation function. The accuracy and loss curves obtained from the learning processes of the conventional tanh-based LSTM classifier and the proposed Modified Elliott-based LSTM classifier with the highest accuracy are also shown. The overall performance of the proposed state activation function-based LSTM classifiers with a hard-sigmoid gate activation function is again better than that of those using the sigmoid gate activation function.

Table 14 and Table 15 list the true classification accuracy percentages for each activation function-based LSTM classifier for Weather Reports Classification using the SGDM optimization algorithm with the sigmoid and hard-sigmoid gate activation functions, respectively. All the training data is exposed to the classifier in mini-batches at each epoch. Since tanh is the default state activation function in the LSTM structure, the accuracies achieved by the tanh-based LSTM classifiers are taken as the reference for comparison. From Table 14 and Table 15, all activation function-based LSTM classifiers achieve weak results compared to the other optimization algorithms (Adam and RMSprop) for all numbers of hidden neurons.

As shown in Figure 15, when the Adam optimizer is used, the Softsign-based LSTM classifier outperforms the tanh-based LSTM classifier, achieving a correct classification accuracy of 88.048%, whereas the latter achieved 86.1925%. The Softsign-based LSTM classifier is also the best among the proposed classifiers.