TanhSoft—Dynamic Trainable Activation Functions for Faster Learning and Better Performance

Deep learning, at its core, contains functions that are compositions of a linear transformation with a nonlinear function known as the activation function. In the past few years, there has been increasing interest in the construction of novel activation functions that result in better learning. In this work, we propose three novel activation functions with learnable parameters, namely TanhSoft-1, TanhSoft-2, and TanhSoft-3, which are shown to outperform several well-known activation functions. For instance, replacing ReLU with TanhSoft-1, TanhSoft-2, and TanhSoft-3 improves top-1 classification accuracy on the CIFAR-100 dataset by 6.06%, 5.75%, and 5.38% respectively on VGG-16 (with batch-normalization) and by 3.02%, 3.25%, and 2.93% respectively on PreActResNet-34, and on the Tiny ImageNet dataset by 1.76%, 1.93%, and 1.82% respectively on WideResNet 28-10. On the object detection task with the SSD 300 model on the Pascal VOC dataset, TanhSoft-1, TanhSoft-2, and TanhSoft-3 outperform ReLU on mean average precision (mAP) by 0.7%, 0.8%, and 0.6% respectively.


I. INTRODUCTION
Artificial neural networks (ANNs) have occupied center stage in deep learning in the recent past. ANNs are made up of several hidden layers, and each hidden layer consists of several neurons. At each neuron, an affine linear map is composed with a nonlinear function known as the activation function. During the training of an ANN, the linear map is optimized, whereas the activation function is usually fixed at the outset, along with the architecture of the ANN. There has been increasing interest in developing a methodical understanding of activation functions, particularly with regard to the construction of novel activation functions and the identification of mathematical properties that lead to better learning [1].
An activation function is considered good if it generalises well on a variety of datasets, ensures faster convergence, and improves neural network performance, leading to more accurate results. At the early stage of deep learning research, researchers used shallow networks (fewer hidden layers) with tanh or sigmoid as activation functions. As the research progressed and deeper networks (multiple hidden layers) came into fashion for more challenging tasks, the Rectified Linear Unit (ReLU) [2], [3] emerged as the most popular activation function. Despite its simplicity, deep neural networks with ReLU have learned many complex and highly nonlinear functions with high accuracy.
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Sharif.
To overcome the shortcomings of ReLU (non-zero mean, missing negative values, unbounded output, to name a few [4]), and to considerably increase accuracy on a variety of tasks in comparison to networks with ReLU, many new activation functions have been proposed over the years. Many of these are variants of ReLU, for example, Leaky ReLU [5], Exponential Linear Unit (ELU) [6], Parametric Rectified Linear Unit (PReLU) [7], Randomized Leaky Rectified Linear Unit (RReLU) [8], Inverse Square Root Linear Unit (ISRLU) [9], and Flexible ReLU (FReLU) [10]. In the recent past, some activation functions constructed from tanh or sigmoid have achieved state-of-the-art results on a variety of challenging datasets. Most notably among such activation functions, Swish [11] has emerged as a close competitor to ReLU. Some of these novel activation functions have shown that introducing hyper-parameters in the argument of the function can yield, for special values of these hyper-parameters, activation functions that outperform those obtained for other values; for example, see [4], [11].
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we propose three trainable activation functions, which we call TanhSoft-1, TanhSoft-2, and TanhSoft-3, and show that activation functions with trainable hyper-parameters outperform several well-known and conventional activation functions, including ReLU and Swish. We use the backpropagation technique so that the network learns the activation parameters (see [7]). A detailed account of the relationships and differences between the proposed activations and previously proposed, widely used activation functions is given in Table 1.

A. MOTIVATION
Swish, GELU, and Mish are a few recently proposed activations that have gained popularity in the deep learning community. They share similar mathematical properties: smoothness, non-linearity, non-monotonicity, and a small, bounded negative output. GELU is a popular activation widely used in natural language processing tasks and was recently used in the GPT-2 [12] architecture for text generation. Swish was found by a group of researchers from Google via automated neural architecture search and has shown promising results compared to ReLU. Mish, recently proposed by Misra, has shown promising results on computer vision problems, especially on the object detection task in the YOLO v4 [13] model. Recently, a novel homomorphic-encryption-friendly nonlinear activation function [14] was proposed by Obla et al., based on a polynomial approximation of Softplus [15]. Motivated by these activation functions, we are interested in constructing activations which share similar properties while performing better on a wide range of deep learning problems (such as image classification, object detection, semantic segmentation, and machine translation) across different datasets and models, when compared to widely used activations like ReLU, Swish, GELU, and Mish. We started with a few functions like x tanh(b e^(cx)), tanh(ax) ln(1 + e^x), ln(1 + e^x tanh(cx)), tanh(ax + b e^(cx)) ln(1 + e^x), x tanh(ax + b e^(cx)), and tanh(b e^(cx)) ln(1 + e^x) (most of these functions can be generated from the generalized functional form tanh(ax + b e^(cx)) ln(d + e^x), which we constructed as a function generator), and conducted experiments on the MNIST [16], CIFAR10 [17], and CIFAR100 [17] databases with the ResNet-34 [18], VGG-16 [19], DenseNet [20], MobileNet V2 [21], and Inception V3 [22] models.
We found that x tanh(b e^(cx)), tanh(ax) ln(1 + e^x), and ln(1 + e^x tanh(cx)) performed remarkably well compared to ReLU and Swish, while the other functions extracted from the generator either failed to outperform ReLU or performed similarly to it. We therefore decided to investigate these three functional forms further as activations and ran experiments on several standard benchmarking datasets as well as other public datasets, across different deep learning problems (more detailed experiments with these three activations are given in the experiment section).
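As a concrete illustration, the function generator and the three retained candidate forms can be sketched in a few lines of Python (a minimal scalar sketch; the function names are ours, and in practice these would be implemented as vectorized tensor operations):

```python
import math

def generator(x, a, b, c, d):
    """The generalized functional form tanh(a*x + b*e^(c*x)) * ln(d + e^x)."""
    return math.tanh(a * x + b * math.exp(c * x)) * math.log(d + math.exp(x))

# tanh(a*x) * ln(1 + e^x) is the special case b = 0, d = 1 of the generator.
def tanhsoft1(x, a):
    return generator(x, a, 0.0, 0.0, 1.0)

# The other two retained candidates are computed directly:
def tanhsoft2(x, b, c):
    # x * tanh(b * e^(c*x))
    return x * math.tanh(b * math.exp(c * x))

def tanhsoft3(x, c):
    # ln(1 + e^x * tanh(c*x))
    return math.log(1.0 + math.exp(x) * math.tanh(c * x))
```

For example, `tanhsoft1(x, 1.0)` reduces to tanh(x)·softplus(x), while setting `b = 0` in `tanhsoft2` gives the zero function.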

II. RELATED WORKS
Finding activation functions that improve the performance of neural network models is an active field of research, and it is always hard to identify the best activation function. In earlier days, tanh and sigmoid were mostly used as activations in networks. ReLU [2] was first proposed by Nair and Hinton in 2010, and since then it has been the most widely used activation function in neural network models due to its simplicity. ReLU produces a positive output for positive inputs and zero for negative inputs, and because of this, ReLU suffers from a vanishing gradient problem known as dying ReLU [5]. Several activation functions have been suggested by researchers to overcome this problem. Leaky ReLU [5] was proposed with a small negative linear component for negative inputs, and it shows promising results compared to ReLU. PReLU [7] was introduced as a modification of Leaky ReLU, adding a trainable linear part for negative inputs. Later, RReLU [8], ISRLU [9], FReLU [10], PELU [23], SiLU [24], ELU [6], and GELU [25] were proposed, and they have improved model performance. Mish [26], which was introduced recently, has shown some improvement over ReLU and Swish [11].
Most of the functions mentioned above, except PReLU, PELU, and FReLU, are non-trainable. A trainable activation function contains trainable parameter(s), which are tuned via backpropagation. In the early 1990s and 2000s, during the pre-ReLU era, a few trainable activations were proposed, such as the adjustable generalized hyperbolic tangent [27] and the sigmoidal selector [28]. Later, Leaky ReLU, ELU, and ReLU were modified into PReLU, PELU, and FReLU, respectively, by introducing trainable parameter(s). More recently, in 2017, Swish [11], a trainable activation, was found using exhaustive search [29] and reinforcement learning techniques [30]; it has attracted a lot of attention from the deep learning community due to its simplicity and efficiency.

III. TANHSOFT-1, TANHSOFT-2, AND TANHSOFT-3 & THEIR PROPERTIES
The standard ANN training process involves tuning the weights in the linear part of the network; however, there is merit in the ability to custom-design activation functions to better fit the problem at hand. Real-world datasets are noisy and challenging, and it is always difficult to construct the best activation function to generalize on arbitrary datasets; it is hard to say in advance whether an activation function will generalize successfully and replace ReLU on challenging or noisy data. Though there may be merit in having a custom activation function for each problem, it is beneficial to identify activation functions that generalize to several real-world datasets, making them easier to adopt. Hence we concentrate on three activation functions, namely TanhSoft-1, TanhSoft-2, and TanhSoft-3, and establish their generalizability and usefulness over other conventional activation functions. In what follows, we discuss the properties of these activations, experiments with complex models, and a comparison with a few other widely used activation functions.
TanhSoft-1, TanhSoft-2, and TanhSoft-3 are defined as

F1(x; α) = tanh(αx) ln(1 + e^x),  (1)
F2(x; β, γ) = x tanh(β e^(γx)),  (2)
F3(x; δ) = ln(1 + e^x tanh(δx)).  (3)

The corresponding derivatives are

F1'(x; α) = α sech^2(αx) ln(1 + e^x) + tanh(αx) e^x / (1 + e^x),  (4)
F2'(x; β, γ) = tanh(β e^(γx)) + βγx e^(γx) sech^2(β e^(γx)),  (5)
F3'(x; δ) = (e^x tanh(δx) + δ e^x sech^2(δx)) / (1 + e^x tanh(δx)).  (6)

Figures 1, 2, and 3 show the graphs of the F1(x; α), F2(x; β, γ), and F3(x; δ) activation functions for different values of α; of β and γ; and of δ respectively. Plots of the first derivatives of F1(x; α), F2(x; β, γ), and F3(x; δ) are given in Figures 5, 6, and 7, again for different values of the respective parameters. A comparison between F1(x; α), F2(x; β, γ), F3(x; δ) and Swish, and between their first derivatives, is given in Figures 4 and 8. The author of [26] has reported unstable training behaviour for a specific function which can be obtained from TanhSoft-1; however, we tested for and did not observe any such instability. Also, in [31] the authors have mentioned a special case which can be obtained from TanhSoft-2.
The three functions have non-monotonic curvature on the negative axis. The hyper-parameters, α for TanhSoft-1, β and γ for TanhSoft-2, and δ for TanhSoft-3, play a major role and control the slope of the curve on both the positive and negative axes, as is evident from Figures 1, 2, and 3. Like Swish, F1(x; α), F2(x; β, γ), and F3(x; δ) are smooth, non-monotonic activation functions that are bounded below. F1(x; α), F2(x; β, γ), and F3(x; δ) become the zero function for α = 0, β = 0, and δ = 0 respectively, while F2(x; β, 0) becomes the linear function family tanh(β)x. For large values of one parameter, with the other parameters fixed, the proposed functions converge point-wise to some known activation functions. For example, F2(x; β, γ) → max(x, 0) as γ → ∞, ∀x ∈ R, for any fixed β > 0. (7) Also, the class of neural networks with TanhSoft-1, TanhSoft-2, or TanhSoft-3 as the activation function is dense in C(K), where K is a compact subset of R^n and C(K) is the space of all continuous functions over K (see [32]).
The proof follows from the following proposition as all three proposed activations are non-polynomial.
Proposition (Theorem 1.1 in Kidger and Lyons, 2019 [33]): Let ρ : R → R be any continuous function. Let N_ρ^n represent the class of neural networks with activation function ρ, with n neurons in the input layer, one neuron in the output layer, and one hidden layer with an arbitrary number of neurons. Let K ⊆ R^n be compact. Then N_ρ^n is dense in C(K) if and only if ρ is non-polynomial.
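The closed-form properties stated earlier in this section (the zero function at α = 0, β = 0, δ = 0; the linear family tanh(β)x; and the point-wise ReLU limit of equation (7)) can be checked numerically. The following is a minimal scalar sketch with illustrative parameter values:

```python
import math

def f1(x, alpha):
    # TanhSoft-1: tanh(alpha*x) * ln(1 + e^x)
    return math.tanh(alpha * x) * math.log(1.0 + math.exp(x))

def f2(x, beta, gamma):
    # TanhSoft-2: x * tanh(beta * e^(gamma*x))
    return x * math.tanh(beta * math.exp(gamma * x))

def f3(x, delta):
    # TanhSoft-3: ln(1 + e^x * tanh(delta*x))
    return math.log(1.0 + math.exp(x) * math.tanh(delta * x))

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]

# Zero function at alpha = 0, beta = 0, delta = 0.
assert all(f1(x, 0.0) == 0.0 for x in xs)
assert all(f2(x, 0.0, 0.7) == 0.0 for x in xs)
assert all(f3(x, 0.0) == 0.0 for x in xs)

# F2(x; beta, 0) is the linear family tanh(beta) * x.
assert all(abs(f2(x, 0.75, 0.0) - math.tanh(0.75) * x) < 1e-12 for x in xs)

# Point-wise ReLU limit: F2(x; beta, gamma) -> max(x, 0) as gamma grows.
for x in xs:
    assert abs(f2(x, 0.75, 50.0) - max(x, 0.0)) < 1e-6
```

Already at γ = 50 the function is numerically indistinguishable from ReLU on moderate inputs, which illustrates why the trainable parameters let the network interpolate between ReLU-like and smoother regimes.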
We have considered several models and datasets to measure the performance of F1(x; α), F2(x; β, γ), and F3(x; δ), and have compared them with seven widely used baseline activation functions. A brief description of the baselines follows:
• Rectified Linear Unit (ReLU): The rectified linear unit (ReLU) activation function was first introduced by Nair and Hinton [2] and Hahnloser et al. [3], and it is one of the most widely used activation functions. ReLU suffers from a vanishing gradient problem known as dying ReLU.
ReLU is defined as f(x) = max(x, 0).
• Leaky Rectified Linear Unit: Leaky Rectified Linear Unit (Leaky ReLU) was proposed by Maas et al. [5]. Leaky ReLU introduces a small non-zero gradient on the negative axis to overcome the vanishing gradient and dead neuron problems of ReLU. Leaky ReLU is defined as f(x) = x for x ≥ 0 and f(x) = 0.01x for x < 0.
• Exponential Linear Unit: The Exponential Linear Unit (ELU) was proposed by Clevert et al. [6]. ELU is defined as f(x) = x for x ≥ 0 and f(x) = α(e^x − 1) for x < 0, where α is a hyper-parameter.
• Softplus: Softplus was proposed by Zheng et al. [15] and Dugas et al. [35]. It is a smooth activation function with non-zero gradient, defined as f(x) = ln(1 + e^x).
• Swish: Swish [11] is defined as f(x) = x · sigmoid(βx), where β can be a constant or a trainable parameter.
• GELU: GELU was introduced by Hendrycks and Gimpel [25] and is defined as f(x) = x · Φ(x), where Φ is the standard Gaussian cumulative distribution function.
• Mish: Mish was introduced recently by Misra [26] and is defined as f(x) = x tanh(ln(1 + e^x)).
In the following subsections, we provide experimental results for TanhSoft-1, TanhSoft-2, and TanhSoft-3 against the baseline activation functions, namely ReLU, Leaky ReLU, ELU, Softplus, Swish, GELU, and Mish, on different deep learning problems: image classification, object detection, semantic segmentation, and machine translation, together with details of our experimental setup and framework. We initialized α = 0.87 for TanhSoft-1, β = 0.75 and γ = 0.75 for TanhSoft-2, and δ = 0.85 for TanhSoft-3 (see [7]) and updated these hyper-parameter values via backpropagation during training as mentioned in equation (8). All experiments were conducted on an NVIDIA Tesla V100 GPU with 16GB RAM.
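To illustrate how a trainable activation parameter is updated via backpropagation, the following toy sketch fits the α of a single TanhSoft-1 unit by gradient descent on a few hypothetical data points. The data, learning rate, and step count are purely illustrative and not from our experiments; only the initialization α = 0.87 matches our setup.

```python
import math

def f1(x, alpha):
    # TanhSoft-1: tanh(alpha*x) * ln(1 + e^x)
    return math.tanh(alpha * x) * math.log(1.0 + math.exp(x))

def df1_dalpha(x, alpha):
    # d/d(alpha) of tanh(alpha*x) is x * sech^2(alpha*x); the softplus
    # factor ln(1 + e^x) does not depend on alpha.
    sech2 = 1.0 - math.tanh(alpha * x) ** 2
    return x * sech2 * math.log(1.0 + math.exp(x))

# Hypothetical (input, target) pairs for a single-unit illustration.
data = [(-1.0, 0.0), (0.5, 0.4), (1.5, 1.3)]

alpha, lr = 0.87, 0.1  # alpha initialized as in our experiments
for _ in range(500):
    # Gradient of the squared loss w.r.t. alpha, via the chain rule.
    grad = sum(2.0 * (f1(x, alpha) - y) * df1_dalpha(x, alpha)
               for x, y in data)
    alpha -= lr * grad

loss = sum((f1(x, alpha) - y) ** 2 for x, y in data)
```

In a real network the same chain rule is applied by the autodiff engine, and the per-layer α receives gradient contributions summed over all activations of that layer.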

1) MNIST
The MNIST [16] database contains image data of handwritten digits from 0 to 9. The dataset contains 60k training and 10k testing 28 × 28 grey-scale images. An 8-layer customised homogeneous convolutional neural network (CNN) architecture with 3 × 3 kernels for the convolutional layers and 2 × 2 kernels for the pooling layers is used. We use channel depths of 128 (twice), 64 (thrice), and 32 (twice), a dense layer of size 128, max-pooling layers (thrice), batch normalization [39], and dropout [40] in the custom CNN architecture. No data augmentation is used. The results are reported in Table 2.

2) FASHION MNIST
Fashion-MNIST [36] is a popular computer vision database consisting of 28 × 28 grey-scale images of ten fashion items, one per class. It has 60k training images and 10k testing images and provides a more challenging classification problem than MNIST. No data augmentation is used. We consider the same CNN architecture used for the MNIST dataset for training and testing on this database as well, and the results are reported in Table 3.

3) THE STREET VIEW HOUSE NUMBERS (SVHN) DATABASE
SVHN [37] is a popular image database consisting of 32 × 32 RGB images of real-world house numbers from Google Street View. The database has 73257 training images and 26032 testing images across a total of 10 classes. We consider the same CNN architecture used for the MNIST dataset for training and testing on this database as well, and the results are reported in Table 4. Data augmentation is used on this database.

5) TINY IMAGENET
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the most popular benchmarks for image classification. The Tiny ImageNet challenge is a similar image classification challenge with a smaller dataset and fewer image classes. The database has 64 × 64 images in 200 classes, with a training set of 100,000 images, a validation set of 10,000 images, and a test set of 10,000 images; each class has 500 training images, 50 validation images, and 50 test images. We report the top-1 accuracy as a mean of 5 runs in Table 9 on the WideResNet 28-10 (WRN 28-10) [43] model. The network is trained with the He Normal initializer [7], a batch size of 32, the Adam optimizer [56], a 0.2 dropout rate [40], and an initial learning rate (lr) of 0.01 that is reduced by a factor of 10 after every 50 epochs, up to 250 epochs. Data augmentation is used on this database.

B. OBJECT DETECTION
Object detection is an important problem in computer vision. We consider the Pascal VOC dataset [57] for our experiments. Results are reported on the Single Shot MultiBox Detector (SSD) 300 model [58], with VGG-16 (with batch-normalization) as the backbone network. The model is trained on the Pascal VOC 07+12 training data, and performance is evaluated on the Pascal VOC 2007 test data. The model is trained with a batch size of 8, a weight decay of 5e-4, 120000 iterations, a learning rate of 0.001, and the SGD optimizer [59], [60] with 0.9 momentum. No pre-trained weights are used in the network. The mean average precision (mAP) over 5 different runs is reported in Table 10.
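For reproducibility, the training setup above can be summarized as a configuration fragment. This is a hypothetical sketch: the dictionary key names are illustrative and not tied to any specific SSD codebase.

```python
# Illustrative configuration mirroring the SSD300 training setup
# described above (key names are our own, not from any framework).
ssd300_voc_config = {
    "model": "SSD300",
    "backbone": "VGG-16 (batch-normalized)",
    "train_data": "Pascal VOC 07+12",
    "eval_data": "Pascal VOC 2007 test",
    "batch_size": 8,
    "weight_decay": 5e-4,
    "iterations": 120000,
    "learning_rate": 1e-3,
    "optimizer": "SGD",
    "momentum": 0.9,
    "pretrained_weights": None,  # no pre-trained weights are used
}
```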

C. SEMANTIC SEGMENTATION
Semantic segmentation is a very important problem in computer vision. We report experimental results on the CityScapes dataset [61]. A U-Net model [62] is trained on the CityScapes training data for 250 epochs with the Adam optimizer [56], a batch size of 32, the Xavier uniform initializer [63], and a learning rate of 5e-3. Pixel accuracy and mean intersection-over-union (mIOU) on the test data, averaged over 5 different runs, are reported in Table 11.

D. MACHINE TRANSLATION
Machine translation is the task of translating text from one language to another with deep learning models. For this problem, the WMT 2014 English→German dataset is used; it has 4.5 million training sentence pairs, and model performance is evaluated on the newstest2014 dataset using the BLEU score metric. We consider an attention-based multi-head transformer model [64] for our experiments. An 8-head transformer model is used with 0.1 dropout [40] and the Adam optimizer [56], trained for 100000 steps. Other hyper-parameters are kept similar to those in the original paper [64]. The mean of 5 runs on the test dataset (newstest2014) is reported in Table 12.

E. COMPARISON WITH BASELINES
Based on all the experiments reported in the earlier sections, we observe that TanhSoft-1, TanhSoft-2, and TanhSoft-3 beat or perform equally well against the baseline activation functions in most cases and under-perform marginally only on rare occasions; a detailed comparison of the proposed activations with the baseline activations is provided in Table 13. Table 13 reports the total number of cases in which the proposed activations perform better than, equal to, or worse than each baseline activation. The proposed activations outperform ReLU, Leaky ReLU, ELU, and Softplus in terms of model performance in all cases. Compared to Swish, Mish, and GELU, the proposed activations outperform in most cases while under-performing on infrequent occasions.
TABLE 13. Baseline table for TanhSoft-1, TanhSoft-2, and TanhSoft-3 based on all the experiments. The numbers represent the total number of models in which TanhSoft-1, TanhSoft-2, and TanhSoft-3 outperform, equal, or underperform when compared to baseline activation functions.

F. COMPUTATIONAL TIME COMPARISON
We report a computational time comparison between TanhSoft-1, TanhSoft-2, and TanhSoft-3 and the baseline activation functions for both the forward and backward pass on a 32 × 32 RGB image in the ResNet-34 model in Table 14, as the mean of 100 runs. All runs are performed on an NVIDIA Tesla V100 GPU with 16GB RAM. The computational time for both the forward and backward pass is reported in milliseconds (ms). From Table 14, we notice that, due to the non-linearity of the proposed activations, the computational time for both the forward and backward pass is slightly higher than for ReLU, while it is similar to Mish and better than GELU. Thus there is a trade-off between state-of-the-art model performance and computational time. From the experimental section, we notice that, compared to ReLU networks, networks with the proposed activations improve model performance significantly while increasing computational time only marginally.
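For reference, the shape of such a benchmark can be sketched with Python's timeit on scalar implementations. This is illustrative only: it measures the CPU cost of the closed-form expressions, not the GPU forward and backward passes through ResNet-34 reported in Table 14.

```python
import math
import timeit

def relu(x):
    return x if x > 0.0 else 0.0

def tanhsoft2(x, beta=0.75, gamma=0.75):
    # TanhSoft-2 with the paper's initial parameter values.
    return x * math.tanh(beta * math.exp(gamma * x))

# A fixed batch of sample inputs in a moderate range.
xs = [(-1.0) ** i * (i % 7) * 0.31 for i in range(1024)]

# Time 200 evaluations of each activation over the batch.
t_relu = timeit.timeit(lambda: [relu(x) for x in xs], number=200)
t_ts2 = timeit.timeit(lambda: [tanhsoft2(x) for x in xs], number=200)
```

On a typical CPU, the tanh/exp composition costs noticeably more per element than ReLU's comparison, mirroring the qualitative gap in Table 14; on GPUs with fused element-wise kernels the relative overhead is much smaller.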

V. CONCLUSION
We have explored three novel trainable activation functions in this work: TanhSoft-1, TanhSoft-2, and TanhSoft-3. The proposed functions are zero-centred, non-monotonic, continuous, and differentiable, with a bounded, non-zero negative part. Initially, we conducted experiments with the three activations with constant hyper-parameters and found that they perform equally well or slightly better than ReLU. Later, we tuned the hyper-parameters via backpropagation, making the proposed activations trainable. In this case, we found a considerable change in results (top-1 accuracy, mAP, mIOU, or BLEU score), and they perform far better than ReLU and the other baseline activations in most of the experiments. This shows that introducing trainable parameters plays an essential role in activation functions, and that a bounded non-zero negative part together with trainable parameters results in better performance. We used hyper-parameters and models tuned for the ReLU activation function and then replaced ReLU with the other baseline activations and the proposed activations to compare model performance. Our empirical evaluation on different deep learning tasks (image classification, object detection, semantic segmentation, and machine translation) with a variety of complex models on datasets such as MNIST, Fashion MNIST, SVHN, CIFAR10, CIFAR100, Tiny ImageNet, Pascal VOC, CityScapes, and WMT 2014 shows that the proposed activation functions produce state-of-the-art results and have excellent potential to replace widely used activation functions like ReLU, Leaky ReLU, ELU, Softplus, Swish, Mish, and GELU.