New RNN Activation Technique for Deeper Networks: LSTCM Cells

Long short-term memory (LSTM) has shown good performance when used with sequential data, but the gradient vanishing or exploding problem can arise, especially when deeper layers are used to solve complex problems. Thus, in this paper, we propose a new LSTM cell, termed long short-time complex memory (LSTCM), that applies an activation function to the cell state instead of the hidden state for better convergence in deep layers. Moreover, we propose a sinusoidal function as the activation function for LSTM and the proposed LSTCM instead of the hyperbolic tangent activation function. The performance capabilities of the proposed LSTCM cell and the sinusoidal activation function are demonstrated through experiments on various natural language benchmark datasets, in this case the Penn Treebank, IWSLT 2015 English-Vietnamese, and WMT 2014 English-German datasets.


I. INTRODUCTION
Recently, deep learning approaches including feed-forward networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) have shown good performance in many fields. RNNs perform especially well when applied to sequential problems such as video description [1], [2], speech recognition [3], [4], neural machine translation [5]-[7], sentiment classification from text [8], and detection from multidimensional data [9]. An RNN is a recurrent network that uses the hidden state of the previous time step as input for the current time step t as follows:

h_t = λ(W_x x_t + W_h h_{t-1} + b)    (1)

where λ is the activation function; x_t and h_t are the input and hidden state at time step t; and W_x, W_h, and b are trainable weights.
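As a concrete illustration, a single vanilla RNN step can be sketched as follows (the dimensions and weight scales below are illustrative choices, not taken from the paper):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b, act=np.tanh):
    # h_t = act(W_x x_t + W_h h_{t-1} + b)
    return act(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions for demonstration only.
rng = np.random.default_rng(0)
W_x = 0.1 * rng.standard_normal((8, 4))
W_h = 0.1 * rng.standard_normal((8, 8))
b = np.zeros(8)

h = np.zeros(8)
for t in range(5):  # unroll over a short input sequence
    h = rnn_step(rng.standard_normal(4), h, W_x, W_h, b)
```

Note how the same recurrent weight W_h is reused at every step; this repeated multiplication is what later makes the gradients shrink or explode.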
Despite the good performance of RNNs, real-world problems are becoming more complicated, meaning that plain vanilla RNNs cannot sufficiently solve them. The basic approach to solving complex problems with deep learning is to create a deeper or more complex network. This is also true in RNN research, i.e., the stacking of multiple recurrent layers or the use of more complex cells, such as long short-term memory (LSTM) [10], gated recurrent unit (GRU) [11], and neural architecture search (NAS) [12] cells. Both LSTM and GRU, unlike a vanilla RNN, use gates based on the sigmoid function to pass information to the next time step. Due to the gate concept, LSTM and GRU can represent complex dependencies and mitigate the gradient vanishing problem. However, when we stack multiple layers of LSTM or GRU cells to solve complex problems, the gradient vanishing problem arises again. This occurs because the weights are multiplied iteratively when we train RNNs, and the activation functions used in RNNs are usually the hyperbolic tangent and sigmoid functions, which hinder the learning process because their derivatives are small.
To solve the gradient vanishing problem in RNNs, several approaches have been developed. Gulcehre et al. proposed hard-sigmoid and hard-tanh activation functions to reduce gradient vanishing in RNNs [13]. Le et al. proposed IRNN, which uses identity-matrix and scaled weight initialization to apply ReLU to RNNs [14]. Li et al. proposed the independently recurrent neural network (IndRNN), which uses the Hadamard product for independently learned neurons and thereby also enables ReLU as an activation function [15]. Gonnet and Deselaers proposed independently long short-term memory (ILSTM), which applies the IndRNN concept to LSTM, resulting in better performance while also avoiding the overfitting issue [16]. LSTM is expressed as follows:

f_t = σ(W_f [x_t, h_{t-1}] + b_f)
i_t = σ(W_i [x_t, h_{t-1}] + b_i)
j_t = λ_j(W_j [x_t, h_{t-1}] + b_j)
o_t = σ(W_o [x_t, h_{t-1}] + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ j_t
h_t = o_t ⊙ λ_h(c_t)    (2)

where x_t and h_t are correspondingly the input and hidden state at time step t; ⊙ represents the Hadamard product; σ, λ_j, and λ_h are the sigmoid function and the activation functions used to calculate j_t and h_t, respectively; and W and b are the learned parameters of LSTM cells. However, despite the superior performance of ILSTM, the gradient vanishing problem can also occur in this case because it uses a hyperbolic tangent for the activation functions λ_j and λ_h. In this paper, we propose two novel activation techniques for RNNs. In the first, we newly locate the activation function λ_c at the cell state instead of λ_h at the hidden state, as shown in Figure 1. The new position of the activation function makes the proposed cell transfer larger gradients to the next layer while retaining the complexity across time steps. Thus, the proposed cell is referred to as long short-time complex memory (LSTCM). With this new activation technique, the proposed LSTCM cell reduces the gradient vanishing problem across layers, allowing a deeper network to be created and trained for complex problems.
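The standard LSTM gate computations can be sketched in NumPy as a minimal single-step implementation (stacking the four gate weight matrices into one W is a layout assumption for compactness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W stacks the four gate weights; it maps [x_t; h_{t-1}] to the
    # pre-activations of f_t, i_t, j_t, and o_t.
    z = W @ np.concatenate([x_t, h_prev]) + b
    f_, i_, j_, o_ = np.split(z, 4)
    f, i, o = sigmoid(f_), sigmoid(i_), sigmoid(o_)
    j = np.tanh(j_)                 # lambda_j
    c_t = f * c_prev + i * j        # cell state update
    h_t = o * np.tanh(c_t)          # lambda_h applied on the way to the hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
H, D = 8, 4
W = 0.1 * rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```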
The second technique is the novel application of a sinusoidal function as an activation function for RNNs. Sitzmann et al. proposed the sinusoidal function as an activation function for CNNs with well-initialized weights in implicit neural representations such as natural images and 3D shapes [17]. Thus, we apply the sinusoidal function as the activation function for LSTM and the proposed LSTCM instead of a hyperbolic tangent. Experiments on various tasks demonstrate that the proposed LSTCM cell outperforms the LSTM cell in deeper networks. Moreover, when the sinusoidal function is used as the activation function for LSTM and LSTCM cells, they outperform their counterparts with the traditional hyperbolic tangent activation function. This paper is organized as follows. Section II proposes the new activation techniques for RNNs, including the LSTCM cell and the sinusoidal activation function. In Section III, the experiments conducted here, in this case a language modeling task and a machine translation task, are described and the results are discussed. Finally, concluding remarks follow in Section IV.

II. METHODS
In this section, we propose novel activation techniques for LSTM, including the LSTCM cell and the sinusoidal activation function. First, we explain backpropagation through time in LSTM, which is the basis of the proposed LSTCM cell. Then, the proposed LSTCM cell is explained in detail. We also explain how to apply the sinusoidal function as the activation function for LSTM and LSTCM instead of the hyperbolic tangent function.

A. BACKPROPAGATION THROUGH TIME IN LSTM
Despite the fact that LSTM has long been studied, the vanishing gradient problem remains associated with it. Equation (3) expresses backpropagation through time (BPTT) of LSTM at time step t:

δh_t = Δ_t + W_{hf}^T δf̄_{t+1} + W_{hi}^T δī_{t+1} + W_{hj}^T δj̄_{t+1} + W_{ho}^T δō_{t+1}
δō_t = δh_t ⊙ λ_h(c_t) ⊙ σ'(ō_t)
δc_t = δh_t ⊙ o_t ⊙ λ'_h(c_t) + δc_{t+1} ⊙ f_{t+1}
δj̄_t = δc_t ⊙ i_t ⊙ λ'_j(j̄_t)
δī_t = δc_t ⊙ j_t ⊙ σ'(ī_t)
δf̄_t = δc_t ⊙ c_{t-1} ⊙ σ'(f̄_t)    (3)

In Equation (3), δx refers to ∂L/∂x, where L is the loss function; Δ_t is the cumulative gradient from the layers above, calculated from all gradients of each state vector; W_{h·} denotes the recurrent (hidden-state) block of the corresponding weight matrix; λ' and σ' are the derivatives of the activation functions; and {ō_t, ī_t, j̄_t, f̄_t} are the state vectors before the activation functions. As shown in Equation (3), the gradients are calculated through recurrent multiplication with λ', which is less than 1, over the time steps until t = 0, at which point the gradient vanishing problem can occur. Additionally, all RNN architectures stack multiple cells as layers to create a deep network [18], implying that h_{n,t}, the hidden state of the n-th layer, becomes x_{n+1,t}, the input to the (n+1)-th layer, and h_{n,t} is calculated using c_{n,t} after the activation function λ_h. Thus, when we backpropagate the gradients in LSTM, the small value λ' is propagated through the layers and the gradient decay accelerates.
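The decay described above can be seen numerically: repeatedly multiplying a gradient by tanh' values below 1 drives it toward zero. The fixed pre-activation value below is a toy illustration, not a measurement from the paper.

```python
import numpy as np

def tanh_prime(z):
    # Derivative of the hyperbolic tangent: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

grad = 1.0
for _ in range(50):          # 50 backward steps through time/layers
    grad *= tanh_prime(1.0)  # tanh'(1) is roughly 0.42 < 1
# After 50 multiplications the gradient has shrunk by many orders of magnitude.
```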

B. THE PROPOSED LSTCM CELL
The proposed LSTCM cell applies an activation function to the cell state instead of the hidden state. This is described as follows:

f_t = σ(W_f [x_t, h_{t-1}] + b_f)
i_t = σ(W_i [x_t, h_{t-1}] + b_i)
j_t = λ_j(W_j [x_t, h_{t-1}] + b_j)
o_t = σ(W_o [x_t, h_{t-1}] + b_o)
c_t = λ_c(f_t ⊙ c_{t-1} + i_t ⊙ j_t)
h_t = o_t ⊙ c_t    (4)

As shown in Equation (4), the difference between LSTM and the proposed LSTCM is that LSTCM applies the activation function λ_c to the cell state instead of λ_h to the hidden state. When we stack multiple LSTCM cells as layers to create a deep network, h_{n,t} is calculated using c_{n,t} without an additional activation function. Thus, the backpropagated gradients through f_t, i_t, and j_t exceed those of LSTM.
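The only change relative to the LSTM step is where the activation is applied; a minimal sketch follows (the stacked-weight layout and the use of tanh for both λ_j and λ_c are assumptions for compactness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstcm_step(x_t, h_prev, c_prev, W, b):
    # Same gate layout as the LSTM sketch: W maps [x_t; h_{t-1}]
    # to the pre-activations of f_t, i_t, j_t, and o_t.
    z = W @ np.concatenate([x_t, h_prev]) + b
    f_, i_, j_, o_ = np.split(z, 4)
    f, i, o = sigmoid(f_), sigmoid(i_), sigmoid(o_)
    j = np.tanh(j_)                    # lambda_j
    c_t = np.tanh(f * c_prev + i * j)  # lambda_c applied to the cell state
    h_t = o * c_t                      # no activation on the hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
H, D = 8, 4
W = 0.1 * rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = lstcm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

Because h_t is a plain product o_t ⊙ c_t, the backward path from h_t to c_t carries no extra derivative factor, which is the source of the larger inter-layer gradient.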
Equation (5) shows the BPTT of the proposed LSTCM:

δh_t = Δ_t + W_{hf}^T δf̄_{t+1} + W_{hi}^T δī_{t+1} + W_{hj}^T δj̄_{t+1} + W_{ho}^T δō_{t+1}
δō_t = δh_t ⊙ c_t ⊙ σ'(ō_t)
δc_t = δh_t ⊙ o_t + δc_{t+1} ⊙ λ'_c(c̄_{t+1}) ⊙ f_{t+1}
δj̄_t = δc_t ⊙ λ'_c(c̄_t) ⊙ i_t ⊙ λ'_j(j̄_t)
δī_t = δc_t ⊙ λ'_c(c̄_t) ⊙ j_t ⊙ σ'(ī_t)
δf̄_t = δc_t ⊙ λ'_c(c̄_t) ⊙ c_{t-1} ⊙ σ'(f̄_t)    (5)

where c̄_t = f_t ⊙ c_{t-1} + i_t ⊙ j_t is the cell state before the activation function. In Equation (5), δx refers to ∂L/∂x, where L is the loss function; Δ_t is the cumulative gradient from the layers above, calculated from all gradients of each state vector; λ' and σ' are the derivatives of the activation functions; and {ō_t, ī_t, j̄_t, f̄_t} are the state vectors before the activation functions. The greatest difference from LSTM is δc_t. In the BPTT of LSTM, δc_t is calculated by multiplying λ'_h(c_t) by δh_t, whereas in the BPTT of the proposed LSTCM, δc_t is calculated by multiplying λ'_c(c̄_{t+1}) by δc_{t+1}. Thus, in the proposed LSTCM cell, the gradient through c_{t+1} becomes smaller, but the gradient through h_t becomes larger. Consequently, the proposed LSTCM cell backpropagates a larger gradient through the layers than LSTM and shows better performance in deeper networks.

C. RECURRENT WEIGHT
In this subsection, we explain the recurrent weights, which are the most important part of RNNs for maintaining information from previous time steps to the current one [19]. This means that the past state h_t and its gradient affect the current state h_T and its gradient in RNNs. Thus, for stable learning of RNNs, the gradient must remain in [ε, γ], i.e., ε ≤ ‖∂J_T/∂h_t‖ ≤ γ, where J_T is the objective function to minimize at time step T. In this formulation, when the calculated gradient is less than ε, the gradient vanishing problem occurs, and when the calculated gradient is larger than γ, the gradient exploding problem occurs. Therefore, if we initialize the recurrent weights in a certain range to keep the gradient in [ε, γ], then RNNs are trained in a stable manner without gradient vanishing or exploding problems. We conducted experiments to find the proper initialization for recurrent weights in LSTCM cells, and the results are explained in Section III-D.
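One simple way to realize such a bounded initialization, consistent with the "initial weight of 0.1" used in the experiments, is a uniform draw with a capped scale (the helper name and the random sign-free layout are illustrative, not from the paper):

```python
import numpy as np

def init_recurrent_weights(shape, scale=0.1, rng=None):
    # Uniform initialization in [-scale, scale]. Section III-D finds
    # experimentally that scales in roughly [0.01, 0.22] keep the
    # gradient inside the stable range [eps, gamma].
    rng = rng or np.random.default_rng()
    return rng.uniform(-scale, scale, size=shape)

W_h = init_recurrent_weights((8, 8), scale=0.1, rng=np.random.default_rng(3))
```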

D. USING AN ACTIVATION FUNCTION WITH A SINUSOIDAL FUNCTION
Several studies have used a periodic function as an activation function for deep neural networks [20], [21]. In particular, Sitzmann et al. held that a periodic activation function was better than traditional activation functions for complicated signal problems such as natural images and 3D shapes [17]. The natural language problem is also a complicated signal problem, and LSTM and the LSTCM proposed in this paper also use activation functions, specifically λ_h and λ_c, respectively. Thus, we apply the sinusoidal function as the activation function λ_h of LSTM and λ_c of LSTCM instead of a hyperbolic tangent function. When we apply the sinusoidal activation function and train the network for machine translation tasks, the gradient exploding problem occurs. Therefore, we restrict the range of the sinusoidal activation function and use this range-restricted form as the final activation function.
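The exact range restriction used in the paper's final activation is not reproduced here; one plausible sketch clips the pre-activation to the monotonic region of sine (the clipping bound is an assumption):

```python
import numpy as np

def sin_activation(z, bound=np.pi / 2):
    # Hypothetical range-restricted sinusoidal activation: clipping the
    # pre-activation to [-pi/2, pi/2] keeps sin monotonic and bounds
    # the output in [-1, 1], which limits gradient explosion.
    return np.sin(np.clip(z, -bound, bound))
```

Like tanh, this function is zero-centered and bounded, but its derivative cos(z) stays close to 1 over a wider band around zero, which is one intuition for why a sinusoid can pass larger gradients.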

III. EXPERIMENTS
In this section, we verify that the proposed LSTCM cell outperforms LSTM on certain natural language tasks.

A. LANGUAGE MODELING TASK
We experimentally tested the proposed LSTCM cell on a language modeling task using the Penn Treebank (PTB) dataset [22]. The task on the PTB dataset is to predict the next word based on a previous sequence of words. The training parameters were an initial weight of 0.1, an initial learning rate of 1.0 (decayed by half at the 1/2 and 3/4 epoch marks), a batch size of 512, 1000 hidden neurons, and a dropout rate of 0.3. The experimental environment was based on https://github.com/KangSooHan/LSTCM. In addition, to prevent overfitting and to ensure stable learning, we applied dropout [23], gradient clipping [18], and warmup steps during the learning process.
To verify the effect of a deeper network on the language modeling task based on the PTB dataset, we compared one, three, and six layers of LSTM, ILSTM, LSTCM, and ILSTCM cells. We set 40, 80, and 120 epochs for the one-, three-, and six-layer models, respectively, by considering the overfitting point at which the training perplexity decreased but the validation perplexity increased. Each model was trained three times independently, and the final experimental results were calculated by averaging the perplexity of the three outcomes. Table 1 shows the results of the language modeling task on the PTB dataset. As the results show, the shallowest model performed best. Because the language modeling task is relatively simple, a deeper network shows no benefit, and the advantage of the proposed LSTCM, which transfers more of the gradient between layers, is therefore not apparent on the language modeling task.

B. MACHINE TRANSLATION TASK
Because the effect of the proposed LSTCM cell was not clear in the relatively simple task described above, we applied it to a more complex task, in this case a machine translation task. We used the IWSLT 2015 English-Vietnamese dataset [24] and the WMT 2014 English-German dataset for the machine translation task. The training parameters were 10 epochs, an initial learning rate of 0.5 (decayed by half at the 1/2 and 3/4 epoch marks), a batch size of 128, 512 hidden neurons, and a dropout rate of 0.3. The experimental environment can be found at https://github.com/tensorflow/nmt [25]. We used sequence-to-sequence models [6] and the Google Neural Machine Translation (GNMT) model [7] as the base models, which consist of RNN cells and show the best performance on the machine translation task.

1) MODEL
For the experiments, we used two backbone networks based on a sequence-to-sequence model. For the IWSLT 2015 English-Vietnamese dataset, we used a sequence-to-sequence model based on a basic encoder-decoder structure along with the Luong attention mechanism [6]. The encoder consists initially of bidirectional cells followed by stacked unidirectional cells. The attention over the encoder outputs is calculated using the Luong attention mechanism and then passed to the decoder. The decoder consists of stacked unidirectional cells, and it predicts the next word based on the attention value from the encoder and the input words. For the WMT 2014 English-German dataset, we used the GNMT model. The GNMT model is a deep LSTM network with encoder and decoder layers that also uses residual connections along with attention connections from the decoder to the encoder. The GNMT model calculates the attention value of each unidirectional cell in the encoder, and the decoder predicts the next word based on each previous attention value from the encoder. To observe the effect of a deeper network on the machine translation task, we compared models with one, four, and seven layers using LSTM, ILSTM, LSTCM, and ILSTCM cells. A model with i layers refers to the setting of i layers for the encoder, with the first layer as a bidirectional cell, while i − 1 layers are set for the decoder without a bidirectional cell.

2) DATASETS
The IWSLT data used here comes from translated TED talks and contains 133K training sentence pairs. The dataset is provided by the IWSLT 2015 Evaluation Campaign [24]. We applied a data preprocessing method [26] and thus obtained 17.2K vocabulary items for English and 7.7K for Vietnamese. We validated and tested the model using the TED tst2012 and tst2013 sets, respectively. The WMT dataset contains approximately 4M sentence pairs. Sentences were encoded by means of byte-pair encoding [27] with a shared source-target vocabulary of approximately 37K tokens.

3) TRAINING PARAMETERS
The training parameters were an initial weight of 0.1, an initial learning rate of 0.2, a batch size of 100, 512 hidden neurons, and a dropout rate of 0.3. Additionally, to prevent overfitting and to ensure stable learning, we applied dropout [23], gradient clipping [18], and warmup steps during the learning process. For the IWSLT 2015 English-Vietnamese dataset, which is relatively small, we used 60,000 training steps, and the learning rate decayed by half at the 1/2 and 3/4 marks of the training steps. For the WMT 2014 English-German dataset, we used 350,000 training steps, and the learning rate decayed by half every 17,500 steps after the halfway point of training.
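The WMT 2014 decay rule can be sketched as a step function (the function and parameter names are illustrative, and the exact step boundaries are an assumed reading of the schedule):

```python
def lr_schedule(step, base_lr=0.2, total_steps=350_000, halve_every=17_500):
    """Constant learning rate for the first half of training, then
    halved every 17,500 steps (illustrative reading of the schedule)."""
    half = total_steps // 2
    if step < half:
        return base_lr
    n_halvings = (step - half) // halve_every + 1
    return base_lr / (2 ** n_halvings)
```

For example, the rate stays at 0.2 until step 175,000, drops to 0.1 there, and halves again every 17,500 steps thereafter.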

C. COMPARISON RESULT WITH LSTM
As shown in Section III-A, the proposed LSTCM cell did not show an advantage over the LSTM cell on the language modeling task because that task is relatively simple and does not require a deeper network. However, on a more complex task, in this case the machine translation task in Section III-B, the proposed LSTCM cell outperformed the LSTM cell. We used the IWSLT 2015 and WMT 2014 datasets to compare the proposed LSTCM and LSTM cells on the machine translation task. Table 2 shows the results of the sequence-to-sequence model with Luong attention using the proposed LSTCM and ILSTCM cells as well as LSTM, ILSTM, and GRU cells trained on the IWSLT 2015 dataset. We utilized two, four, and seven layers to compare the performance according to layer depth. Each model underwent three independent training runs, and the final experimental results were calculated by averaging the perplexity and BLEU scores of these three runs.

1) IWSLT 2015 ENGLISH-VIETNAMESE
As shown in Table 2, the four-layer model showed the best overall performance. The tst2012 BLEU scores of the four-layer models using ILSTM and ILSTCM cells were 24.50 and 24.37, respectively, and the corresponding tst2013 BLEU scores were 26.71 and 26.84. Accordingly, there were no major differences between the ILSTM and ILSTCM cells. On the other hand, the tst2012 and tst2013 BLEU scores for seven layers, with or without a skip connection, using ILSTM and ILSTCM cells showed that the proposed ILSTCM cell outperformed the ILSTM cell at a meaningful level. Also, the ILSTM and ILSTCM cells, which applied the aforementioned independent approach, outperformed the vanilla LSTM and LSTCM cells, respectively, because the independent cells mitigated the overfitting problem. Table 3 shows the results of the GNMT model using the proposed LSTCM and ILSTCM cells, as well as LSTM and ILSTM cells, trained on the WMT 2014 dataset. We utilized two, four, and seven layers to compare the performance capabilities according to layer depth. Each model underwent three independent training runs, and the final experimental results were calculated by averaging the perplexity and BLEU scores of these three runs.

2) WMT 2014 ENGLISH-GERMAN
As shown in Table 3, the seven-layer model with a skip connection using the ILSTCM cell showed the best overall performance in terms of both the tst2013 and tst2014 BLEU scores (24.09 and 25.26, respectively). Unlike with the IWSLT 2015 dataset, the seven-layer model showed the best performance on the WMT 2014 dataset, confirming that a deeper network is better on complex and large datasets. As in the experiment with the IWSLT 2015 dataset, the ILSTM and ILSTCM cells, which applied the independent approach, outperformed the vanilla LSTM and LSTCM cells, respectively, and the proposed LSTCM cell showed better performance than the LSTM cell in a deeper network. Additionally, Figures 2 and 3 depict the average gradient when training the seven-layer GNMT model using ILSTCM and ILSTM cells without and with a skip connection on the WMT 2014 dataset, respectively. As shown in Figures 2 and 3, the average gradient for the ILSTCM cell is greater than that for the ILSTM cell. This result verifies that the better performance of the proposed LSTCM cell stems from the greater gradient transfer compared to that of the LSTM cell after applying the activation function to the cell state instead of the hidden state.

D. WEIGHT INITIALIZATION IN LSTCM
For the stable learning of RNNs, the recurrent weights need to be initialized properly to keep the gradient in [ε, γ], i.e., ε ≤ ‖∂J_T/∂h_t‖ ≤ γ. Thus, we found the proper weight initialization range experimentally. Figure 4 shows the results of the sequence-to-sequence model with Luong attention using the proposed ILSTCM cell under different weight initializations. As shown in Figure 4, when the weights were initialized to values greater than 0.22, the model did not learn because gradient exploding occurred. In this case, the BLEU score was NaN, so it is represented as 0 in the graph. Moreover, when the weights were initialized to values less than 0.01, the BLEU score was very low compared to the other weight initialization cases because gradient vanishing occurred. Thus, we can conclude that the proper weight initialization range for stable learning of LSTCM cells is [0.01, 0.22].

E. COMPARISON RESULT WITH THE SINUSOIDAL ACTIVATION FUNCTION
Table 4 shows the results of the sequence-to-sequence model with Luong attention using the proposed ILSTCM cell and the ILSTM cell with the sinusoidal activation function, trained on the IWSLT 2015 dataset. We compare the results with the sinusoidal activation function against those with the hyperbolic tangent activation function. As shown in Table 4, every layer combination with the sinusoidal activation function outperformed the corresponding case with the hyperbolic tangent activation function. The proposed ILSTCM cell in particular showed a greater improvement than the ILSTM cell when we applied the sinusoidal activation function, and the best overall performance was achieved by the four-layer model using the ILSTCM cell with the sinusoidal activation function (tst2012 BLEU score: 24.71 and tst2013 BLEU score: 27.96). Thus, we can conclude that the sinusoidal activation function is better than the hyperbolic tangent activation function for complicated networks.

F. DISCUSSION
In this subsection, we discuss the experimental results and advantages. The results show that, for a relatively simple task such as language modeling on the PTB dataset, a deeper network showed lower performance than a shallow network. On the other hand, for a more complex task such as machine translation, a deeper network using four or seven layers showed better performance than a shallow network. More specifically, because the IWSLT 2015 dataset contains fewer words and sentences than the WMT 2014 dataset, the four-layer model outperformed the seven-layer model on machine translation with the IWSLT 2015 dataset, whereas the seven-layer model with a skip connection performed best on machine translation with the WMT 2014 dataset. Additionally, ILSTM and ILSTCM, which apply the independent concept to LSTM and LSTCM, showed worse perplexity during the training phase. However, better perplexity during the training phase did not guarantee a better-trained model. On the language modeling task with the PTB dataset, the difference in perplexity between the training and test datasets was smaller when ILSTM and ILSTCM were applied than when LSTM and LSTCM were applied, meaning that the independent concept prevents the network overfitting problem. Moreover, on the machine translation task, the models using LSTM and LSTCM showed better perplexity in the training phase, whereas the models using ILSTM and ILSTCM showed better BLEU scores in the test phase. Thus, we can conclude that applying the independent concept to LSTM and LSTCM cells trains the network in the proper direction and prevents the overfitting problem.
The basic structure of the proposed LSTCM cell is similar to that of the LSTM cell. Thus, well-studied approaches for LSTM, especially distributed learning across multiple clusters or GPUs and performance improvement techniques such as dropout and layer normalization, can also be applied to LSTCM in the same manner. Moreover, as shown in Table 5, the training times for the proposed LSTCM and LSTM did not differ much; therefore, existing applications of RNNs can easily use LSTCM cells instead of LSTM cells.

IV. CONCLUSION
This paper proposed what is termed a long short-time complex memory (LSTCM) cell to address the gradient vanishing problem in recurrent neural networks (RNNs) and long short-term memory (LSTM), especially when the network is deep. The proposed LSTCM cell applies an activation function to the cell state instead of the hidden state to transfer more of the gradient to the next layer. Moreover, we applied a sinusoidal function as the activation function of the LSTCM cell instead of a hyperbolic tangent function. We conducted experiments on language modeling and machine translation tasks based on the PTB, IWSLT 2015, and WMT 2014 datasets. The experimental results showed that the proposed LSTCM cell outperformed the LSTM cell in deeper networks for complex tasks. Furthermore, ILSTCM, which applies the independent concept to LSTCM, showed more stable training by preventing the overfitting problem.
SOO-HAN KANG received the B.S. degree in computer science and engineering from the Seoul National University of Science and Technology, Seoul, South Korea, in 2019, where he is currently pursuing the M.S. degree.
His research interests include machine learning and human-robot interaction.
JI-HYEONG HAN received the B.S. and Ph.D. degrees in electrical engineering from KAIST, Daejeon, South Korea, in 2008 and 2015, respectively.
From 2015 to 2017, she was a Senior Researcher with the Electronics and Telecommunications Research Institute, Daejeon. Since 2017, she has been with the Seoul National University of Science and Technology, Seoul, South Korea, where she is currently an Assistant Professor. Her research interests include machine learning, human-centered intelligent robotics, and human-robot interaction.