Introduction
Natural Language Processing (NLP) is a vast area of computer science concerned with the interaction between computers and human language. Language modeling is a fundamental task in artificial intelligence and NLP; a language model is formalized as a probability distribution over a sequence of words. Recently, deep learning models have achieved remarkable results in speech recognition [1] and computer vision [2]. Text classification plays an important role in many NLP applications, such as spam filtering, email categorization, information retrieval, web search and ranking, and document classification [3], [4], in which one needs to assign predefined categories to a sequence of text. A popular and common method to represent texts is the bag-of-words. However, the bag-of-words method loses word order and ignores the semantics of words. N-gram models are popular for statistical language modeling and usually perform the best [5]; however, an n-gram model suffers from data sparsity [6].
Neural networks have become increasingly popular [7]. With the progress of machine learning in recent years, it has become possible to train more complex models on much larger datasets [1], [2], [7]–[9]. Neural language models outperform n-gram models and overcome the data sparsity problem [6], and semantically similar words lie close to each other in the learned vector space, although the embeddings of rare words are poorly estimated, which leads to higher perplexities for rare words. The distributed representation of words is one of the most successful of these concepts, and it helps learning algorithms achieve better performance [7].
Convolutional Neural Networks (CNN) [10] recently achieved very successful results in computer vision [2]. A CNN considers feature extraction and classification as one joint task. This idea has been improved by stacking multiple convolutional and pooling layers, which sequentially extract a hierarchical representation of the input [10]–[12].
We investigate Recurrent Neural Networks (RNNs) as an alternative to pooling layers in deep neural network language models for sentiment analysis of short texts. Most deep learning architectures for NLP require stacking many layers to capture long-term dependencies, due to the locality of the convolutional and pooling layers [13]. Our architecture was inspired by the recent success of RNNs in NLP applications and by the fact that RNNs can capture long-term dependencies even with a single layer [14]. We were also inspired by the successful work proposed in [9], where a single layer of CNN was applied for sentence classification.
It turns out that providing the network with good initialization parameters can have a significant impact on the accuracy of the trained model and on how efficiently long-term dependencies are captured. In this paper, we present a joint CNN and RNN architecture that takes the local features extracted by a CNN as the input to an RNN for sentiment analysis of short texts. We propose a new framework that combines convolutional and recurrent layers into one single model built on top of pre-trained word vectors. We utilize long short-term memory (LSTM) as a substitute for pooling layers in order to reduce the loss of detailed local information and capture long-term dependencies across the input sequence. Our contributions are summarized below:
Word embeddings are initialized using a neural language model [7], [8], which is trained in an unsupervised manner on a large collection of text.
We use a convolutional neural network to further refine the embeddings on a distantly supervised dataset. We take the word embeddings as the input to our model, in which windows of different lengths and various weight matrices are applied to generate a number of feature maps.
The word embeddings and other parameters of the network obtained at the previous stage are used to initialize the same framework.
The deep learning framework takes advantage of the encoded local features extracted from the CNN model and the long-term dependencies captured by the RNN model. Empirical results demonstrated that our framework achieves competitive results with fewer parameters.
The rest of the paper is organized as follows. Section II presents related works. Section III introduces background. Section IV highlights the research problem and motivation. Section V describes in detail our model architecture. Section VI outlines the experimental setup, and Section VII discusses the empirical results and analysis. Finally, Section VIII presents the conclusion.
Related Work
A. Traditional Methods
Text classification is significant for NLP systems, and there has been an enormous amount of research on sentence classification tasks, specifically on sentiment analysis. NLP systems classically treat words as discrete, atomic symbols, so the model can leverage only a small amount of information regarding the relationships between the individual symbols.
A simple and efficient baseline method for representing a sentence is the bag-of-words, on top of which a linear classifier (e.g., logistic regression) is trained. However, the bag-of-words approach omits all of the information about the semantics and ordering of words [15], [16]. N-gram models are another popular method to represent a sentence and usually perform the best [5]: words are projected into a high-dimensional space, and the embeddings are then combined to obtain a fixed-size representation of the input sentence, which is later used as the input to the classifier. Although n-gram models take word ordering into account in short sentences, they still suffer from data sparsity. Overall, all of these simple techniques have limitations for certain tasks. Furthermore, linear classifiers do not share parameters among features and classes, which may limit their generalization in the context of a large output space where some classes have very few examples. A popular solution to this problem is to use multilayer neural networks [13], [16], or to factorize the linear classifier into low-rank matrices [8].
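As a concrete illustration of this baseline, the sketch below (using scikit-learn on made-up toy sentences, not our experimental data) represents each sentence as a bag-of-words vector and trains a logistic regression on top; note how the representation discards word order entirely.

```python
# Minimal bag-of-words + logistic regression baseline (illustrative only;
# the toy sentences and labels below are made up for the example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a great and moving film", "dull plot and flat acting",
         "moving performances throughout", "flat and lifeless"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# CountVectorizer discards word order entirely: "not good, just bad" and
# "not bad, just good" map to the same vector.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["a moving film"]))
```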
B. Deep Learning Methods
Deep Neural Networks (DNNs) have achieved significant results in computer vision [2], [17] and speech recognition [1]. Recently, it has become more common to use DNNs in NLP applications, where much of the work involves learning word representations through neural language models [6]–[9] and then performing a composition over the learned word vectors for classification. These approaches have led to new methods for solving the data sparsity problem. Consequently, several neural network-based methods for learning word representations followed these approaches.
DNNs jointly implement feature extraction and classification for text classification. DNN-based approaches usually start with an input text represented as a sequence of words, where each word is represented as a one-hot vector; each word in the sequence is then projected into a continuous vector space by multiplying it with a weight matrix, which yields a sequence of dense, real-valued vectors. This sequence is fed into a DNN, which processes it in multiple layers and produces a prediction probability. This pipeline is tuned jointly to maximize the classification accuracy on the training sets [7]–[9], [12], [13], [17], [18]. However, a one-hot vector makes no assumptions about the similarity of words, and it is also very high-dimensional [9], [18].
RNNs analyze a text word by word with low time complexity and preserve the semantics of all of the previous text in a fixed-size hidden layer [19]. This capability to capture long-range contextual information makes RNNs valuable for modeling the semantics of a long text. However, an RNN is a biased model: recent words are more significant than earlier words, whereas the key components of a document could appear anywhere, not only at the end. This can reduce efficiency when capturing the semantics of a whole document. The long short-term memory (LSTM) model was introduced to overcome these difficulties of the RNN [20].
A standard RNN makes predictions by considering only the past words for a specific task. This is suitable for predicting the next word in context. However, for some tasks it would be beneficial to use both past and future words, as in tagging tasks such as part-of-speech tagging, where we need to assign a tag to each word in a sentence [21]. In this case, we already know the full sequence of words, and for each word we want to take both the words to its left (past) and to its right (future) into consideration when making a prediction. That is exactly what a Bidirectional Neural Network (BNN) does: it consists of two LSTMs, one running forward from left to right and the other running backward from right to left. This technique is successful in tagging tasks and for embedding a sequence into a fixed-length vector [18].
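The following is a minimal PyTorch sketch of such a bidirectional LSTM used for tagging; the vocabulary, embedding, and hidden sizes are illustrative assumptions rather than values from any of the cited systems.

```python
# Minimal sketch of a bidirectional LSTM tagger in PyTorch (hypothetical
# sizes). One LSTM reads left-to-right, the other right-to-left, and their
# hidden states are concatenated for each token.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_tags = 1000, 100, 64, 5
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
tagger = nn.Linear(2 * hidden_dim, num_tags)  # forward + backward states

tokens = torch.randint(0, vocab_size, (1, 10))  # one 10-word sentence
states, _ = bilstm(embed(tokens))               # (1, 10, 2*hidden_dim)
tag_scores = tagger(states)                     # one tag score vector per word
print(tag_scores.shape)                         # torch.Size([1, 10, 5])
```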
Convolutional Neural Networks (CNNs) were initially designed for computer vision [2], [10]. CNNs exploit layers with convolving filters that are applied to local features. CNNs reached outstanding results in computer vision, where handcrafted features such as the scale-invariant feature transform (SIFT) followed by a classifier had previously been used; the main idea is to consider feature extractors and classifiers as one jointly trained task [9], [17]. The use of neural networks inspired many researchers after the successful approaches in [6], [16], and [17]. This area has been investigated in recent years, especially by stacking multiple convolutional and pooling layers in CNNs to sequentially extract hierarchical representations of the input.
CNN models for NLP achieved excellent results in semantic parsing [22], sentence modeling [11], search query retrieval [23], and other NLP tasks [17]. Recently, DNN-based models have shown very good results for several tasks in NLP [9], [12], [13], [18]. Despite the good performance of these models, in practice they are relatively slow to train and test, which prevents them from being used on large-scale data, and they require stacking many convolutional layers in order to capture long-term dependencies.
The combination of CNNs and RNNs has been explored for speech recognition [24], and a similar approach was applied to image classification [2]. Reference [18] investigated a CNN-RNN combination to encode character-level input, building a high-level feature sequence over characters to capture sub-word information; however, this model performs best when a large number of classes are available. Reference [25] outlined structured attention networks, which incorporate graphical models to generalize simple attention, and described the technical machinery and computational techniques needed to backpropagate through models of this form. Reference [26] aimed to improve representation efficiency with a Differential State Framework (DSF); DSF models maintain longer-term memory by learning to interpolate between a fast-changing, data-driven representation and a slowly changing, implicitly stable state. Reference [27] investigated an approach to improve the accuracy of deep learning methods for sentiment analysis by incorporating domain knowledge; it combined domain knowledge with deep learning by using sentiment scores learned by regression to augment the training data, and it employed a weighted cross-entropy loss with a penalty matrix as an enhanced loss function.
We observed that the use of a vanilla CNN for text classification has one drawback: as in [18], the network must have many layers in order to capture long-term dependencies in an input sentence. This may be the motivation behind [12], which utilized a very deep convolutional network with six convolutional layers followed by two fully connected layers.
Background
A. Convolutional Neural Networks
Recently, CNNs have been applied to NLP systems and accomplished very interesting results [9], [13], [18]. A convolutional layer is similar to a sliding window over a matrix: a CNN stacks several layers of convolutions with nonlinear activation functions, such as ReLU or tanh, applied to the results. In a classical feed-forward neural network, each input neuron is connected to each output neuron in the next layer; this is called a fully connected or affine layer. CNNs take a different approach: they use convolutions over the input layer to compute the output, so that only local connections contribute to each output, and each layer applies different kernels, usually hundreds or thousands of filters, and then combines their results.
Through pooling or subsampling layers, and during the training stage, a CNN learns the values of its filters based on the task. For instance, in image classification [2], a CNN might learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then fed to a classifier that uses these high-level features. But how does this apply to NLP?
As an alternative to image pixels, the input to most NLP tasks consists of sentences or documents represented as a matrix, where each row corresponds to one token, usually a word or a character. Each row is a vector that represents a word. Typically, this vector is a low-dimensional word embedding (e.g., word2vec), although it can also be a one-hot vector that indexes the word in a vocabulary. For example, a ten-word sentence with a 100-dimensional embedding yields a 10×100 matrix as our input. In NLP, a filter slides over full rows (words) of the matrix; therefore, the width of the filters is the same as the width of the input matrix. The region size may vary, but a filter usually slides over two to five words at a time.
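A minimal sketch of this setup is shown below, assuming a ten-word sentence, 100-dimensional embeddings, and a region size of three words (all illustrative values): the convolution kernel spans the full embedding dimension, so it covers whole words at a time.

```python
# Sketch of the "filter spans full word vectors" idea (hypothetical sizes).
# A 10-word sentence with 100-dimensional embeddings is a 10x100 matrix;
# a filter with region size h=3 covers 3 whole rows (words) at a time.
import torch
import torch.nn as nn

sentence = torch.randn(1, 1, 10, 100)         # (batch, channel, words, embed_dim)
conv = nn.Conv2d(in_channels=1, out_channels=8,
                 kernel_size=(3, 100))        # filter width = embedding dimension
feature_maps = conv(sentence)                 # (1, 8, 10-3+1, 1)
print(feature_maps.shape)                     # torch.Size([1, 8, 8, 1])
```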
B. Recurrent Neural Networks
The intuition behind RNNs is that humans do not start their thinking from scratch every second. The objective of an RNN is to make use of sequential information: the output at each step is based on the previous computations. In traditional neural networks, all inputs are assumed to be independent of each other, which is inadequate for many NLP tasks (e.g., predicting the next word in a sentence), where it is important to know the previous words in order to predict the next word in context. RNNs have shown great success in many NLP tasks [1], [14], [20], [21], [28]; they have a memory that captures information over arbitrarily long sequences.
RNNs are deep neural networks that are deep in the temporal dimension and have been used widely in time-sequence modeling. The objective of RNN-based sentence embedding is to find a dense, low-dimensional semantic representation by recurrently and sequentially processing each word in a sentence and mapping it into a low-dimensional vector; the global contextual features of the whole text are then encoded in the semantic representation of the last word in the sequence [1], [29], [30]. We can also think of an RNN as multiple copies of the same network, each passing a message to its successor, which becomes explicit when the loop is unrolled as shown in Figure 1.
We can compute the output as follows in a simple RNN: \begin{align} o_{t}=&~f(W_{o} h_{t}) \\ h_{t}=&~\sigma (W_{h} h_{t-1} +W_{x} x_{t}) \end{align}
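The following numpy sketch transcribes these two equations directly, using tanh for the hidden nonlinearity and softmax for the output function f, with hypothetical dimensions and random weights.

```python
# Direct numpy transcription of the simple RNN above (hypothetical sizes,
# random weights): h_t = sigma(W_h h_{t-1} + W_x x_t), o_t = f(W_o h_t).
import numpy as np

def sigma(z):                      # hidden-state nonlinearity (tanh here)
    return np.tanh(z)

def softmax(z):                    # output function f
    e = np.exp(z - z.max())
    return e / e.sum()

d_in, d_h, d_out, T = 100, 64, 5, 10
rng = np.random.default_rng(0)
W_x, W_h, W_o = (rng.normal(scale=0.1, size=s)
                 for s in [(d_h, d_in), (d_h, d_h), (d_out, d_h)])

h = np.zeros(d_h)
for t in range(T):                 # process the sequence word by word
    x_t = rng.normal(size=d_in)    # stand-in for the t-th word embedding
    h = sigma(W_h @ h + W_x @ x_t)
o = softmax(W_o @ h)               # output computed from the last hidden state
print(o.shape, o.sum())            # (5,) 1.0
```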
Research Problem and Motivation
The objective of using a convolutional layer is to learn to extract higher-level features that are invariant to local translation; by stacking multiple convolutional layers, the model can extract higher-level, translation-invariant features from the input sequence. Regardless of this advantage, we observed that most existing deep models require multiple convolutional layers to capture long-term dependencies, because of the locality of the convolutional and pooling layers. This issue becomes more crucial as the length of the input sequence grows. Moreover, most combined CNN-RNN models apply several types of pooling. We argue that the pooling layer is responsible for the loss of detailed local information, because it keeps only the most important feature in a sentence and ignores the others. We therefore utilize an RNN as an alternative to the pooling layer, both to capture long-term dependencies more efficiently and to reduce the number of parameters in the architecture. Based on these observations, we propose a simple and efficient combined model that reduces the parameter count by excluding the pooling layer from the architecture, while capturing long-term dependencies more accurately by using an LSTM layer in its place. Only one convolutional layer is applied to extract the most important features in the document, with no pooling layers involved; the feature maps are then fed to the recurrent layer to capture long-term dependencies for classification, as shown in Figure 3.
Model Architecture
In this section, we present the details of our framework, which combines convolutional and recurrent neural networks. The model takes word embeddings as inputs and feeds them into a convolutional neural network that learns to extract high-level features; the CNN outputs are then given to a long short-term memory recurrent neural network, which is finally followed by a classification layer.
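A minimal PyTorch sketch of this pipeline is given below; the vocabulary size, filter width, number of filters, and hidden size are illustrative assumptions and not the exact hyperparameters used in our experiments.

```python
# Minimal sketch of the described pipeline: embedding -> one convolutional
# layer (no pooling) -> LSTM -> classifier. Sizes are illustrative only.
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, num_filters=128,
                 filter_width=3, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One convolutional layer extracts local n-gram features; no pooling.
        self.conv = nn.Conv1d(embed_dim, num_filters, filter_width)
        self.relu = nn.ReLU()
        # The LSTM replaces the pooling layer and reads the feature maps
        # left to right to capture long-term dependencies.
        self.lstm = nn.LSTM(num_filters, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embed(tokens)                      # (batch, seq_len, embed_dim)
        x = self.relu(self.conv(x.transpose(1, 2))) # (batch, filters, seq_len-w+1)
        _, (h_n, _) = self.lstm(x.transpose(1, 2))  # final hidden state
        return self.fc(h_n[-1])                     # class logits

logits = ConvLSTMClassifier()(torch.randint(0, 10000, (4, 50)))
print(logits.shape)                                 # torch.Size([4, 2])
```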
A. The Embedded Layer
The first layer of the network transforms words into real-valued feature vectors that capture semantic and syntactic information. Our model's input is a sequence of words, each of which is mapped, via a lookup table, to its corresponding embedding vector.
B. The Convolutional Layer
The model architecture shown in Figure 2 is a slight variant of the CNN architecture of [31].
Figure 3 compares the proposed CNN-LSTM architecture to a traditional CNN-RNN architecture with max pooling.
Let each word in the sentence be represented by a k-dimensional word vector x_i. A sentence of length n is then represented as the concatenation \begin{equation} x_{1:n} =x_{1} \oplus x_{2} \oplus \ldots \oplus x_{n}, \end{equation} where ⊕ denotes the concatenation operator. A convolution filter W, applied to a window of h words, produces a new feature \begin{equation} c_{i} =f(W\cdot x_{i:i+h-1} +b), \end{equation} where b is a bias term and f is a nonlinear activation function such as tanh. Applying this filter to every possible window of h words in the sentence yields the feature map \begin{equation} c=[c_{1},c_{2},\ldots,c_{n-h+1}]. \end{equation}
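The following numpy fragment transcribes these three equations for a single filter, using hypothetical sizes and random values purely to make the shapes explicit.

```python
# Numpy transcription of the convolution equations above for one filter:
# c_i = f(W . x_{i:i+h-1} + b), c = [c_1, ..., c_{n-h+1}]. Sizes are
# hypothetical and the values are random.
import numpy as np

n, k, h = 10, 100, 3                       # words, embedding dim, window size
rng = np.random.default_rng(0)
X = rng.normal(size=(n, k))                # sentence matrix, one row per word
W = rng.normal(size=(h, k))                # one convolution filter
b = 0.1
f = np.tanh                                # nonlinear activation

c = np.array([f(np.sum(W * X[i:i + h]) + b)   # feature c_i from window i..i+h-1
              for i in range(n - h + 1)])
print(c.shape)                             # (8,) = n - h + 1 features
```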
C. The Recurrent Layer
An RNN is a type of neural network architecture used specifically for sequence modeling. At each time step t, the hidden state h_t is computed from the current input x_t and the previous hidden state h_{t-1}: \begin{equation} h_{t} =f(Wx_{t} +Uh_{t-1} +b) \end{equation} where W and U are weight matrices, b is a bias term, and f is a nonlinear activation function. A plain RNN of this form is difficult to train over long sequences because of vanishing and exploding gradients, so we adopt the LSTM variant, whose gates at each time step are computed as:
\begin{align} i_{t}=&~\sigma (W^{i}x_{t} +U^{i}h_{t-1} +b^{i}) \\ f_{t}=&~\sigma (W^{f}x_{t} +U^{f}h_{t-1} +b^{f}) \\ o_{t}=&~\sigma (W^{o}x_{t} +U^{o}h_{t-1} +b^{o}) \\ g_{t}=&~\tanh (W^{g}x_{t} +U^{g}h_{t-1} +b^{g}) \\ c_{t}=&~f_{t} \odot c_{t-1} +i_{t} \odot g_{t} \\ h_{t}=&~o_{t} \odot \tanh (c_{t}) \end{align} where σ is the logistic sigmoid, ⊙ denotes the element-wise product, and i_t, f_t, o_t, g_t, and c_t are the input gate, forget gate, output gate, candidate cell state, and cell state, respectively.
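A numpy sketch of one LSTM step following these gate equations is given below (hypothetical sizes, random weights; the element-wise product is written as *).

```python
# One LSTM step matching the gate equations above (hypothetical sizes,
# random weights; * is the element-wise product).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 128, 64
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "ifog"}
U = {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "ifog"}
b = {g: np.zeros(d_h) for g in "ifog"}

x_t = rng.normal(size=d_in)                     # current input (e.g., a feature-map slice)
h_prev, c_prev = np.zeros(d_h), np.zeros(d_h)   # previous hidden and cell states

i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell state
c_t = f_t * c_prev + i_t * g_t                           # new cell state
h_t = o_t * np.tanh(c_t)                                 # new hidden state
print(h_t.shape)                                         # (64,)
```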
D. Back Propagation Through Time
Back propagation through time (BPTT) is the key algorithm that makes training deep recurrent models computationally tractable; it is a way of computing gradients of expressions through the recursive application of the chain rule. The core problem is the following: we are given some scalar loss function of the network parameters, and we need to compute its gradient with respect to those parameters by unrolling the recurrent network over time and propagating the error backward through every time step.
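The sketch below illustrates the idea with PyTorch autograd: a tiny RNN is unrolled over a few time steps and a single backward call propagates the gradient of a stand-in scalar loss through every step; the sizes and data are arbitrary.

```python
# Illustrative BPTT: unroll a tiny RNN and let autograd apply the chain rule
# backward through all time steps (random data, stand-in loss).
import torch

d_in, d_h, T = 8, 4, 6
W_x = torch.randn(d_h, d_in, requires_grad=True)
W_h = torch.randn(d_h, d_h, requires_grad=True)

h = torch.zeros(d_h)
for t in range(T):                          # forward pass, unrolled in time
    x_t = torch.randn(d_in)
    h = torch.tanh(W_x @ x_t + W_h @ h)

loss = h.pow(2).sum()                       # stand-in scalar loss
loss.backward()                             # gradients flow back through all T steps
print(W_h.grad.shape)                       # torch.Size([4, 4])
```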
E. The Classification Layer
The classification layer is, in principle, a logistic regression classifier. Given a fixed-dimensional input from the lower layer, the classification layer applies an affine transformation followed by a softmax activation function to compute the predictive probabilities for all K categories [35]. This is done by: \begin{equation} p(y=k\vert X)=\frac {\exp (w_{k}^{T} x+b_{k})}{\sum \limits _{j=1}^{K} {\exp (w_{j}^{T} x+b_{j})} } \end{equation}
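A direct numpy transcription of this classification layer, with hypothetical dimensions and random weights, is shown below.

```python
# Numpy version of the softmax classification layer above
# (hypothetical dimensions, random weights).
import numpy as np

K, d = 5, 128                                  # number of classes, input size
rng = np.random.default_rng(0)
W, b = rng.normal(size=(K, d)), np.zeros(K)
x = rng.normal(size=d)                         # fixed-dimensional input from the lower layer

scores = W @ x + b                             # w_k^T x + b_k for every class k
probs = np.exp(scores - scores.max())          # subtract max for numerical stability
probs /= probs.sum()
print(probs.sum(), probs.argmax())             # 1.0 and the predicted class
```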
Experimental Setup
A. Sentiment Analysis Datasets
The performance of the proposed model was evaluated on two benchmark sentiment analysis datasets: the Stanford Large Movie Review dataset (IMDB) and the Stanford Sentiment Treebank dataset (SSTb) [31], derived from Rotten Tomatoes movie reviews [36].
B. Stanford Large Movie Review Dataset (IMDB)
The IMDB dataset was first proposed by [32] as a benchmark for sentiment analysis. It consists of 50,000 binary labeled reviews; the reviews are divided into 50:50 training and testing sets. The distribution of labels with each subset of data is balanced. We used 15% of the labeled training documents as a validation set. One key aspect of this dataset is that each review has several sentences.
C. Stanford Sentiment Treebank Dataset (SSTb)
The SSTb dataset was first proposed by [36] and extended by [31] as a benchmark for sentiment analysis. It consists of 11,855 reviews taken from the movie review site Rotten Tomatoes, with one sentence for each review. The SSTb was split into three sets: 8544 sentences for training, 2210 sentences for testing, and 1101 sentences for validation (or development). The SSTb also includes fine-grained sentiment labels. In Table 1, we present additional details about the two benchmark datasets.
D. Hyperparameters and Training
We used stochastic gradient descent (SGD) to train the network and the back-propagation algorithm to compute the gradients. We believe that by adding a recurrent layer to the model as an alternative to the pooling layer, we can effectively reduce the number of convolutional layers needed to capture long-term dependencies. Therefore, we merge convolutional and recurrent layers into one single model. The goal of our architecture is to reduce the need for stacking multiple convolutional and pooling layers in the network, and thereby reduce the loss of detailed local information. Thus, the proposed model uses only one convolutional layer, whose feature maps are fed directly to the recurrent layer.
E. Regularization
For regularization, we employ dropout, an effective method for regularizing deep neural networks. Dropout prevents the co-adaptation of hidden units. We apply it together with a constraint on the L2-norms of the weight vectors [37], and we insert dropout modules between the CNN and LSTM layers to regularize them.
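A possible PyTorch realization of this scheme is sketched below: a dropout module placed between the CNN output and the LSTM input, plus a function that rescales weight rows whose L2 norm exceeds a cap. The dropout rate and norm cap are illustrative assumptions, not the values used in our experiments.

```python
# Sketch of the regularization described above: dropout between the CNN and
# LSTM, and an L2 max-norm constraint applied to classifier weight rows.
# Rate and norm cap are illustrative assumptions.
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)            # inserted between CNN and LSTM layers
fc = nn.Linear(128, 2)

def constrain_l2(weight, max_norm=3.0):
    # Rescale any row whose L2 norm exceeds max_norm back onto the norm ball.
    with torch.no_grad():
        weight.copy_(torch.renorm(weight, p=2, dim=0, maxnorm=max_norm))

feature_maps = torch.randn(4, 48, 128)   # stand-in CNN output (batch, steps, filters)
lstm_input = drop(feature_maps)          # dropout applied before the LSTM
constrain_l2(fc.weight)                  # e.g., called after each optimizer step
```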
F. Unsupervised Learning of Word-Level Embeddings
Initializing word vectors with those obtained from an unsupervised neural language model is a popular method to improve performance in the absence of a large supervised training set [35], [38]. It has recently been shown that model accuracy can be improved by using unsupervised, pre-trained word embeddings. In our experiments, we utilized the publicly available word2vec vectors that were trained on 100 billion words from Google News using the continuous bag-of-words algorithm [7]. Word embeddings play an important role in our model: they capture syntactic and semantic aspects of the words they represent, which is very significant for sentiment analysis; however, they carry no notion of the words' sentiment behavior.
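One way to realize this initialization is sketched below with gensim and PyTorch; the file name of the pretrained Google News vectors, the fallback random-initialization range, and the toy vocabulary are assumptions made purely for illustration.

```python
# Initializing an embedding layer from pretrained word2vec vectors
# (the .bin file must be downloaded separately; path is an assumption).
import numpy as np
import torch
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

vocab = ["movie", "boring", "brilliant", "plot"]            # toy task vocabulary
dim = vectors.vector_size                                   # 300 for this model
matrix = np.stack([vectors[w] if w in vectors               # pretrained if available,
                   else np.random.uniform(-0.25, 0.25, dim) # else random init
                   for w in vocab])

embedding = torch.nn.Embedding.from_pretrained(
    torch.tensor(matrix, dtype=torch.float), freeze=False)  # fine-tuned during training
```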
Empirical Results and Analysis
A. Optimization
Training was done through stochastic gradient descent over shuffled mini-batches. For training and validation, we randomly split the full set of training examples; the size of the validation set is the same as the corresponding test set, and it is balanced across classes. We trained the model by minimizing the negative log-likelihood (cross-entropy) loss, and early stopping was utilized to prevent overfitting. We employed unsupervised learning of word-level embeddings using word2vec, which implements the continuous bag-of-words and skip-gram architectures for computing vector representations of words. We validated the proposed model on two datasets while considering the difference in the number of parameters. We found that the accuracy of the model does not simply increase with the number of convolutional layers, and that more pooling layers typically lead to the loss of long-term dependencies. Therefore, in our model, we removed the pooling layer from the convolutional network and replaced it with a recurrent layer to reduce the loss of local information; one recurrent layer is enough to capture long-term dependencies in the input sequence.
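The following is a schematic version of this training procedure (shuffled mini-batches, SGD, cross-entropy loss, early stopping on validation loss); a plain linear model and random tensors stand in for the actual CNN-LSTM network and datasets.

```python
# Schematic training loop: shuffled mini-batches, SGD, cross-entropy loss,
# and early stopping on validation loss. Data and model are stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(100, 2))          # stand-in for the CNN-LSTM model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()                 # negative log-likelihood over classes

train = TensorDataset(torch.randn(256, 100), torch.randint(0, 2, (256,)))
valid = TensorDataset(torch.randn(64, 100), torch.randint(0, 2, (64,)))
loader = DataLoader(train, batch_size=32, shuffle=True)   # shuffled mini-batches

best, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(50):
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item()
                       for x, y in DataLoader(valid, batch_size=64))
    bad_epochs = 0 if val_loss < best else bad_epochs + 1
    best = min(best, val_loss)
    if bad_epochs >= patience:                    # early stopping
        break
```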
B. Analysis of the Stanford Sentiment Treebank Dataset
For the Stanford Sentiment Treebank dataset (SSTb), we performed several experiments to offer a fair comparison with competitive models, following the experimental protocols described in [31]. To make use of the available labeled data, our model treats each sub-phrase as an independent sentence, and we learn the representations for all of the sub-phrases in the training set. We initialized the word vectors with unsupervised word-level embeddings learned by the word2vec algorithm, which implements the continuous bag-of-words and skip-gram architectures for computing vector representations of words. The positive/negative setting reports results for the binary classification of sentences, and the fine-grained setting reports results for the case where five sentiment classes are used (positive, very positive, negative, very negative, and neutral). We report the accuracy of the different methods in Table 3. The primary highlight of our result on the SSTb benchmark is that traditional methods (SVM, NB, BiNB) with bag-of-words features perform poorly compared to our proposed deep learning model; we observed a 4%–12% absolute improvement in accuracy over the baseline methods proposed in [39]. Initializing the word embeddings with unsupervised, pre-trained vectors increased the model's absolute accuracy by around 8% compared to randomly initialized vectors in a CNN-only architecture [9]. Our model does not require pooling layers, which leads to more efficient capture of local information compared to the networks proposed in [12] and [18]. The best previous results on SSTb were reported by [31] and [40]. Our approach provides a 4% improvement in accuracy over the RNTN method, and an 8% improvement over the matrix-vector RNN. On the fine-grained classification task, our method obtains an absolute improvement of 7% in accuracy. Figures 4 and 5 show that the bag-of-n-words models (NB, SVM, BiNB) perform poorly on both the binary and fine-grained SSTb tasks. A similar model was proposed in [41] and achieved better accuracy; however, that model has more hyperparameters and requires subsampling layers. On the other hand, our proposed model performed very competitively and came close to matching other state-of-the-art algorithms on both the binary and fine-grained sentiment analyses of the SSTb dataset with fewer parameters.
C. Analysis of the IMDB Dataset
Unlike SSTb, each movie review in the IMDB dataset consists of several sentences. The results of our method on the IMDB benchmark dataset are reported in Table 4 and compared to other approaches. Reference [31] applied several methods to the IMDB dataset and found that their Recursive Neural Tensor Network worked much better than a bag-of-words model; however, that model requires parsing and takes compositionality into account. Our method performs better than all of the baselines reported in [39]: MNB-uni, MNB-bi, SVM-uni, SVM-bi, NBSVM-uni, and NBSVM-bi, with an approximate improvement of 2%–12% in terms of accuracy.
Compared with a combined Restricted Boltzmann Machines model [42], bag-of-words, and WRRBM + BoW (bnc), we achieved a 4%–7% relative improvement, and a 1%–6% improvement compared with BoW (bnc), Full+Unlabeled+BoW, and the paragraph vector [43]. The paragraph vectors proposed in [44] achieved a state-of-the-art result on the IMDB dataset; however, the model has a reputation for being extremely difficult to tune and requires a downsampling parameter to reduce the feature-map dimensionality for computational efficiency. We found that our proposed architecture, with no downsampling layer, achieved competitive results on the IMDB dataset, as shown in Figure 6. The CNN-RNN with max pooling loses detailed local features because of the pooling layers in its architecture. Comparing with the existing methods and experimental results, we found that our approach takes advantage of both the CNN and RNN models for the sentiment classification of short texts.
Our experimental results suggest that by using an LSTM layer on top of a CNN architecture, one can effectively reduce the number of convolutional layers needed to capture long-term dependencies. Furthermore, we observed that many factors affect the performance of deep learning models, such as the dataset size, vanishing or exploding gradients, and the choice of feature extractors and classifiers, all of which are still open research areas; there is no single best model for all types of datasets.
D. Overview
The challenge in NLP is to develop an architecture that can learn a hierarchical representation of the whole sentence jointly with the task. Convolutional neural networks consider feature extraction and classification as one jointly trained task. The idea of CNNs has been improved upon recently [9], [13], [16]–[18] by using multiple convolutional and pooling layers to sequentially extract a hierarchical representation of the input. Reducing the network size has been the focus of several works: more compact layers have been used, for example by replacing the fully connected layers with average pooling [15]; in [16], the weights are constrained to be binary, which considerably reduces memory consumption; and to design a simpler network, [15] removed redundant connections and allowed weight sharing. In our work, we conducted a series of experiments with both deep learning and traditional methods to offer a fair comparison with competitive models on sentiment analysis benchmark datasets, and we did our best to select architectures that would deliver comparable and competitive results. Although the CNN-RNN proposed in [41] has a slightly higher classification accuracy than our proposed model, we argue that this is due to its use of max pooling over adjacent words. Our proposed architecture is simpler and more efficient in terms of layers; moreover, our model has significantly fewer parameters, which means less memory consumption. We reported very competitive accuracy in comparison to the model proposed in [41]. The reported results show that, compared to the currently most popular LSTM, CNN, and CNN-LSTM methods, our proposed framework can achieve similar or even better performance on sentiment analysis tasks.
Conclusion
Convolutional neural networks (CNNs) learn to extract higher-level features that are invariant to local translation. Despite this advantage, a CNN requires many convolutional layers to capture long-term dependencies, due to the locality of its convolutional and pooling layers; this becomes more severe as the length of the input sequence grows and ultimately leads to the need for a very deep network with many convolutional layers. In this article, we presented a new framework to overcome this problem. In particular, we aimed to capture sub-word information and reduce the number of parameters in the architecture. Our framework jointly combines a CNN and a recurrent neural network (RNN) on top of unsupervised, pre-trained word vectors; the recurrent layer is expected to preserve ordering information even as a single layer. Thus, we exploited a recurrent layer as a substitute for the pooling layer, to reduce the loss of detailed local information and capture long-term dependencies more efficiently.
Our approach performed well on two benchmark datasets and achieved a competitive classification accuracy while outperforming several other methods. Our results demonstrated that it is possible to use a much smaller architecture to achieve the same level of classification performance. It will be interesting to see future research on applying the proposed method to other applications such as information retrieval or machine translation.