Multi-Task Learning Model Based on Multi-Scale CNN and LSTM for Sentiment Classification

Sentiment classification is an interesting and crucial research topic in the field of natural language processing (NLP). Data-driven methods, including machine learning and deep learning techniques, provide a direct and effective solution to the sentiment classification problem. However, classification performance declines when the input includes review comments from multiple tasks. The most appropriate way of constructing a sentiment classification model under multi-task circumstances remains an open question in the field. In this study, aiming at the multi-task sentiment classification problem, we propose a multi-task learning model based on a multi-scale convolutional neural network (CNN) and long short-term memory (LSTM), named MTL-MSCNN-LSTM. The model comprehensively utilizes and properly handles global features and local features of different scales of text to model and represent sentences. The multi-task learning framework improves the encoder quality and thereby the sentiment classification results. Six different types of commodity review datasets were employed in the experiments, with accuracy and F1-score as the evaluation metrics. Compared with methods such as single-task learning and the LSTM encoder, the proposed MTL-MSCNN-LSTM model outperforms most of the existing methods.


I. INTRODUCTION
In the information era, Internet technology has greatly influenced human lives [1]. Review comments are available on the Internet for almost every service or product in our daily life. The sentiments included in these review comments have become increasingly important for merchants to analyze in order to expand their business [2]. On one hand, user reviews reflect the quality and problems of products, which is important information for merchants. On the other hand, the number of review reports can be tremendous and they come from different perspectives, making it difficult to summarize them manually. As a traditional research topic in the field of natural language processing (NLP), sentiment classification aims at mining users' attitudes and perceptions (such as positive or negative) from text descriptions generated by users [3]. Therefore, sentiment analysis is also called opinion mining or point-of-view analysis, and it has been widely studied in e-commerce sites, social networks and other fields [4].
Over the years, the methods of sentiment classification have undergone many changes, from the initial sentiment-dictionary-based methods [5] to machine learning methods such as Support Vector Machine (SVM), Naive Bayes (NB), decision tree, random forest and logistic regression [6], [7]. Although some machine learning methods can achieve good results on some tasks, due to the complexity of feature engineering [8], their effectiveness depends heavily on feature representation, and it is difficult for them to achieve acceptable classification results [9], [10]. With the popularity of deep learning, many deep learning methods have been applied to sentiment classification tasks [11]-[13]. Compared with machine learning methods, deep learning does not need manually extracted features. However, the robustness and generalization capability of deep learning models largely depend on the amount of data available in the training phase.
With the development and maturity of deep learning methods, various structured deep learning networks have been proposed for different applications [14]- [17]. In order to improve the effectiveness of the sentiment classification task, this paper proposes a model based on Multi-scale CNN and LSTM. It treats the review text of different commodities differently, while retaining the features that can be shared between each text, and combines the local features and temporal features (global features) of the text to improve the classification effect. The model consists of two main parts. The first part is a multi-task learning framework, which is used to capture shared features (that is, features independent of the type of commodity) and private features (that is, features closely related to the type of commodity) from reviews of different types of commodity. More specifically, it provides a private encoding scheme for each type of commodity comment, a shared encoding scheme for all comments, and a separate classifier for each commodity comment. The second part is the sentence encoder. Since the input of the sentiment classification model is represented by the combination of word embeddings [18], the model needs to extract the text features from it to obtain the sentence representation and then carry out the classification task. We use a multi-scale CNN network to extract the local features of different levels of sentences, and LSTM network to obtain the global features of sentences, and fuse the local features and global features of sentences to generate the sentence representation.
The main contributions of our study can be summarized as follows:
1) The multi-task learning (MTL) approach has been adopted to conduct sentiment analysis for reviews of various types of commodities. Compared to the single-task learning (STL) approach, MTL uses an extra shared encoder to help improve the classification performance.
2) The shared encoder in MTL has been customized with a multi-scale CNN (MSCNN) network combining the LSTM network and the Fusion-Net. The improved shared encoder, namely MSCNN-LSTM, largely improves sentiment analysis in multi-task conditions.
3) A comprehensive comparative study has been conducted on the proposed method against the existing sentiment analysis approaches available in the literature. The experimental results demonstrate that the proposed multi-task multi-scale CNN-LSTM (MTL-MSCNN-LSTM) framework outperforms all compared methods.
The remaining sections are arranged as follows: Section 2 introduces the related work. Section 3 introduces our method in detail. Section 4 presents the comparison experiments. Section 5 summarizes the article and outlines future work.

II. RELATED WORKS
As an important subject in the NLP field, sentiment classification has always been an interesting and crucial research topic [6], [19], [20]. With the increasing popularity of deep learning, an increasing number of NLP tasks are based on word embeddings [21]. In 2013, Google released word2vec, a tool for computing word vectors [22], [23]. It maps words to a low-dimensional vector space, reduces the cost of calculation, and makes words related to each other rather than independent. It is a good distributed representation method [24], which overcomes the defects of discrete one-hot encoding representations.
The purpose of the sentiment classification task is to obtain the sentiment polarity contained in a sentence. Since the information contained in individual words is not complete, a sentence encoder is needed to extract the features of the sentence and generate its vector representation. Convolutional neural networks (CNN) were originally applied in the field of image processing, but they have also been widely applied in NLP in recent years [11], [25] and have achieved good results. The network structure of incomplete connection and weight sharing reduces the complexity of the model. The convolutional layer makes CNN good at capturing spatial local correlation, and the pooling layer greatly reduces the computation [26]. The appearance of word2vec made it possible to apply CNN to text, enabling CNN to obtain local information in the text and conduct sentiment classification [27]. Zhou et al. proposed a goal-driven deep learning technique exploring user correlations in the task-oriented process [28]. Attardi and Santos proposed a char-CNN model, a two-layer convolutional network extracting relevant features from text sentences [29]. Similar works utilized CNN extensions to classify sentiment and completed multi-class classification tasks with good results [30]-[32].
It is not rigorous to determine the sentiment polarity of a sentence by considering only its local features; the global features of the sentence are also an important indicator of sentiment polarity. Recurrent neural networks (RNN) are suitable for extracting features from sequence-type data, and the long short-term memory (LSTM) network is an improvement on RNN [33]. Although RNN is suitable for processing time-series information, it suffers from the problem of gradient explosion or gradient disappearance [34], [35]. To solve this problem, LSTM introduces a gating mechanism into the traditional RNN structure to support long-term dependence of sequence information [36], [37]. In natural language tasks, since text itself has the property of a sequence, LSTM is theoretically well suited for processing it. Many works have also proved that LSTM can effectively extract textual semantic information [38]-[40]. Tang et al. used two LSTM networks to capture sentiment characteristics from the front and back context of an aspect and strung them together to predict the sentiment polarity of that aspect [41]. Ren et al. used LSTM combined with subject features to extract features of Twitter short texts and employed a bidirectional LSTM network for sentiment classification [42].
It is evident that the amount of training data affects the quality of a deep learning model, which is also true for the sentiment classification task. Multi-task learning (MTL) is an effective method to improve the performance of sentiment analysis when there is not enough training data for any single task but datasets from related tasks are available [43]. Various MTL architectures are available and have been applied to different areas of NLP [44]-[46]. Liu et al. adopted the adversarial multi-task learning (AMTL) framework for multi-task text classification [47]. Luong et al. used the multi-task learning method for machine translation tasks and achieved good results [48]. Yousif et al. completed the task of citation sentiment classification with a fully-shared multi-task learning framework [49]. Lu et al. utilized multi-task learning combined with a VAE and achieved good results for multi-task sentiment classification [50].

III. METHODOLOGY
We propose MTL-MSCNN-LSTM to perform the sentiment analysis task on commodity review text, which achieves good results. The overall structure of the model is shown in Fig. 1. We describe the proposed deep learning framework in detail in the rest of this section.

A. MULTI-TASK LEARNING FRAMEWORK
Multi-task learning shows high efficiency in many natural language tasks. Although the goal of every task here is sentiment classification, due to the semantic differences between different types of commodity reviews, we consider the multi-task learning method applicable as well. Our model adopts the adversarial multi-task learning (AMTL) framework [47] and improves the structure of its encoder. We introduce the adversarial multi-task learning framework in detail in this part.
Adversarial multi-task learning is based on shared-private multi-task learning (SP-MTL), as shown in Fig. 2. In SP-MTL, the sentences in each task are encoded in two feature spaces: a shared space and a private space. Shared encoders are used to extract sentence features that are independent of tasks, while private encoders extract sentence features that are more relevant to their tasks. For a sentence x_k in each task k, the shared feature S_k and private feature P_k of the sentence are obtained through the two encoders, and the shared representation s_k and private representation p_k of the sentence are obtained after pooling. Inspired by the generative adversarial network (GAN) [51], adversarial multi-task learning tries to make the shared space contain less information from the private spaces, and adds a task discriminator after the shared encoder. It maps the shared representation of the sentence to a probability distribution to determine which task the sentence comes from, as shown in Equation (1). During training, the shared encoder tries to prevent the discriminator from correctly judging the task, ensuring that the shared representation of the sentence is task-independent, while the discriminator tries its best to identify which task the current sentence representation comes from. The discriminator loss L_D is included in the loss function, as shown in Equation (2).
where U ∈ R^(d×d) is a learnable parameter matrix, b ∈ R^d is the bias, and θ_D denotes the parameters of the discriminator; d_i^k represents the true task label of the current sample and θ_s denotes the parameters of the shared encoder.
While ensuring that the shared features of a sentence include as few private features as possible, it is also necessary to ensure that the private features of the sentence do not include shared features. To this end, the adversarial multi-task learning framework introduces orthogonality constraints to penalize redundant features and guide the encoders to extract features of different aspects. Accordingly, the loss function contains the orthogonality constraint loss term L_O, as shown in Equation (3).
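The bodies of Equations (1)-(3) are not legible in this copy. Based on the AMTL framework of Liu et al. [47] that this section follows, they take the following standard form; this is a hedged reconstruction, not necessarily the authors' exact notation:

```latex
% (1) task discriminator over the shared representation s_k
D(s_k;\,\theta_D) = \mathrm{softmax}\!\left(U s_k + b\right)

% (2) adversarial discriminator loss: the shared encoder minimizes what the
%     discriminator maximizes, over K tasks with N_k samples each
L_D = \min_{\theta_s} \max_{\theta_D}
      \sum_{k=1}^{K} \sum_{i=1}^{N_k} d_i^{k} \log\!\left[ D\!\left(s_k^{\,i}\right) \right]

% (3) orthogonality constraint between shared features S_k and private features P_k
L_O = \sum_{k=1}^{K} \left\lVert S_k^{\top} P_k \right\rVert_F^{2}
```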
The main purpose of the multi-task learning method is to improve the performance of a single task with the help of other related tasks. The reason why we adopt the multi-task learning method as the base of our proposed model is that we take into account the semantic differences between different types of commodity review data. These differences might become noise that interferes with the training process, which undoubtedly increases the learning difficulty for text encoders and classifiers. The purpose of introducing multi-task learning is to distinguish the review data of different types of commodities. A shared encoder is employed to learn the common features (shared features) between reviews from different tasks. The advantage of using the shared encoder is that, since all data samples pass through it, the shared encoder ultimately learns richer semantic features compared to the individual encoders in single-task learning. The task discriminator set by the adversarial multi-task learning framework ensures that the learning content of the shared encoder is as independent of the data source as possible. On the other hand, we set up a private encoder for each type of commodity review data to learn the semantic features related to the source of the review (private features). The orthogonality constraint makes the learning content of each private encoder as relevant to its task as possible. Finally, we set up a classifier for each task. The input of the classifier is a sentence representation obtained from the shared and private features of the review, and the output is the classification result.
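As a concrete illustration of this shared-private layout, the PyTorch sketch below wires one shared encoder, a private encoder and classifier per task, and a task discriminator. All class and parameter names are ours, and the encoders are simplified to plain LSTMs rather than the paper's MSCNN-LSTM:

```python
import torch
import torch.nn as nn

class SharedPrivateModel(nn.Module):
    """Illustrative shared-private MTL skeleton (simplified, not the paper's exact model)."""

    def __init__(self, emb_dim, hid_dim, num_tasks, num_classes=2):
        super().__init__()
        # One shared encoder for all tasks ...
        self.shared = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # ... plus one private encoder and one classifier per task.
        self.private = nn.ModuleList(
            [nn.LSTM(emb_dim, hid_dim, batch_first=True) for _ in range(num_tasks)])
        self.classifiers = nn.ModuleList(
            [nn.Linear(2 * hid_dim, num_classes) for _ in range(num_tasks)])
        # Task discriminator over the shared representation (adversarial part).
        self.discriminator = nn.Linear(hid_dim, num_tasks)

    def forward(self, x, task_id):
        # x: (batch, seq_len, emb_dim) word-vector sequences for one task
        _, (s, _) = self.shared(x)            # shared representation s_k
        _, (p, _) = self.private[task_id](x)  # private representation p_k
        s, p = s.squeeze(0), p.squeeze(0)
        # Task-specific classifier sees the concatenated shared + private features.
        logits = self.classifiers[task_id](torch.cat([s, p], dim=-1))
        task_logits = self.discriminator(s)   # input to the adversarial loss
        return logits, task_logits
```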
It should be noted that in order to make the general structure of the model clearer, we did not give the details of the adversarial multi-task framework in Fig. 1, but chose to put it in this part for a detailed description. Similarly, the specific structure of the sentence encoder will be given in the next part.

B. SENTENCE ENCODER
The original adversarial multi-task learning framework uses LSTM encoder to encode sentences. We believe that this only considers the global features of the sentence, and ignores the local features of the sentence to some extent. To this end, we use multi-scale convolution combined with LSTM to encode sentences, using not only the global features of the sentence for classification but also the local features of different scales of the sentence. The structure of our encoder is shown in Fig. 3. As can be seen from Fig. 3, our proposed encoder extracts local text features of different scales from a multi-scale CNN encoder, and simultaneously extracts global features of the text from an LSTM encoder, and fuses the two features to generate a completely private or shared sentence representation.

1) LSTM EXTRACTS GLOBAL FEATURE
Our encoder uses LSTM to extract the global features of a sentence and obtain its global representation. Compared with RNN, LSTM adds three gating mechanisms, namely the forget gate, the input gate and the output gate. At the same time, it introduces a cell state that controls which dependent information is retained, which effectively avoids the problems of gradient explosion and gradient disappearance. Its structure is shown in Fig. 4.
The existence of the forget gate is to determine the degree of forgetting of the information flow before the current cell. The calculation is shown in Equation (4). The function of the input gate is to determine how much current information is added to the information flow. The calculation is shown in Equations (5) and (6).
After the information passes through the input gate and the forget gate, the LSTM updates the cell state, calculates the output of the current LSTM cell and transfers it to the next LSTM cell. The calculation is shown in Equation (7). The output gate combines the current input and the cell state to determine the output of the current LSTM cell, as shown in Equations (8) and (9). The work of our encoder to extract the global features of the sentence and obtain the global representation is shown in Fig. 5a. Each word in the sentence is represented by a word vector trained by word2vec. The first word vector is input to the first LSTM cell, whose output captures the sentence feature of the current time-step and is passed to the next LSTM cell. The same holds for each subsequent word vector until all words have passed through the LSTM cells and produced their corresponding sentence features. Since the input of the current time-step of the LSTM includes the output of the previous time-step, the output of the last time-step can be used as the global representation of the sentence, as shown in Fig. 5b.
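The bodies of Equations (4)-(9) are not legible in this copy; the standard LSTM formulation they describe (forget gate, input gate with candidate state, cell update, output gate) is:

```latex
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(4) forget gate} \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(5) input gate} \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(6) candidate state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(7) cell-state update} \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(8) output gate} \\
h_t &= o_t \odot \tanh\!\left(C_t\right) && \text{(9) cell output}
```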

2) MULTI-SCALE CNN EXTRACTS LOCAL FEATURE
CNN captures local features of sentences by scanning parts of the text with a convolution kernel. Considering phrases, transitions, and other factors in the sentence, we choose convolution kernels of sizes 3, 4, and 5 to extract the local features of the sentence at different scales. The calculation is shown in Equation (10), where F represents a convolution kernel of size r × k. In our experiment, r is set to 3, 4 and 5, and k is the dimension of the word vector, set to 256. f represents the ReLU activation function. V(w_(i:i+r−1)) denotes the r word vectors from the i-th word to the (i+r−1)-th word in the sentence, and c_i^r represents the i-th local feature of the sentence extracted by the convolution kernel of width r. The work of our encoder to extract the local features of the sentence and obtain the local representations is shown in Fig. 6.
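A minimal PyTorch sketch of this multi-scale convolution, assuming the paper's kernel widths (3, 4, 5); the filter count is an illustrative choice of our own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleCNN(nn.Module):
    """Sketch of the multi-scale convolution over word vectors (cf. Eq. 10)."""

    def __init__(self, emb_dim=256, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        # Each Conv1d slides an r x k window over the sentence matrix.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=r) for r in widths])

    def forward(self, x):        # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)    # Conv1d expects (batch, emb_dim, seq_len)
        # ReLU activation f as in Eq. (10), then max-pool over positions i,
        # giving one pooled local representation per kernel width.
        return [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
```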

3) FUSION OF GLOBAL AND LOCAL REPRESENTATIONS OF SENTENCES
In the work described above, the word vectors have passed through the LSTM encoder and the multi-scale CNN encoder to obtain the global and local representations of the sentence. Next, the two outputs are concatenated to get the complete private or shared sentence representation. However, these two sentence representations are not the same size. Moreover, the output of the LSTM network passes through the tanh activation function, so the feature value of each dimension is controlled in the interval (−1, 1), while the activation function used in the CNN network is the ReLU function. For the subsequent classification task, the classifier should treat the sentence representations obtained by the two encoders equally. We therefore added a Fusion-Net after the sentence representations obtained by the CNNs at three different scales to ensure that the local and global information contained in the sentence representations are in an ''equal'' state, as shown in Fig. 7. In Fig. 7, it can be seen that dropout is added after the initial local representation of the sentence, which is a measure to prevent overfitting. Similarly, we also set dropout in the LSTM cells. In order to verify the effectiveness of the Fusion-Net, we set up comparative experiments in the experimental section. The fused local representation is concatenated with the global representation as the final output of the sentence encoder.
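The internal layout of the Fusion-Net is not fully specified here, so the sketch below is one plausible realization: dropout on the concatenated local features, followed by a linear projection with tanh so that the fused local representation matches both the size and the (−1, 1) range of the LSTM output. The layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Hypothetical Fusion-Net: bring the pooled CNN features into the same
    size and (-1, 1) range as the tanh-activated LSTM representation."""

    def __init__(self, n_filters=100, n_scales=3, hid_dim=256, p_drop=0.5):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)  # dropout on the raw local features
        self.proj = nn.Linear(n_scales * n_filters, hid_dim)

    def forward(self, local_feats, global_feat):
        # local_feats: list of (batch, n_filters) tensors, one per kernel width
        local = torch.cat(local_feats, dim=-1)
        # tanh puts the fused local features on the same footing as the LSTM output
        local = torch.tanh(self.proj(self.dropout(local)))
        # Final encoder output: fused local + global representation.
        return torch.cat([local, global_feat], dim=-1)
```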
Finally, the sentence representation obtained by combining the outputs of the shared encoder and the private encoder is sent to the Softmax classifier after a number of fully connected layers (set to 3 in the experiment) for dimensionality reduction. We define the loss of the classifier as Equation (12), and the final loss function is shown in Equation (13).
where λ and γ are hyperparameters, and our model is trained by SGD.
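The weighted combination of the three loss terms can be written as a one-line helper; the λ and γ defaults below are placeholders, not the paper's settings:

```python
def total_loss(l_task, l_adv, l_orth, lam=0.05, gamma=0.01):
    """Combined objective: task loss plus weighted adversarial (L_D) and
    orthogonality (L_O) terms; lam and gamma are the hyperparameters above."""
    return l_task + lam * l_adv + gamma * l_orth
```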

IV. EXPERIMENT

A. DATASETS AND METRICS
As shown in Table 1, dataset I used in this experiment consists of apparel, camera, electronics, housewares, magazines, and sports commodity reviews. There are about 2,000 reviews for each commodity, for a total of about 12,000 reviews. This is a binary classification task; each review text carries a sentiment label (Negative or Positive). These data were collected from the raw data provided by Blitzer et al. [52]. We divided each task into training, validation, and test sets in the ratio 7:1:2, and ensured that the numbers of positive and negative samples in each set did not differ much. The dataset statistics are shown in Table 2.
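A label-stratified 7:1:2 split of the kind described can be sketched as follows (pure Python; the sample structure and field names are hypothetical):

```python
import random

def split_712(samples, seed=0):
    """Shuffle and split one task's reviews 7:1:2 per sentiment label, so
    positives and negatives are roughly balanced in each set."""
    train, val, test = [], [], []
    rng = random.Random(seed)
    for label in ("Positive", "Negative"):
        group = [s for s in samples if s["label"] == label]
        rng.shuffle(group)
        n_train, n_val = int(0.7 * len(group)), int(0.1 * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remaining ~20%
    return train, val, test
```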
In addition, we collected four types of commodity review datasets of books, daily necessities, entertainment and media from raw data provided by Blitzer et al. [52] to participate in our experiments as dataset II. Dataset II has more data entries than dataset I, with a total of 23,742 entries. As with dataset I, each review text contains a sentiment label (Negative or Positive). We also divided the training set, validation set, and test set for dataset II, and ensured that the number of positive and negative samples in each set didn't differ much. Instances and statistics of dataset II are shown in Table 3 and Table 4.
In the experiments, we use the same evaluation criteria, accuracy and F1-score, for each commodity review dataset and each method. Our experiments were performed in the PyTorch environment.
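For the binary (Negative/Positive) setting, the two metrics reduce to the usual definitions; a plain-Python sketch:

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1-score for binary labels, with F1 computed for the
    positive class (harmonic mean of precision and recall)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1
```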

B. COMPARED TO STL AND LSTM ENCODER
In order to verify that the proposed model structure is meaningful, we compared the proposed model with the single-task method and the method using an LSTM encoder. The experimental results on dataset I are shown in Table 5, and the experimental results on dataset II are shown in Table 6.
Observing Tables 5 and 6, we can see that our proposed MTL-MSCNN-LSTM model performs best on both dataset I and dataset II. This is the result of the combined action of the multi-task learning framework and the MSCNN-LSTM encoder. The multi-task learning framework allows the model to learn common knowledge from related corpora and apply it in training the shared encoder. At the same time, the private encoder for each commodity review dataset and its independent classifier reduce the noise impact of the different texts. On the other hand, the MSCNN-LSTM encoder we designed considers not only the global features of the sentence but also the local features at different scales, and combines these two kinds of features to classify the sentiment of the reviews. In Tables 5 and 6, the experimental results of MSCNN-LSTM are better than those of LSTM in both the STL and MTL frameworks. Only on the Media dataset, and only in the STL case, is the performance of the MSCNN-LSTM encoder slightly worse than that of the LSTM encoder. This shows that its ability to extract text features is clearly more powerful than that of the LSTM encoder, with good generality. The MSCNN-LSTM encoder we designed has undoubtedly contributed to the improvement of the sentiment classification tasks.
However, in the experiments on dataset I, when LSTM is used as the encoder, the performance of both STL and MTL is not ideal, and some results even show STL outperforming MTL. Generally speaking, MTL is better than STL, but when the encoder is not optimal, the performance of MTL may be less stable. Especially when the performance of both is not ideal, even if STL is slightly better, the difference is not very meaningful. It is also related to the size of the dataset to some extent. When our encoder is changed to MSCNN-LSTM, the performance of MTL is significantly better than that of STL. This also shows, from another aspect, that the MSCNN-LSTM encoder is better than the LSTM encoder and improves the performance of the MTL framework.

C. MODEL SELF-COMPARISON
As mentioned in Section 3, the Fusion-Net was added after the local representation of the sentence encoded by the multi-scale CNN, so that the classifier treats the global and local representations of the sentence equally, and consequently to avoid the problem of gradient disappearance in the later fully connected layers. On this basis, we conducted a comparative experiment with our proposed MTL-MSCNN-LSTM model. Table 7 and Table 8 reflect the fact that the added Fusion-Net further improves the classification effect of the MTL-MSCNN-LSTM model.
After the encoder extracts the sentence features, these features pass through the pooling layer to produce a partial representation of the sentence. The structure of the LSTM encoder determines that its subsequent pooling layer needs to use the Max-pooling method. For the pooling layer after the MSCNN encoder, we conducted a comparison experiment between Max-pooling and Mean-pooling, as shown in Table 9 and Table 10.
According to Tables 11 and 12, the proposed MTL-MSCNN-LSTM model outperforms all compared state-of-the-art classification methods, in terms of both classification accuracy and F1-score. The comparative results shown in Tables 11 and 12 further prove that the research we conducted is meaningful and that our proposed method is effective and robust.

V. CONCLUSION, LIMITATION AND FUTURE WORKS
In this paper, we propose a multi-task learning method for sentiment analysis of different types of commodity reviews. Taking multi-task learning as the framework, we use LSTM and a multi-scale CNN to jointly perform sentence encoding, taking into account the global and local features of the text. In addition, the network structure of the CNN encoder has been customized to enhance the classification performance. We implemented model training, validation, and testing in the PyTorch environment. The experimental results show that the sentiment classification results of our proposed model are superior to those of existing state-of-the-art methods. This is because the model considers more comprehensive text features and properly handles them. In addition, the adversarial multi-task learning approach improves the encoding quality of the encoder.
It is noted that multiple deep learning techniques, including CNN and LSTM, were used in the proposed framework. Therefore, the efficiency of the proposed algorithm is not comparable with lighter methods such as Naive Bayes, SVM and KNN. However, we maintain that the sentiment classification accuracy has been largely improved, and the time complexity of the proposed method is not the main concern, since sentiment analysis can always be performed offline. As one of the future works, the encoder will be improved on the existing basis to further enhance the results of sentiment classification under multi-task learning. Moreover, we intend to perform multi-class sentiment analysis for other related NLP tasks.
NING JIN received the B.S. and master's degrees in information and electronic engineering from Zhejiang University, Hangzhou, Zhejiang, China, in 1988 and 1991, respectively. She is currently a Professor and the Dean of the Information Engineering College, China Jiliang University. Her research interests include intelligent systems, wireless networks, and signal processing.
JIAXIAN WU received the bachelor's degree from Wenzhou University. He is currently pursuing the master's degree with China Jiliang University. His research interests include artificial intelligence and natural language processing.