Transfer Correlation Between Textual Content to Images for Sentiment Analysis

In social media, images and texts are used to convey individuals' attitudes and feelings; thus, social media has become an indispensable part of people's lives. To understand social behavior and provide better recommendations, sentiment analysis on social media is helpful. One sentiment analysis task is polarity prediction. Although current research on visual or textual sentiment analysis has made good progress, multimodal and cross-modal analysis combining visual and textual correlation is still in the exploration stage. To capture the semantic connection between images and captions, this paper proposes a cross-modal approach that considers both images and captions in classifying image sentiment polarity. This method transfers correlation from textual content to images. First, the image and its corresponding caption are fed into an inner-class mapping model, where they are transformed into vectors in a Hilbert space to obtain their labels by calculating the inner-class maximum mean discrepancy (MMD). Then, a class-aware sentence representation (CASR) model assigns the distributed representation to the labels with a class-aware attention-based gated recurrent unit (GRU). Finally, an inner-class dependency LSTM (IDLSTM) classifies the sentiment polarity. Experiments carried out on the Getty Images dataset and the Twitter 1269 dataset demonstrate the effectiveness of our approach. Moreover, extensive experimental results show that our model outperforms baseline solutions.


I. INTRODUCTION
As social media thrives, analyzing the sentiments in tweets has attracted increasing attention from researchers. On social media platforms such as Twitter and Facebook, people share their daily lives with images and short texts. To understand social behavior and provide better service to the users, a fundamental task is sentiment polarity classification.
Many tweets consist of two parts: an image and a short text. Therefore, an accurate sentiment classifier must consider both parts, and multimodal or cross-modal methods should be applied. One main challenge of multimodal or cross-modal sentiment analysis is that different modalities have individual semantic features. For tweets, the image and the text may not be correlated, which has a great impact on classification accuracy.
Approaches to sentiment analysis have been explored in several studies. From a methodological perspective, existing sentiment analysis methods can be roughly divided into two categories [1]: traditional sentiment methods and hybrid sentiment methods.
Traditional sentiment analysis methods classify sentiment mainly by encoding the probability of words with sentiment relations, including keyword detection methods [2], classification and regression models [3], and semantic web methods [4]. Keyword detection approaches are the most widely used: sentiment polarity is determined by counting the sentiment words appearing in the corpus, such as happiness, sadness, and anxiety. However, such approaches fail to recognize irony because they only count sentiment words; since people often express their feelings euphemistically, keyword detection methods are valid only for a specific corpus. Classification and regression models include support vector machines (SVMs) [5], [6], Bayesian reasoning [7], [8], and artificial neural networks (ANNs) [9]. By training a classifier with a corpus as input, the sentiment intensity of a keyword can be obtained. These models still lack semantic correlation, however, in that other sentiment information related to the corpus is not considered. Methods based on the semantic web depend on ontologies or knowledge graphs [10]-[12]. These methods no longer rely on keywords or word frequencies; instead, they use large-scale semantic knowledge graphs to mine hidden features between the semantic concepts of corpora. Different from the other methods, semantic web models use semantic relations to reveal implicit sentiment and are often used on commercial websites. WordNet-Affect [13] and SenticNet [10] are representative large-scale sentiment knowledge graphs. Construction methods for these knowledge graphs include iterative regression based on common-sense graphs and inline-regularization random walk algorithms. The error rate of these knowledge graphs is reduced by similarity comparison and the average maximum rate, and the intensity value and polarity of all kinds of sentiment words are defined at the same time.
Hybrid sentiment analysis methods encode images and sentences into multidimensional distributed vectors, and then multimodal sentiment polarity is obtained by machine learning classification, which includes supervised, semisupervised and unsupervised learning [14]. The results of hybrid sentiment analysis are achieved mainly by image object feature extraction and multimodal fusion analysis. Among them, sentiment classification methods based on deep learning [15] and generative adversarial networks (GANs) [16] are the most popular.
However, visual sentiment semantics are high-level, implicit semantics, different from explicit textual expressions; therefore, textual sentiment analysis methods realized by natural language processing are one-sided and uncertain when applied to images. Research to date has thus tended to focus on cross-modal image sentiment analysis.
Studies on image sentiment analysis have mostly been restricted to limited comparisons of global image features, and experimental results remain controversial. In addition, different individuals attend to and perceive each region of an image differently. Accordingly, sentiment analysis on social media remains a difficult task. Scholars then realized that the mechanisms by which textual context underpins image semantics were not fully exploited, and cross-modal approaches were proposed. Cross-modal image sentiment analysis refers to methods that bridge the gap between visual semantics and sentiment with textual context. Cross-modal studies offer important insights into semantic fusion through transfer learning, mapping image features to textual labels; the mapped labels serve as image annotations and are used as prior knowledge in sentiment polarity classification. Nevertheless, the drawback of existing cross-modal image sentiment analysis methods is that they rely too heavily on the mapped labels to understand the correlation between image content and textual context.
To increase the understanding of image sentiment by exploring the correlation between visual content and textual context, this paper proposes a novel cross-modal model for image sentiment analysis. First, a fine-tuned convolutional neural network (CNN) [17] and GloVe [18] are used to extract the features of an image and its caption. Second, an inner-class mapping model taking the visual features as inputs calculates the inner-class maximum mean discrepancy (MMD) with the corresponding textual features in the same Hilbert space to obtain their correlations, and the correlations are then represented as labels. Furthermore, the corresponding textual description is embedded into a distributed representation by a class-aware attention-based gated recurrent unit (GRU), with redundant information filtered out. Third, an inline relationship between textual context and visual content is obtained by an attention-based long short-term memory network (LSTM) to estimate the final image sentiment polarity.
The main contributions of this paper include the following: (1) A novel cross-modal image sentiment analysis model is proposed. This model extracts visual features and uses them as the attention weight parameters of an LSTM to obtain the image-related context in the corresponding textual description (caption). The model can predict image sentiment polarity by utilizing semantically correlated descriptions.
(2) Different from the existing cross-modal sentiment analysis methods, this paper proposes an inner-class mapping method based on unsupervised maximum mean discrepancy (MMD), which attempts to learn cross-modal mapping correlations between images and descriptions.
(3) The end-to-end sentiment analysis algorithm is implemented in this paper. The experimental results show that the precision, F1 and accuracy are improved, and the proposed model outperforms other state-of-the-art image sentiment analysis methods on the Twitter1269 dataset. The feasibility and effectiveness of the model are also validated by a case study.
The remainder of this paper proceeds as follows. In Section 2, previous sentiment analysis research is reviewed, along with how representative cross-modal methods work. Section 3 describes the methodology used in this study. Section 4 analyses the experimental results and discusses some data and examples. Finally, Section 5 gives a brief conclusion.

II. RELATED WORK
Sentiment analysis is becoming increasingly important due to rising needs on social media platforms. A number of published studies describe the role of cross-modal research. Much of the cross-modal research has simply focused on identifying and evaluating algorithms based on textual features or visual features.

A. TEXTUAL SENTIMENT ANALYSIS
Previous textual feature-based studies used in cross-modal approaches have explored the connection between textual features and sentiment, such as topic word detection models [19] and sentence grammar layer models [20]. These methods have achieved remarkable results and provide a significant reference for image sentiment analysis. Approaches in the literature can be classified into two categories: aspect-based sentiment analysis (ABSA) and targeted sentiment analysis (TSA).

1) ASPECT-BASED SENTIMENT ANALYSIS
The task of ABSA is to classify sentiment polarity by analyzing words to obtain an aspect. For example, in ''this computer is very expensive'', ''price'' is the aspect. The main challenge of ABSA is to classify the sentiment polarity of a compound sentence with multiple aspect words.
Zainuddin et al. [21] proposed an aspect-based hybrid feature selection method for Twitter. Wang et al. [22] analyzed sentiment by merging the attention in a multilayer neural network. For each word in a sentence, the attention weight shows the most pivotal word in a sentence, and the association degree of a given aspect word is obtained after the dot product. The experimental results show that this method can reduce the training loss caused by the recurrent neural network (RNN), and the accuracy of this multilayer neural network classifier is much higher than that of a single output layer classifier. Liu et al. [23] proposed a model combining regional CNN and LSTM; this model retains content information and time-series relationships between sentences in the whole comment without additional dependency analysis.

2) TARGETED SENTIMENT ANALYSIS
TSA methods extract specific target words in a sentence and analyze the relationship between the target word and sentiment words through LSTMs, such as target-dependent LSTM (TDLSTM) and target-connection LSTM (TCLSTM) [24]. TDLSTM matches the hidden output layer of a bi-LSTM encoder with the target word to obtain sentence polarity. TCLSTM extends TDLSTM by encoding each input word together with the target word to obtain the sentence's distributed representation.
Attention is also suitable for TSA. Tang et al. [25] used an RNN with multilayer attention weights to obtain classification results under supervised learning. This method improves the weights of important words by multihop training; because only the weight of one word is increased, it is unsuitable for sentences with multiple important words. Lu and Wu [26] constructed sentiment dictionaries to automatically extract sentiment words from a corpus for sentiment classifiers; SVM was then used to determine the final polarity. Bin et al. [27] proposed a new target-specific sentiment analysis approach based on a multiattention CNN. This method can take parallel text as input and greatly decrease training loss. Additionally, it can effectively compensate for the deficiency of a single attention layer.

B. VISUAL SENTIMENT ANALYSIS
Visual sentiment analysis is carried out by designing a polarity classifier to analyze visual features [28], [29]. Previous studies have established models including low-level feature extraction [30], [31], semantic feature models [32], [33], and deep learning frameworks [34], [35]. These approaches mainly focus on low-dimensional feature extraction, such as color histograms, and the most typical application is human facial emotion recognition [36]. Human facial emotion is the most obvious sentiment symbol and is easy to identify. However, this kind of method is not applicable in other domains because of the semantic gap between low-level and high-level features.
With the advancement of deep learning, You et al. [37] used a pretrained domain-transfer learning approach to analyze sentiment. Ahsan et al. [38] extracted an intermediate visual representation of social event images based on the visual attributes occurring in the images, going beyond sentiment-specific attributes. Song et al. [39] proposed a multilayer attention network to capture the salient regions of the image content, and sentiment polarity was classified according to the salient content. However, this method is effective only for simple images, especially images containing only one object. Dong et al. [40] proposed four shared networks that receive multiple instances as inputs and are connected by a novel loss function consisting of a pair loss and a triplet loss to examine the potential connections among training instances. This method achieves excellent performance on object tracking; in practice, however, sentiment analysis scenarios are usually more complicated.

C. MULTIMODAL SENTIMENT ANALYSIS
Due to the lack of direct mappings from visual semantics to sentiment, social media networks provide other abundant types of information, such as image captions and videos. Several studies [41]- [43] have used multimodal data to construct sentiment classifiers.
It is difficult to directly map images to sentiment. Wollmer et al. [44] and Kumar et al. [45] proposed models that recognize facial emotion by multimodal feature fusion. Poria et al. [46], [47] proposed sharing state parameters from a CNN model with multikernel learning (MKL). Zadeh et al. [48] proposed a tensor fusion method. Byrne et al. [49] proposed employing simultaneous derivation for facial emotion recognition, but this method is still essentially statistical. Xu et al. [50] proposed a hierarchical deep fusion model to explore the cross-modal correlations among images, texts, and social links, which can learn comprehensive and complementary features for more effective sentiment analysis. Their work is interesting and novel; however, the model applies only to specific links, and such links are generally unreliable on social media. Borth et al. [51], Maurya et al. [52], and Li et al. [53] proposed using sentiment-related adjective-noun pairs (ANPs). By means of ANP extraction, visual sentiment ontologies (VSOs) such as Sentibank [32] and SenticNet [10] were constructed. Similarly, Teng-Jiao et al. [54] proposed an object-sentiment pair extraction method based on middle-level semantic and grammatical analysis. Most studies on ANPs and VSOs report high accuracy.

D. CROSS-MODAL SENTIMENT ANALYSIS
Different from the multimodal approach, cross-modal image sentiment analysis attempts to construct a mapping model based on transfer learning. To solve the lack of labeled training data, transfer learning spreads the knowledge from the source domain to the target domain by finding the similarity rules between data in two domains. Tsai et al. [55] proposed heterogeneous transfer from one modality to another, which is still a one-to-one transfer paradigm. Huang et al. [56] performed knowledge transfer between two domains, and models in two domains both share the same parameters. Schmitter et al. [57] proposed mapping semantic labels by estimating the probability distributions of the source domain to the target domain. Ji et al. [58] proposed a novel bilayer multimodal hypergraph learning (Bi-MHG) for robust sentiment prediction of multimodal tweets. Van Opbroek et al. [59] used kernel learning to calculate the weight parameters of image segments for polarity classification. Wu et al. [60] proposed a multilabel image annotation approach through sharing training structures. Huang et al. [56] calculated the mapping distance between the image feature and the semantic label through the MMD distance square.
Although cross-modal sentiment analysis methods solve the problem of insufficient data through transfer learning, these sentiment classifiers ignore the semantic inner relations in textual descriptions. Attention weights [61], [62] are available parameters for analyzing inner relations, but attention has so far been widely used only in image object detection.
Inspired by the work [56] on learning joint visual and textual models, this paper relies on the MMD distance to embed the similarity between images and descriptions for image annotation. Different from the previously mentioned works, our approach extracts the descriptive context related to an image for sentiment analysis with a class-aware IDLSTM.

III. METHOD
In this section, we propose our approach for cross-modal sentiment analysis and present a detailed explanation. The architecture of our approach is shown in Figure 1. For this task, an image and its corresponding description are fed into our model as the input. First, in stage (a), the image and its caption are processed separately: the image goes through a fine-tuned CNN, from which the visual features of the sample are fetched, and the caption text is transformed into a sequence of vectors through word embedding, which forms the textual features of the sample. In stage (b), a joint mapping model maps the visual and textual features into the same Hilbert subspace; this is the key procedure of our approach, on which the eventual classification accuracy greatly depends. Finally, in stage (c), class-aware sentence representation (CASR) is carried out, and the inner-class dependency LSTM (IDLSTM) is responsible for the sentiment polarity classification. Section III.A provides more information about how the visual and textual features are extracted in stage (a). Section III.B explains the joint mapping model of stage (b) in detail. Section III.C illustrates the procedure of CASR and IDLSTM.
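As an orientation, the three-stage data flow above can be sketched with placeholder components. All function bodies, shapes, and names here are illustrative stand-ins, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def stage_a(image, caption_len=10):
    # Stage (a) stand-in: the fine-tuned CNN yields one vector per image
    # region; the embedding step yields one GloVe-like vector per word.
    visual = rng.standard_normal((225, 512))
    textual = rng.standard_normal((caption_len, 300))
    return visual, textual

def stage_b(visual, textual, m=8):
    # Stage (b) stand-in: project both modalities into a shared m-dim
    # subspace (the real model learns this mapping via inner-class MMD).
    Wv = rng.standard_normal((visual.shape[1], m))
    Wt = rng.standard_normal((textual.shape[1], m))
    return visual @ Wv, textual @ Wt

def stage_c(mapped_visual, mapped_textual):
    # Stage (c) stand-in: CASR + IDLSTM replaced by a trivial polarity rule.
    score = float(mapped_visual.mean() + mapped_textual.mean())
    return 1 if score > 0 else -1

v, t = stage_a(image=None)
mv, mt = stage_b(v, t)
polarity = stage_c(mv, mt)
```

The sketch only fixes the interfaces between the stages; the following subsections fill in each stage.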

A. VISUAL AND TEXTUAL FEATURE EXTRACTION
The input image and its caption are processed separately to obtain visual features and textual features. The image is analyzed by a fine-tuned VGGNet-16 [17], and the descriptive sentences are transformed into vectors through GloVe [18].
In the fine-tuned VGGNet-16, each convolutional feature map f_i corresponds to one specific region in the image, where N is the number of feature maps and D_I is the representation dimension for each region. Specifically, the image feature maps F_I extracted from a raw image I through VGGNet-16 are denoted as follows:

F_I = CNN(I) = [f_1, . . . , f_i, . . . , f_N],  f_i ∈ R^(D_I)    (1)

In this case, the input image is fed into VGGNet-16 at a resolution of 225 × 225. The output of the conv5_3 layer is 15 × 15, and the dimension N is 512.
The description of a given image is represented as S = [w_1, . . . , w_i, . . . , w_L], where w_i is a word of the sentence and L is the maximum number of words in the description. Each word w_i is embedded as a 300-dimensional GloVe [18] word vector v_i ∈ R^300, and the sentence is represented as:

S = [v_1, . . . , v_i, . . . , v_L] ∈ R^(L×300)    (2)
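The two feature shapes can be illustrated with toy stand-ins: a conv5_3-like activation flattened into per-region vectors, and a padded sequence of embedding lookups. The random tensors and the tiny vocabulary are placeholders for the real VGGNet-16 activations and GloVe table:

```python
import numpy as np

# Toy stand-in for a conv5_3 activation: 512 channels over a 15x15 grid.
conv5_3 = np.random.default_rng(0).standard_normal((512, 15, 15))

# Flatten the spatial grid: one 512-d vector per region (15 * 15 = 225).
F_I = conv5_3.reshape(512, -1).T          # shape (225, 512)

# Toy embedding table standing in for the 300-d GloVe vectors.
vocab = {"a": 0, "happy": 1, "dog": 2}
glove = np.random.default_rng(1).standard_normal((len(vocab), 300))

def embed(sentence, max_len=8):
    # Pad/truncate to L = max_len words, matching S = [v_1, ..., v_L].
    ids = [vocab[w] for w in sentence.split()][:max_len]
    S = np.zeros((max_len, 300))
    S[:len(ids)] = glove[ids]
    return S

S = embed("a happy dog")                  # shape (L, 300), zero-padded
```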

B. JOINT MAPPING MODEL
Transfer learning is applied in our approach to construct the correlation between image objects and labels. Previous crossmodal approaches learn global domain shifts by projecting all visual and textual features in both domains into a single subspace, which causes the absence of intra-affinity within classes [63]. To address the problem, our approach utilizes the intra-affinity of classes shared by both visual and textual domains, and an inner-class mapping model (IMM) is therefore proposed.
In the training stage, there are two domains: a labeled source domain D_s = {(x_i, y_i)}_(i=1..n_s) and an unlabeled target domain D_t = {x_j}_(j=1..n_t), where x_i, x_j ∈ R^(d_1) are visual features. The source and target visual vectors, whose vector spaces are denoted by X_s and X_t, respectively, share the same vector space X but are subject to different distributions. Similarly, the source and target textual vectors, whose vector spaces are denoted by Y_s and Y_t, share the same vector space Y but are subject to different distributions, where y_i ∈ R^(d_2) represents the textual features. During training, the domain shifts as the number of image samples increases. Therefore, this paper assumes that the marginal distributions differ, P(X_s) ≠ P(X_t), as do the conditional distributions. The similarity of the visual and textual features in D_s and D_t is the primary consideration for transfer learning. In our approach, the maximum mean discrepancy (MMD) [64] is utilized to learn the potential features in a reproducing kernel Hilbert space (RKHS). The MMD between domains is formulated as:

MMD(D_s, D_t) = || (1/n_s) Σ_(i=1..n_s) φ(x_i) − (1/n_t) Σ_(j=1..n_t) φ(x_j) ||_H    (3)

where H is the RKHS and φ(·) is the unified transform function. The original sample feature vectors are thereby mapped into the RKHS.
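The empirical MMD in (3) can be computed without an explicit φ by expanding the squared RKHS norm into kernel evaluations. A minimal sketch with a Gaussian kernel (the kernel choice and bandwidth are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2); its implicit
    # feature map plays the role of phi(.) in the RKHS H.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    # Squared MMD between the two empirical distributions:
    # mean k(s, s) + mean k(t, t) - 2 * mean k(s, t).
    return (rbf(Xs, Xs, gamma).mean() + rbf(Xt, Xt, gamma).mean()
            - 2 * rbf(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (100, 4)), rng.normal(0, 1, (100, 4)))
far = mmd2(rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4)))
```

Samples drawn from the same distribution give a near-zero MMD, while distant distributions give a much larger value, which is what makes the statistic usable as a cross-domain similarity measure.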
To make use of the intra-affinity of classes shared by the visual and textual features, this paper improves transfer component analysis (TCA) [65] and proposes an inner-class distance. The distance between the domains for each class is measured as:

Dist_c(D_s^c, D_t^c) = || (1/n_s^c) Σ_(x_i ∈ D_s^c) φ(x_i) − (1/n_t^c) Σ_(x_j ∈ D_t^c) φ(x_j) ||_H^2    (4)

where c ∈ {1, 2, . . . , C} denotes the class, D_s^c and D_t^c represent the feature sets belonging to class c in the source and target domains, and n_s^c and n_t^c are the numbers of feature vectors belonging to class c in the source and target domains, respectively. The factors 1/n_s^c and 1/n_t^c in (4) average the MMD distances of all features in the same class and prevent them from being dominated by individual samples.
The TCA approach converts the data of the two domains into a new Hilbert space to reduce the difference between them and solves the resulting semidefinite programming (SDP) problem by constructing a kernel matrix. The MMD distance is then rewritten in terms of the kernel matrix as:

Dist_c(D_s^c, D_t^c) = tr(K L_c)

where L_c is the MMD matrix and K is the kernel matrix constructed from the inner products of the mapped features. A transformation matrix W ∈ R^((n_1+n_2)×m) converts the data from the original space into the RKHS, where m ≪ d is the dimension of the RKHS. Equation (4) is converted by TCA into the trace optimization problem:

min_W  tr(W^T K L_c K W) + μ tr(W^T W)   s.t.  W^T K H K W = I    (8)

where tr(W^T W) is a regularization term and μ is the trade-off factor that ensures the model is well defined. The constraint W^T K H K W = I is used to maintain the data variance, where I ∈ R^(m×m) is an identity matrix and H is the centering matrix. The Lagrange multiplier method is used to solve (8):

L(W, Φ) = tr(W^T K L_c K W) + μ tr(W^T W) − tr((W^T K H K W − I) Φ)    (9)

Equation (9) is nonconvex and can be formalized as a generalized eigendecomposition problem by setting the derivative ∂L/∂W = 0:

(K L_c K + μI) W = K H K W Φ    (11)

Finally, the eigenvectors corresponding to the m smallest eigenvalues of the generalized eigendecomposition in (11) are taken to obtain the transformation matrix W. The optimized algorithm is shown in Table 1, and a locally optimal solution is obtained when the objective converges over iterations.
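The eigendecomposition in (11) can be sketched for a single class with a linear kernel; the equivalent formulation below takes the leading eigenvectors of (K L K + μI)^(−1) K H K, which is the standard closed-form TCA solution. Sample sizes, μ, and the kernel choice are illustrative assumptions:

```python
import numpy as np

def tca_transform(Xs, Xt, mu=1.0, m=2):
    # One-class TCA sketch: build K, the MMD matrix L, and centering H,
    # then solve the generalized eigenproblem for the transform W.
    n1, n2 = len(Xs), len(Xt)
    n = n1 + n2
    X = np.vstack([Xs, Xt])
    K = X @ X.T                              # linear kernel matrix
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
    L = np.outer(e, e)                       # MMD matrix: tr(KL) = Dist
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    A = K @ L @ K + mu * np.eye(n)           # objective + regularizer
    B = K @ H @ K                            # variance constraint
    # Leading eigenvectors of A^{-1} B give the columns of W.
    vals, vecs = np.linalg.eig(np.linalg.solve(A, B))
    idx = np.argsort(-vals.real)[:m]
    W = vecs[:, idx].real                    # transformation matrix
    return K @ W                             # embedded samples, shape (n, m)

rng = np.random.default_rng(0)
Z = tca_transform(rng.normal(0, 1, (20, 5)), rng.normal(2, 1, (15, 5)))
```

The full model would repeat this per class c and across visual/textual feature pairs, whereas this sketch only shows the linear-algebra core of one mapping step.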

C. SENTIMENT CLASSIFICATION
In this section, a sentiment classifier is introduced. The sentiment classifier is composed of two models: a class-aware sentence representation (CASR) model and an inner-class dependency LSTM (IDLSTM).

1) CLASS-AWARE SENTENCE REPRESENTATION
Each word w_i in the description is embedded as a GloVe [18] word vector. The description of the given image is then represented as S ∈ R^(L×D), and the class word is represented as c_i ∈ R^D. Using the literature [31] for reference, each word w_i is associated with a given class c_i to form a sequence, and the sentence S is represented as:

S_(c_i) = [w_1 ⊕ c_i, w_2 ⊕ c_i, . . . , w_L ⊕ c_i]

The distributed representation S_(c_i) is then fed to a GRU for context propagation, followed by an attention layer to obtain the class-aware sentence representation. The GRU is formulated as follows:

z_t = σ(W_z x_t + U_z h_(t−1))
r_t = σ(W_r x_t + U_r h_(t−1))
s̃_t = tanh(W_s x_t + U_s (r_t ⊙ h_(t−1)))
s_t = (1 − z_t) ⊙ s_(t−1) + z_t ⊙ s̃_t
h_t = s_t

where h_t is the output of the hidden layer, s_t is the cell state at time t, σ and tanh are activation functions, z is the update gate, and r is the reset gate. The whole step of the GRU is abbreviated as h_t = GRU(x_t, h_(t−1)).

The sentiment polarity and intensity of each word are different. To highlight the words sentimentally relevant to class c_i, an attention layer is added to capture the weight of each word:

α_i = softmax(tanh(h_i^T W_s + b_s))
r_(c_i) = Σ_(i=1..L) α_i h_i

where the distributed representation combined with the attention weights is r_(c_i) ∈ R^(D_s), W_s ∈ R^(D_s×1), and b_s is a scalar.
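The GRU recurrence and the word-level attention pooling above can be condensed into a small numpy sketch. Dimensions and the initialization scale are toy assumptions, and the class concatenation is presumed already applied to the input vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Dh = 6, 4                             # toy word-vector and hidden sizes

def init(*shape):
    return rng.standard_normal(shape) * 0.1

Wz, Uz = init(Dh, D), init(Dh, Dh)       # update-gate parameters
Wr, Ur = init(Dh, D), init(Dh, Dh)       # reset-gate parameters
Wh, Uh = init(Dh, D), init(Dh, Dh)       # candidate-state parameters
ws, bs = init(Dh), 0.0                   # attention parameters W_s, b_s

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)                 # update gate z_t
    r = sigmoid(Wr @ x + Ur @ h)                 # reset gate r_t
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))     # candidate state
    return (1 - z) * h + z * h_tilde             # new hidden state h_t

def casr(S_ci):
    # S_ci: (L, D) word vectors, each already fused with its class word.
    h, H = np.zeros(Dh), []
    for x in S_ci:
        h = gru_step(x, h)
        H.append(h)
    H = np.array(H)                              # hidden states, (L, Dh)
    scores = np.tanh(H @ ws + bs)                # per-word attention energy
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H                             # sentence vector r_ci

r_ci = casr(rng.standard_normal((5, D)))
```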

2) INNER-CLASS DEPENDENCY LSTM
The labels of a given image are obtained, as is the class-aware sentence representation r_(c_i), in the former sections. The aim of this section is to reinforce the descriptive context related to an image. This paper proposes an LSTM that models the dependency of the label vectors on the other word vectors in descriptive sentences by increasing the weight of the words associated with the image labels. The sentiment classifier IDLSTM consists of two parts, as shown in Figure 2.
The mapped image label is normalized as the query q for further memory networking. To reduce the loss caused by the morphological inconsistency between the label and the description, the class c_i is concatenated to the image label q as q = q ⊕ c_i ∈ R^(2D). The distributed representation Q is supplied as the memory slots in Figure 2(a).
The attention weight of each memory slot with respect to the query is computed as:

β_i = softmax(q^T Q_i)

where Q = [Q_1, Q_2, . . . , Q_M] ∈ R^(M×1) and the attention weight β = [β_1, β_2, . . . , β_M] ∈ R^(M×1). Each β_i is a strength value of the match between a word and a label. Each word in the sentence is represented by the corresponding class-aware sentence representation. Considering that the memory value is usually too small and easily forgotten, this paper uses an attention-based LSTM, LSTM_att, with size D_o to predict the correct classification of these words, as shown in Figure 2. The response vector o is obtained by summing the output vectors in Q′, weighted by the relatedness measures in β:

o = Σ_(i=1..M) β_i Q′_i

where o ∈ R^(D_o). In the final stage, the distributed representation q related to the image is added to the memory output o to generate the predicted value.
ŷ = softmax((q + o)^T W_smax + b_smax)

where W_smax ∈ R^(D_o×C), b_smax ∈ R^C, and the class with the maximal value of ŷ is taken as the prediction. Table 2 shows the algorithm of all steps.
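The memory readout and final prediction can be sketched as follows. The LSTM_att refinement of the memory slots is omitted here (slots are used directly as Q′), so this shows only the attention-and-sum readout plus the softmax head, with toy dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
Do, M, C = 8, 5, 2                 # memory size, slot count, class count
Q = rng.standard_normal((M, Do))   # memory slots (class-aware word vectors)
q = rng.standard_normal(Do)        # normalized image-label query

beta = softmax(Q @ q)              # match strength of each word to the label
o = beta @ Q                       # attention-weighted response vector
W_smax = rng.standard_normal((C, Do)) * 0.1
b_smax = np.zeros(C)
y_hat = softmax(W_smax @ (q + o) + b_smax)
pred = int(np.argmax(y_hat))       # predicted polarity class
```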

3) LOSS FUNCTION
In this paper, the memory network is trained for 30 epochs using cross entropy with L2-regularization as the loss function.
The loss is formulated as:

loss = −(1/n) Σ_(i=1..n) Σ_(k=1..C) y_i^k log(ŷ_i^k) + λ‖θ‖²

where n is the number of samples, i is the sample index, k is the class index, θ denotes the model parameters, λ is the regularization weight, and λ = 10^(−4). The optimizer is the ADAM algorithm [66] based on stochastic gradient descent (SGD); its parameters are adapted during learning, and the learning rate is 0.001.
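The cross-entropy plus L2 objective can be computed directly from one-hot labels and predicted probabilities. The toy labels, probabilities, and parameter tensor below are illustrative:

```python
import numpy as np

def loss(y_true, y_prob, params, lam=1e-4):
    # Mean cross entropy over n samples and C classes plus an L2 penalty
    # lam * ||theta||^2 summed over all parameter tensors.
    n = y_true.shape[0]
    ce = -(y_true * np.log(y_prob + 1e-12)).sum() / n
    l2 = lam * sum((p ** 2).sum() for p in params)
    return ce + l2

y_true = np.array([[1, 0], [0, 1]])          # one-hot polarity labels
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])  # softmax outputs y_hat
params = [np.ones((3, 3))]                   # toy parameter tensor
l = loss(y_true, y_prob, params)
```

A perfect prediction drives the cross-entropy term to zero, leaving only the regularization penalty; in training, the value would be minimized by ADAM with the learning rate stated above.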

IV. EXPERIMENTS
In this section, experiments are carried out to demonstrate the effectiveness of our model. First, in Section 4.1, the two datasets on which the experiments are conducted are introduced. Then, Section 4.2 presents the evaluation metrics and describes the baseline methods.

A. DATASETS
1) GETTY IMAGES
To obtain a fine-tuned CNN, a large number of labeled images is needed. The main reason to use Getty Images is that the dataset is already labeled and contains images, labels and relatively formal image descriptions, whereas on social media sites different people may describe the same objects differently, which makes a well-labeled training dataset harder to obtain. To improve the accuracy and robustness of our model, we use weakly labeled data for training in our implementation. We use a list of 368 classes and their polarities for sentiment prediction. Query results of images and texts from the Getty Images website are then collected in line with these classes, and a final weakly labeled dataset containing 10,496 images and texts is obtained.

2) TWITTER 1269
The Twitter1269 dataset is an open-access dataset proposed in [71]. This dataset is a popular image sentiment benchmark composed of 1,269 images collected from Twitter. Each image in the dataset was manually labeled by five Amazon Mechanical Turk (AMT) workers as strongly positive (2), positive (1), negative (−1) or strongly negative (−2). The images were ranked according to the sum of their scores from the 5 AMT workers and then divided into three confidence-level batches:
High confidence (5 agree): 882 images for which all five workers assigned the same sentiment, of which 581 are positive and 301 negative.
Mid confidence (4 agree): 1,116 images labeled with the same sentiment by at least four workers, of which 689 are positive and 427 negative.
Low confidence (3 agree): 1,269 images labeled with the same sentiment by at least three workers, of which 769 are positive and 500 negative.

B. EVALUATION METRICS AND BASELINES
There are four main evaluation protocols widely used in image sentiment analysis: precision (Pre), recall (Rec), F-measure (F1) and accuracy (Acc). The following open-source baselines are compared with our model for performance evaluation:
Single textual model: A single textual model is a sentiment analysis method based on textual features. Tan et al. [67] proposed a model using multikernel learning to extract text features as the input of a support vector machine to analyze sentiment polarity. Le and Mikolov [68] proposed an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.
Single visual model: A single visual model refers to a sentiment analysis model based on visual features. Siersdorfer et al. [69] proposed a model using low-level visual features extracted by a global color histogram (GCH) for sentiment classification. You et al. [37] proposed a progressive CNN (PCNN) method, which uses CNN to extract visual features for regression.
Multimodal model: The multimodal model in this paper refers to a sentiment analysis method consisting of more than one model to extract the feature vector from different datasets. Then, the final polarity is determined by the voting mechanism. Borth et al. [51] proposed the Sentibank method. Sentibank is a method carried out on VSO constructed by extracting the ANPs from relevant descriptions. Yuan et al. [70] proposed Sentribute to predict image sentiment polarity by using middle-level attributes combined with voting.
Cross-modal model: The cross-modal model uses textual features as a supplement to image sentiment analysis. You et al. [71] proposed a transfer learning model based on information entropy to label images. Our proposed cross-modal model uses the MMD to label images and an extra attention-based LSTM to classify sentiment polarity.
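The four evaluation protocols used to compare these baselines follow their standard binary definitions, which can be stated compactly:

```python
def metrics(y_true, y_pred):
    # Standard binary precision, recall, F1, and accuracy over 0/1 labels.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    pre = tp / (tp + fp)                       # precision
    rec = tp / (tp + fn)                       # recall
    f1 = 2 * pre * rec / (pre + rec)           # harmonic mean of pre/rec
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return pre, rec, f1, acc

pre, rec, f1, acc = metrics([1, 1, 0, 0], [1, 0, 1, 0])
```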

C. INNER-CLASS MAPPING PERFORMANCE
We test the inner-class mapping model in this part. First, the image feature vectors and labels are obtained by pretrained CNN and GloVe, respectively. Then, the images in the training set are used as the source domain, and the minimum MMD distance is calculated with the corresponding labels.
In the validation stage, the mapping performance is verified with cross-validation. The experimental results are the average of 10 runs, and the dataset is shuffled before each run. The Getty Images dataset is randomly divided into two partitions of 80% and 20%; the 80% partition is used as the training set and the 20% partition as the test set.
Experiments are carried out on a workstation running Ubuntu 16.04 (x86_64) with an NVIDIA GTX 1060 GPU. Specifically, the CNN is initialized with a 16-layer VGGNet (13 convolutional layers and 3 fully connected layers) pretrained on the Getty Images dataset to extract image features. The feature maps are taken from the conv1_1 and conv4_3 layers, and the output of the conv5_3 layer is used as the input to the inner-class mapping model. The learning rates of the convolutional layers and the last fully connected layer on the classification branch are initialized to 0.001 and 0.01, respectively. All parameters of the visual and textual components can be jointly optimized. Following [63], unsupervised learning is used in training with the hyperparameter set to 1, and the number of training iterations on the GPU is 10,000. The performance of the inner-class mapping model is shown in Table 3.

D. PERFORMANCE ON GETTY IMAGES
In the experiments, the performance of the proposed model is verified with cross-validation. The experimental results are averaged over 10 runs, and the dataset is shuffled before each run. The Getty Images dataset is randomly divided into two partitions of 80% and 20%; the 80% partition is used as the training set and the 20% partition as the test set.
Before validating the proposed model on Getty Images, we saved the image descriptions to a file for preprocessing. Preprocessing [72] includes three steps: 1) numbers and special characters in the description are removed; 2) the description file is tokenized with the NLTK tokenizer; 3) words that appear fewer than 5 times are removed, and the dimensions of the textual feature vectors are limited to 300. Then, the obtained images are divided into several batches for training and testing. In each batch, the parameters of the visual and textual components are jointly optimized. To balance memory load and convergence rate, the batch size is set to 1,000, and the learning rate is set to 0.01. Figure 3 illustrates how the variance of the loss function changes as the number of batch iterations increases. A preliminary analysis shows that, on randomly selected batches, the variance of the loss decreases as the number of iterations grows; the loss function converges after approximately 10 iterations.
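The three preprocessing steps can be sketched as below; note that a plain whitespace split stands in for the NLTK tokenizer the paper uses, and the exact regular expression is an assumption:

```python
import re
from collections import Counter

def preprocess_captions(captions, min_count=5, max_dims=300):
    """Clean captions, tokenize, drop rare words, and cap the vocabulary size."""
    # Step 1: lowercase and strip numbers/special characters.
    cleaned = [re.sub(r"[^a-z\s]", " ", c.lower()) for c in captions]
    # Step 2: tokenize (whitespace split stands in for the NLTK tokenizer).
    tokenized = [c.split() for c in cleaned]
    # Step 3: keep at most max_dims words appearing at least min_count times.
    counts = Counter(w for toks in tokenized for w in toks)
    vocab = [w for w, n in counts.most_common(max_dims) if n >= min_count]
    keep = set(vocab)
    return [[w for w in toks if w in keep] for toks in tokenized], vocab
```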
As shown in Table 4, the recall of the single textual models [67], [68] is generally lower than that of the other models, and the recall of the single visual model of Siersdorfer et al. [69] is the highest (84.0%). The cross-modal model of You et al. [71] has the highest precision at 84.6%, and the precision of our model is close to it. The recall of our model is the second highest, approximately 5% lower than that of the single visual model of Siersdorfer et al. [69], but its F1 reaches 81.0%. Both the F1 and the accuracy of our model exceed those of all baselines. The cross-modal model of You et al. [71] carries out image annotation based on information entropy over labeled images for sentiment analysis, whereas this paper uses an attention-based LSTM to obtain the sentiment words for image sentiment analysis. Hence, although the precision of our model is approximately 1.4% lower than that of You et al. [71], our model has the advantage that it can decompose complex sentences in image descriptions to obtain the image sentiment polarity.

E. PERFORMANCE ON TWITTER1269
In this section, the model proposed in this paper is validated on three batches of Twitter1269 images, and the performance is compared with baselines where the experiments were carried out on the same dataset. Figure 4 illustrates the experimental data on Twitter1269.
As seen in Table 5, the precision, recall, F1, and accuracy of all models decrease as the image confidence declines, and it is apparent from this table that the subjectivity of different individuals has a great influence on the judgment of image sentiment. However, even when this subjectivity is considered, our model maintains a considerable advantage in precision, F1, and accuracy. The correlational analysis shows that all evaluation protocols of our model are better than those of the single textual feature models. Even though the recall of the two single visual models, Siersdorfer et al. [69] and You et al. [37], is very high, their overall classification precision is obviously lower than that of our model. Compared with our model, the precision, recall, and F1 of You et al. [37] are approximately 9%, 6%, and 1.9% lower, respectively. These results suggest that our model, which uses visual and textual features jointly to classify sentiment polarity, has obvious advantages.

F. CASE STUDY
This section illustrates how the proposed model works through several cases.

1) CASE 1: SENTIMENT ANALYSIS ON RANDOM IMAGES
Images in the Twitter1269 dataset were labeled by AMT staff with negative sentiment as '0' and positive sentiment as '1'.
The final image polarity was determined by the predicted probability of IDLSTM. To evaluate the performance of our model directly, six images with different confidence levels are selected as samples, indexed 1 to 6, and their image descriptions are manually added, as shown in Figure 4.
In Figure 4, images 1, 2, and 5 are positive samples with high confidence levels, image 4 is a negative sample with a high confidence level, and images 3 and 6 are negative samples with low confidence levels. Because the sentiment polarity of the low-confidence samples is subjective and uncertain, the high-confidence samples are the focus in this paper. y = 0.5 is set as the reference line in the bar chart. The experimental results are shown in Figure 5.
The bar chart in Figure 5 shows the predictions of the different models on the samples. It is apparent that a single textual model considers only textual features. The textual descriptions of images 2, 3, 4, and 6 are relatively negative, and the predictions of the single textual models are all negative, whereas image 2 is actually positive. This means that the single textual model has limitations in social media analysis.
For a single visual model, only visual features are considered: images 1, 2, 3, and 6, with distinct colors, are identified as positive samples, and images 4 and 5, with gloomy colors, are predicted as negative samples. The prediction for image 5 is obviously incorrect. The multimodal model of Borth et al. [51] predicted image 2 as negative, whereas the ground truth of image 2 is positive. A possible explanation for this result is that Borth et al. [51] relies heavily on ANPs: if there are no ANPs in the description or the description does not follow normal syntax, the result of Sentibank will be biased. The multimodal model of Yuan et al. [70] gives a positive polarity to image 3, whereas image 3 is actually a negative sample. Notably, the essence of the Sentribute method [70] is to train 102 classifiers to distinguish image content attributes, each classifier corresponding to one kind of sentiment image scene label, with the final result obtained by voting. However, with a small training set, Sentribute may not be able to recognize all attributes. With the model proposed in this paper, the results for images 1 and 2 are positive, and those for images 3, 4, and 6 are negative; all predictions are correct. In addition, for image 5, with its gloomy visual color and positive textual description, the predicted probability is close to 0.5.
However, the accuracy of the inner-class mapping model is not yet high enough for complex image annotation. The results of this case indicate that this imperfect annotation accuracy does not compromise the correctness of the final polarity prediction. In particular, this case confirms that our model is suitable for social media sentiment analysis. Further work is required to improve the accuracy of the inner-class mapping model.

2) CASE 2: VISUALIZATION OF ATTENTION
In this section, case 2 is designed to show how the attention-based IDLSTM works through visualization [73]. For this purpose, the output of IDLSTM in each iteration is captured.
Because words in a sentence might belong to the same class, attention weights are suitable for finding the words in the sentence that are related to the image labels.
As shown in Figure 6, an image is randomly selected with the caption ''coffee is better than tea''. ''Coffee'' and ''tea'' in the sentence belong to the same class, and the sentiment of ''coffee'' comes from ''better''. When ''coffee'' and ''tea'' of the same class coexist, it is necessary to use the image context as prior knowledge. Our model repeatedly compares the ''coffee'' linked to the image with the other words in the sentence through the attention-based IDLSTM and finally highlights the sentimental words as the classification result. Figure 6 shows how the weights change during the IDLSTM classification stage. The attention weight is reflected by the background color of each word: the deeper the color, the more important the word. Before the iteration begins (iteration 0), the weights of all words in the image description are distributed equally. As the iterations proceed, the attention weights change markedly. In the third iteration (iteration 3), ''coffee'' and ''tea'' are highlighted, indicating that class information is effectively introduced. In the 10th iteration (iteration 10), the highest weight is assigned to the word ''better'', indicating that weights change at the word level and that the attention-based IDLSTM can dynamically highlight the sentimental words of the entire sentence to make the correct classification.
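A background-color rendering like the one in Figure 6 can be sketched as below; the red-with-opacity color scheme and HTML output format are illustrative assumptions, not the paper's actual visualization code:

```python
def attention_to_html(words, weights):
    """Render each word with a red background whose opacity tracks its attention weight."""
    total = sum(weights)
    norm = [w / total for w in weights]      # normalize weights to sum to 1
    peak = max(norm)
    spans = []
    for word, w in zip(words, norm):
        alpha = w / peak                     # deepest color for the most attended word
        spans.append(f'<span style="background: rgba(255,0,0,{alpha:.2f})">{word}</span>')
    return " ".join(spans)
```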

3) CASE 3: ABLATION EXPERIMENTS
In this case, ablation experiments are carried out to quantify the effectiveness of the inner-class mapping model (IMM) and the inner-class dependency LSTM (IDLSTM) introduced in this paper. The proposed model is retrained by ablating the following components on the Getty Images dataset: (1) Visual sentiment classifier, where only the image features are considered. To further study the effect of the visual features, the IDLSTM is ablated. The visual sentiment classifier (CNN+IMM) consists of the CNN and the IMM. Its outputs are image labels, and the polarity of an image is finally calculated by summing the polarity of each image label.
(2) Textual sentiment classifier, where only the textual feature is considered. To study the effect caused by IDLSTM, CNN and IMM are ablated. The output of IDLSTM is the polarity of the description.
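The two ablation variants and the full model can be sketched as a small factory that composes the available components; `visual_score` and `textual_score` are hypothetical stand-ins for the CNN+IMM and IDLSTM scoring paths:

```python
def build_variant(use_imm, use_idlstm, visual_score, textual_score):
    """Compose an ablation variant: CNN+IMM only, IDLSTM only, or the full model."""
    def predict(image, caption):
        scores = []
        if use_imm:
            scores.append(visual_score(image))     # polarity summed over image labels
        if use_idlstm:
            scores.append(textual_score(caption))  # polarity of the description
        return sum(scores) / len(scores)           # average the active components
    return predict
```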
Because of the uncertainty of Twitter1269, the three models are tested on Getty Images to quantify their effectiveness. Table 6 shows the performance of the ablation experiments. Compared with CNN+IMM, IDLSTM clearly improves the performance of the model: the recall of the whole model is lower by more than 1%, but the precision, F1, and accuracy are higher by 3.2%, 4%, and 5%, respectively. The whole model also outperforms the single IDLSTM, with precision, F1, and accuracy higher by 5%, 2%, and 1%, respectively. This means that the inner-class mapping model is useful. Consequently, it can be concluded that inner-class dependency exists, and exploiting the correlations between image and description leads to more effective cross-modal image sentiment classification.

4) CASE 4: THE INFLUENCE OF DIFFERENT INPUT SEQUENCES
LSTM carries out sentiment analysis based on memory slots, which are influenced by the input sequence; consequently, the output of LSTM is affected by the order of its inputs.
In this experiment, 40 images from the Twitter1269 dataset with high confidence levels were randomly selected as samples, indexed 1 to 40, and their image descriptions were also manually added. The class [41] order was shuffled randomly, and four queues with different orders were taken as inputs of IDLSTM. Figure 7 illustrates the predicted positive probability of each image: the horizontal axis represents the image index, and the vertical axis is the positive probability.
As seen in Figure 7, the four output curves differ according to the four different input class sequences. Although the differences in the input class sequences lead to a certain deviation in the predicted probability, each curve shows only small-scale fluctuation, and in no instance does the polarity prediction flip across the reference line due to the difference in the input sequence of IDLSTM.
This shows that the polarity output of the proposed model is robust to the input class sequence of IDLSTM.
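A check of this kind can be sketched as follows: run a predictor over every permutation of the class sequence and verify that the predicted polarity never crosses the 0.5 reference line. The `predict` callback is a hypothetical stand-in for IDLSTM inference:

```python
import itertools

def order_robust(predict, classes, threshold=0.5):
    """Return True if the predicted polarity never flips across the reference
    line when the input class sequence is permuted."""
    polarities = {predict(list(order)) >= threshold
                  for order in itertools.permutations(classes)}
    return len(polarities) == 1  # a single polarity value means no flip occurred
```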

V. CONCLUSION
In this paper, a joint visual-textual cross-modal sentiment analysis model is proposed. The model extracts visual object features and uses them as attention weight parameters of the LSTM to obtain the context in the corresponding textual description that relates to the image objects; the sentiment polarity of the image is then obtained. The model can not only handle multi-object image sentiment analysis but also improve the utilization of semantically correlated descriptions. In the experiments, the Getty Images and Twitter1269 datasets are used to validate the proposed sentiment analysis model. The results show that the proposed model outperforms existing state-of-the-art models on social media image datasets.
However, there are still some unsatisfactory aspects of the model in the experiments, such as memory overhead, long system runtime, and limitations in some special application scenarios. Future research should investigate the following: 1) improving the precision of the inner-class mapping model in transfer learning, for example, by using a knowledge graph to provide prior knowledge for target text feature mapping; 2) model parameter optimization and structure reconstruction; and 3) application to other domains, such as audio-video domain adaptation.

YONGHUA ZHU was born in Zhejiang, China, in 1967. He received the B.S. degree from the Department of Information and Control Engineering, Xi'an Jiaotong University, the M.S. degree from the Department of Electrical Engineering, Shanghai Tongji University, and the Ph.D. degree from the School of Communication and Information Engineering, Shanghai University. Since 2010, he has been a Teacher with the School of Computer Engineering and Science, Shanghai University, where he is currently an Associate Professor and a Supervisor with the Shanghai Film Academy. His research interests include software engineering and machine learning, especially interdisciplinary sentiment analysis.