Multidimensional Extra Evidence Mining for Image Sentiment Analysis

Image sentiment analysis is a hot research topic in the field of computer vision. However, two key issues need to be addressed. First, high-quality training samples are scarce. There are numerous ambiguous images in the original datasets owing to diverse subjective cognitions from different annotators. Second, the cross-modal sentimental semantics among heterogeneous image features has not been fully explored. To alleviate these problems, we propose a novel model called multidimensional extra evidence mining (ME2M) for image sentiment analysis, which involves sample-refinement and cross-modal sentimental semantics mining. A new soft voting-based sample-refinement strategy is designed to address the former problem, whereas the state-of-the-art discriminant correlation analysis (DCA) model is used to fully mine the cross-modal sentimental semantics among diverse image features. Image sentiment analysis is conducted based on the cross-modal sentimental semantics and a general classifier. The experimental results verify that the ME2M model is effective and robust and that it outperforms the most competitive baselines on two well-known datasets. Furthermore, it is versatile owing to its flexible structure.


I. INTRODUCTION
With the rapid development of social media, we prefer to upload videos, audio, images, and texts to blogs (or microblogs or WeChat) to express our personal emotions. For example, bright-colored images usually convey good (or positive) emotions, while dim images often convey bad (or negative) emotions. Apart from texts and audio, images contain much valuable sentimental semantics owing to their plentiful visual information, which can be leveraged for many significant applications. On one hand, timely psychological intervention becomes feasible for depressed people if we can accurately capture their emotions from their social media content. On the other hand, we can use the sentiment preferences predicted by a model to complete popularity predictions. Based on these predictions, people can make smarter decisions than before. In summary, image sentiment analysis is highly significant in our lives. We propose a novel model called ME2M, which stands for multidimensional extra evidence mining, for image sentiment analysis. The ME2M model exploits multiple valuable pieces of evidence, including ''new image features,'' ''refined samples,'' and ''cross-modal sentimental semantics,'' to build an effective classification model. Conceptually and empirically, the main contributions of this paper can be summarized as follows: (1) We propose a new sample-refinement strategy to choose high-quality images for model training. The strategy is also a data-augmentation method, which can enrich the original data and build a data foundation for image sentiment analysis. As a by-product of this study, we obtain two new datasets with more explicit and definite sentimental semantics, namely Twitter 1* and FI* (Flickr and Instagram).
(2) We fully mine the implicit cross-modal sentimental semantics among heterogeneous image features by the state-of-the-art discriminant correlation analysis (DCA) [1] model. The cross-modal sentimental semantics can improve the final classification performance.
(3) Based on the above-mentioned concepts, we propose a novel model called ME2M for image sentiment analysis. This model outperforms the most competitive baselines. Besides the ME2M model, we propose multiple variants of it, which are also effective for image sentiment analysis.
(4) Experiments prove that the ME2M model is versatile owing to its flexible structure. Other useful evidence, such as new features, cross-modal analysis results, and robust samples, can be absorbed into it to achieve improved classification performance.
The remainder of this paper is organized as follows: Section II presents related works and our research motivations. The ME2M model is described in Section III. Experiments on two well-known datasets and the corresponding results are presented in Section IV. Finally, Section V provides the conclusions and future scope of the work.

II. RELATED WORKS
The work related to this study spans three aspects: image sentiment analysis, image feature learning, and cross-modal analysis. Therefore, a brief overview of the corresponding research works is presented here.

A. IMAGE SENTIMENT ANALYSIS
As an interdisciplinary technology in the fields of computer vision, pattern recognition, and cognitive science, image sentiment analysis has attracted extensive attention recently. Cambria [2] indicated that understanding emotions is an important aspect of personal development and growth. Cambria et al. [3] proposed sentic blending, which enables the continuous interpretation of semantics and sentics. Ghiasi et al. [4] proposed a quantitative sentiment calculation method by introducing fuzziness and context dynamism on language-specific lexicons found in published short informal textual content on three holistic contexts of owner, user, and the post itself. Siersdorfer et al. [5] proposed a machine learning algorithm to predict the sentiment of images through pixel-level features. However, sentiment is a highly abstract concept, and therefore, it is more appropriate to identify the sentiment of images through object-level or attribute-level information. Borth et al. [6] employed visual entities or attributes as new features to complete image sentiment analysis. Evidently, image features, both hand-engineered and deep learning-based, are significant for image sentiment analysis. Inspired by psychological theories and artistic principles, Machajdik and Hanbury [7] used a set of low-level features, including color and texture, to represent sentimental content. Zhao et al. [8] proposed a multi-task hyper-graph learning method to predict personalized sentiment in images. Ko and Kim [9] extracted color and scale-invariant feature transform (SIFT) features from images to train a probabilistic latent semantic analysis (pLSA)-based model for image sentiment analysis. Convolutional neural network (CNN)-based features or key parts (patches) of the whole image have played crucial roles in image sentiment classification. Zhu et al. [10] combined a weakly supervised learning strategy with a CNN model to complete end-to-end sentiment prediction. Durand et al. [11] aligned the corresponding image patches to gain spatial invariance and learn localized features, which helps to improve the final performance. Yang et al. [12] presented a weakly supervised coupled convolutional network to leverage localized information for image sentiment analysis. However, a single modality is insufficient for image sentiment analysis. Ange et al. [13] proposed a user-sensitive deep multimodal architecture, which takes advantage of deep learning and user data to extract a rich latent representation of a user. The architecture consists of a combination of a long short-term memory (LSTM) network, an LSTM-AutoEncoder, a CNN, and multiple deep neural networks (DNNs). Yu et al. [14] proposed an entity-sensitive attention and fusion network (ESAFN) for entity-level sentiment detection in social media posts. Ji et al. [15] proposed a novel bilayer multimodal hypergraph learning (Bi-MHG) method for robust sentiment prediction of multimodal tweets. Zhang et al. [16] proposed a cross-modal approach that considers both images and captions in classifying image sentiment polarity; this method transfers the sentiment correlation of textual content to images. You et al. [17] used other modalities, such as text and emoji, to supply additional information for image sentiment analysis. You et al. [18] trained an LSTM model to complete image sentiment classification; this model used a sentence semantic tree to absorb sentimental semantics from texts. To fully mine the sentimental semantics between images and texts, You et al.
[19] put forward a new cross-modality consistent regression (CCR) model, which can extract the implicit semantics between the two modalities based on a CNN and the distributed paragraph vector model (DPVM).
Succinctly, image sentiment analysis is an appealing research field with great potential and high practicality. However, a single feature is insufficient for image sentiment analysis. Fully using the implicit cross-modal sentimental semantics among heterogeneous image features represents the future trend of this research field. In other words, we treat heterogeneous image features as modalities in a broad sense and complete cross-modal sentimental semantics mining among them. The apparent advantage of this technique is that additional texts or other modalities are not required; the ME2M model only needs images to complete image sentiment analysis.

B. IMAGE FEATURE LEARNING
Image feature learning is a fundamental task in the field of computer vision. In this study, we use nine heterogeneous image features (i.e., SIFT [20], GIST [21], local binary pattern (LBP [22]), visual geometry group networks (VGG-16, VGG-19 [24]), residual neural networks (ResNet-50, ResNet-152 [25]), and dense convolutional networks (DenseNet-121, DenseNet-161 [26])) to characterize images from diverse visual perspectives. Different image features, including SIFT, GIST, LBP, red green blue (RGB), histogram of oriented gradients (HOG [23]), and hue saturation value (HSV), are often used to depict images. These traditional features have played key roles in most computer vision applications for nearly two decades. Among them, SIFT is a prominent feature that can capture local shape variations. GIST effectively describes global textural variations, whereas HOG is a well-known feature for representing local gradient variations. However, with the rapid development of feature engineering, other popular feature learning models, such as efficient match kernels (EMK) [27], hierarchical kernel descriptors (HKDES) [28], sparse coding (SC) [29], deep belief nets (DBN) [30], and CNNs [31], have been employed to process images more effectively. For example, the EMK model maps the original SIFT descriptors into a kernel space to improve their discriminant ability. The state-of-the-art CNN models use many convolutional and pooling layers to extract deep-level semantics from images. Hence, a robust image feature learning method is essential for image sentiment analysis. Furthermore, we should consider the complementarity among the image features. Robust features build a strong foundation for the subsequent cross-modal sentimental semantics mining.

C. CROSS-MODAL ANALYSIS
As described earlier, several recent works have focused on cross-modal analysis among heterogeneous features. A brief review of this field is presented here. Canonical correlation analysis (CCA) [32] is a traditional cross-modal analysis model that has been used in many real applications. The CCA model reflects the overall correlations between two heterogeneous modalities. Kernel canonical correlation analysis (KCCA) [33] is a variant of the CCA model that attempts to handle nonlinear mappings through kernel tricks. Recently, Uurtio et al. [34] proposed a sparse nonlinear canonical correlation method (gradKCCA). This model can derive implicit nonlinear relations through a kernel function, which means that it does not depend on the kernel matrix. The DCA model is another improved version of the CCA model. It not only maximizes the correlations of the corresponding feature components between two heterogeneous features but also weakens the correlations of feature components belonging to different categories within the homogeneous features. The mapped features are therefore more discriminative and compact. Owing to their low dimensionality, the DCA model can improve real-time efficiency and reduce the risk of over-fitting when processing a large amount of data. However, these traditional models are based on shallow frameworks, and hence, deep-level semantics cannot be effectively mined.
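Because DCA is not available in common Python libraries, the following sketch uses scikit-learn's CCA as the closest off-the-shelf stand-in to illustrate how two heterogeneous feature views can be projected into a shared correlated subspace; the feature dimensions, sample count, and random data are purely illustrative assumptions.

```python
# Illustrative sketch only: scikit-learn's CCA as a stand-in for a DCA-style
# cross-modal projection (DCA itself is not provided by scikit-learn).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_sift = rng.normal(size=(200, 500))  # placeholder for 500-D SIFT features
X_deep = rng.normal(size=(200, 512))  # placeholder for a (truncated) deep feature view

cca = CCA(n_components=32, max_iter=1000)
Z_sift, Z_deep = cca.fit_transform(X_sift, X_deep)  # both views mapped to a 32-D correlated space
print(Z_sift.shape, Z_deep.shape)  # (200, 32) (200, 32)
```

In the ME2M pipeline, the analogous DCA projection additionally uses class labels so that the mapped components remain discriminative across categories.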
Owing to the breakthrough of deep learning technology, researchers have proposed many deep learning-based models for cross-modal analysis. Andrew et al. [35] proposed the deep canonical correlation analysis (DCCA) model, which combines a DNN with the CCA model to complete deep-level correlation analysis. Feng et al. [36] proposed a correspondence autoencoder (Corr-AE) by correlating the hidden representations of two uni-modal autoencoders. Peng et al. [37] proposed a cross-media multiple deep network (CMDN) to explore cross-media correlations. However, to avoid overfitting and obtain better performance, these deep learning-based methods require more training samples.
In summary, a robust and effective cross-modal analysis model is indispensable for image sentiment analysis. The DCA model is selected to complete cross-modal sentimental semantics mining considering its efficiency and effectiveness.

D. OUR MOTIVATIONS
Based on the above-mentioned reviews, we divide our research motivations into three parts because the ME2M model should address three vital issues. The first motivation is derived from the perspective of the dataset. High-quality images are necessary for training a robust classification model. Although researchers have proposed several related datasets, unambiguous images are scarce owing to the different subjective cognition of different annotators. An effective data-augmentation or sample-refinement strategy should be designed to resolve this problem. The second motivation is derived from the perspective of heterogeneous image features. A single image feature is insufficient for image sentiment analysis. Different image features usually characterize sentimental semantics from diverse visual aspects. They complement each other to enhance the final performance. The third motivation is derived from the perspective of cross-modal analysis. Image features (i.e., texture, shape, color, and edge) often point to the same or similar sentimental semantics. Cross-modal sentimental semantics can be applied to characterize images more accurately.
To sum up, each motivation provides valuable evidence for building the ME2M model and helps to improve the final classification performance.

III. THE ME2M MODEL
A. MODEL FRAMEWORK
As previously established, the ME2M model consists of three key components: ''image feature learning,'' ''sample-refinement (SR),'' and ''cross-modal sentimental semantics mining (CSS).'' We use the FI dataset to illustrate our idea in Figure 1. As shown in Figure 1, we first design a new sample-refinement (data-augmentation) strategy to refine the ''Pending Data,'' which provides the first evidence for the ME2M model. Next, we extract a set of complementary image features, including traditional image features (i.e., SIFT (S), GIST (G), and LBP (L)) and state-of-the-art deep learning-based features (i.e., DenseNet121 (D121), DenseNet161 (D161), VGG16 (V16), VGG19 (V19), ResNet50 (R50), and ResNet152 (R152)), which provides the second evidence for the ME2M model. The traditional features are relatively sparse and usually depict the key shape, texture, and color information in images, whereas the deep learning-based features are relatively dense and contain plentiful deep-level semantics. They can complement each other effectively. Finally, we use the DCA model to complete cross-modal sentimental semantics mining, which provides the final evidence for the ME2M model. Based on the mined cross-modal sentimental semantics and a general classifier, we train a classification model for image sentiment analysis.

B. IMAGE FEATURE LEARNING
A robust image feature learning method is essential for image sentiment analysis. It builds a foundation for the subsequent cross-modal sentimental semantics mining. To achieve this goal, we consider the discriminant ability and the complementarity of each feature. Thus, SIFT (S) (a local shape feature with several good characteristics), GIST (G) (a texture feature for characterizing global texture), LBP (L) (a traditional shape feature for describing local image patches), VGG19 (V19) (a state-of-the-art deep learning-based feature that is an excellent complement to the above-mentioned traditional features), VGG16 (V16) (similar to V19), ResNet152 (R152) (a well-known deep learning-based feature whose structure is heterogeneous in comparison with other CNN models), ResNet50 (R50) (similar to R152), DenseNet161 (D161) (a well-known deep learning-based feature whose structure is heterogeneous in comparison with other CNN models), and DenseNet121 (D121) (similar to D161) are chosen to complete image feature learning. These features characterize images from different visual perspectives. In particular, their heterogeneous structures establish a vital premise for the subsequent cross-modal sentimental semantics mining.

C. SAMPLE-REFINEMENT STRATEGY
We need to refine the original dataset to obtain high-quality samples for image sentiment analysis. The key aspects of the sample-refinement strategy are as follows. First, we obtain a data subset called D3-5 from the original dataset D3 (D3 means that at least 3 Amazon Mechanical Turk (AMT) workers gave the same sentiment label to an image; similarly, D5 means that all 5 AMT workers gave the same sentiment label to an image. Compared to D3, D5 is a high-quality dataset. We remove the D5 data from the D3 dataset to obtain D3-5). Then, we use a group of classifiers to make predictions on the high-quality dataset (D5). Because there may be sentimental ambiguities in images, we use nine classifiers, namely k-nearest neighbor (KNN), logistic regression (LR), random forest (RF), decision tree (DT), naive Bayes (NB), adaptive boosting (AdaBoost [38]), gradient boosting decision tree (GBDT [39]), extreme gradient boosting-20 (XGBoost-20, with 20 weak classifiers), and extreme gradient boosting-40 (XGBoost-40, with 40 weak classifiers [40]), to make the final predictions on the images in the D5 dataset by soft voting. This helps to eliminate the sentimental bias of a single classifier and obtain the cross-modal semantics with relatively definite sentiments. Finally, according to the cross-modal semantics and the D5 dataset, we train an image classification model and obtain refined samples (Dsr, where ''sr'' stands for sample-refinement) from D3-5. Subsequently, we merge Dsr into the D5 dataset, resulting in two novel datasets: Twitter 1* and FI*. To enhance the effectiveness and efficiency of the proposed sample-refinement strategy, we use the cross-modal analysis idea to characterize each image. The proposed sample-refinement strategy is presented in Algorithm 1.

Algorithm 1 Sample-Refinement Strategy (key steps)
4. Create a feature combination on the D5 dataset.
5. Generate the cross-modal sentimental semantics (CSS) of the feature combination by the DCA model.
6. Perform image sentiment analysis based on the CSS and nine classifiers.
7. Compute the average accuracy of the current CSS.
8. Save the average accuracy and the CSS into a list called LA.
9. Until all feature combinations have been tested.
10. Obtain the best cross-modal sentimental semantics, denoted css_best, from LA.
11. Extract css_best from the D3-5 dataset.
12. Train an image classification model using the D5 dataset and css_best.
13. Test the classification model using the D3-5 dataset and css_best.
14. Retain the samples that are correctly predicted by the model.
15. Gather all refined samples to construct the Dsr dataset.

D. CROSS-MODAL SENTIMENTAL SEMANTICS MINING VIA DCA
Image features (i.e., texture, shape, color, and edge) typically point to the same or similar sentimental semantics in images, so cross-modal sentimental semantics exist among these image features. However, few works have used the cross-modal sentimental semantics among heterogeneous image features to perform image sentiment analysis. In this study, we mine the cross-modal sentimental semantics among heterogeneous image features using the state-of-the-art DCA algorithm. The resulting cross-modal sentimental semantics (CSS) depict the key visual content of images accurately and comprehensively. The procedure of cross-modal sentimental semantics mining via the DCA algorithm is detailed in Figure 2.
E. ME2M MODEL
As described in Section III (C), we propose a novel sample-refinement strategy to refine the original dataset, which builds a data foundation for image sentiment analysis. Next, we use the DCA algorithm to complete cross-modal sentimental semantics mining among diverse image features. Based on these ideas, we propose the ME2M model for image sentiment analysis, which is presented in Algorithm 2, where D'5 denotes the refined dataset, i.e., Twitter 1* (or FI*). We divide D'5 into training data D_train and testing data D_test.
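As a concrete illustration of the refinement step in Algorithm 1 (steps 12-15) and the soft voting described in Section III (C), the sketch below trains a soft-voting ensemble on the high-quality D5 data and keeps only the D3-5 samples it predicts correctly. It is a minimal sketch under assumed inputs (feature matrices already in the cross-modal space) and uses only a subset of the nine classifiers; XGBoost is omitted for brevity.

```python
# Minimal sketch of the soft-voting sample-refinement step (Algorithm 1, steps 12-15).
# X_d5/y_d5: high-quality "Five Agree" data; X_d35/y_d35: ambiguous D3-5 data.
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def refine_samples(X_d5, y_d5, X_d35, y_d35):
    ensemble = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=100)),
                    ("nb", GaussianNB()),
                    ("gbdt", GradientBoostingClassifier())],
        voting="soft")                      # average the predicted probabilities
    ensemble.fit(X_d5, y_d5)                # step 12: train on the high-quality D5 data
    pred = ensemble.predict(X_d35)          # step 13: test on the ambiguous D3-5 data
    keep = pred == y_d35                    # step 14: retain correctly predicted samples
    X_sr, y_sr = X_d35[keep], y_d35[keep]   # step 15: the refined set Dsr
    # Merge Dsr into D5 to obtain the new dataset (Twitter 1* / FI*).
    return np.vstack([X_d5, X_sr]), np.concatenate([y_d5, y_sr])
```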

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. DATASET AND BASELINES
1) DATASET
We evaluate the ME2M model on two well-known datasets, Twitter 1 and FI. The detailed information about the two datasets is as follows.
Twitter 1 [41]. The Twitter 1 dataset was collected from social networking websites and labeled with sentiment polarity categories by participants from AMT. The authors [41] recruited 5 AMT workers for each candidate image. The dataset can be downloaded from https://www.cs.rochester.edu/u/qyou/DeepSent/deepsentiment.html.
FI [42]. The FI dataset was collected by querying social networking websites with eight sentiment categories (i.e., Anger, Amusement, Awe, Contentment, Disgust, Excitement, Fear, and Sadness) as keywords. Then, 225 participants from AMT were asked to label the images, resulting in 23308 images receiving at least three agreements. The basic number distribution per class of the FI dataset is as follows. Anger: 1266, Amusement: 4942, Awe: 3151, Contentment: 5374, Disgust: 1685, Excitement: 2963, Fear: 1032, and Sadness: 2922. The FI dataset has approximately 1.5 GB of data and can be downloaded from https://www.cs.rochester.edu/u/qyou/deepemotion/index.html. As some links are invalid, we downloaded a subset of the FI dataset. The current number distribution per agreement level of the FI dataset is as follows: ''Five Agree (D5),'' ''At Least Four Agree,'' and ''At Least Three Agree (D3)'': 5238, 12644, and 21508, respectively. Hence, the current FI dataset consists of 21508 images. To improve the robustness and effectiveness of the ME2M model, we use the refinement strategy detailed in Algorithm 1 to refine the original dataset. Clearly, FI is an imbalanced dataset, which may bring more challenges for image sentiment analysis. From each dataset, 80% of the samples are randomly chosen for training, while the rest are regarded as testing samples.
Image features, namely SIFT, LBP, GIST, VGG16, VGG19, DenseNet121, DenseNet161, ResNet50, and ResNet152, are extracted to characterize images from diverse visual perspectives. However, dimensionality reduction is needed to improve real-time efficiency. For each dataset, we reduced the SIFT (S) feature to 500 dimensions, and the GIST (G) and LBP (L) features to 512 and 1180 dimensions, respectively. We selected the first fully connected layer of the VGG16 (VGG19, DenseNet121, DenseNet161, ResNet50, ResNet152) model to obtain the VGG16 (VGG19, DenseNet121, DenseNet161, ResNet50, ResNet152) feature. Hence, the VGG16 (V16) and VGG19 (V19) features have 4096 dimensions, the DenseNet121 (D121) and DenseNet161 (D161) features have 1024 and 2208 dimensions, respectively, and both the ResNet50 (R50) and ResNet152 (R152) features have 2048 dimensions. No feature reduction is applied to these deep learning-based features because they are dense and contain more valuable discriminative information.
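The paper does not specify its feature-extraction toolchain, so the following sketch only illustrates one plausible way to obtain the 4096-D output of the first fully connected layer of a pretrained VGG-16 using PyTorch/torchvision; the image path is hypothetical.

```python
# Sketch: extracting the 4096-D output of the first fully connected layer of a
# pretrained VGG-16 (an assumed setup, not the paper's exact implementation).
import torch
from torchvision import models, transforms
from PIL import Image

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vgg16_fc1_feature(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = vgg16.features(img)
        x = vgg16.avgpool(x)
        x = torch.flatten(x, 1)
        x = vgg16.classifier[0](x)   # first fully connected layer -> 4096-D
    return x.squeeze(0).numpy()

# feat = vgg16_fc1_feature("example.jpg")  # hypothetical image path
```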

2) BASELINES
We compare the ME2M model, including ME2M-LR, ME2M-NB, ME2M-DT, ME2M-KNN, ME2M-RF, ME2M-AdaBoost, ME2M-XGBoost, and ME2M-GBDT, with several types of baselines, which are enumerated as follows:
Traditional image classification models: LR, NB, DT, KNN, and RF.
State-of-the-art boosting models: GBDT, AdaBoost, XGBoost [40], and GS-XGB [43].
Multiple variants of the ME2M model (cross-modal analysis only): the DCA [1], gradKCCA [34], and CCA [32] models, which only use cross-modal analysis methods to complete image sentiment analysis.
Multiple variants of the ME2M model (sample-refinement): SR-w-DCA and SR-w/o-DCA. ''SR-w-DCA'' denotes that only the proposed sample-refinement strategy is used to complete image sentiment analysis. ''SR-w/o-DCA'' implies that a modified version of Algorithm 1 is used to complete image sentiment analysis, which uses a single image feature rather than cross-modal semantics.

B. EXPERIMENTAL RESULTS
We present our experimental results in a systematic manner: the ME2M model that only uses the cross-modal sentimental semantics is evaluated in the first subsection. The ME2M model that uses both the proposed sample-refinement strategy and the cross-modal sentimental semantics is evaluated in the second subsection. We make a deeper comparison with state-of-the-art baselines in the third subsection. Finally, we complete a qualitative analysis in the fourth subsection.

1) CLASSIFICATION RESULTS OF THE ME2M MODEL WITH CSS
As described in Section III (D), cross-modal sentimental semantics mining is a key component of the ME2M model. Image features (V16, V19, D161, D121, R50, R152, S, L, and G) point to the same or similar semantics in images, indicating the presence of cross-modal sentimental semantics. We design a group of feature combinations and mine such semantics in the combinations using the DCA model. The DCA model supports two feature early-fusion modes, summation and concatenation. We use the concatenation mode here owing to its better classification performance; Section IV (C) evaluates these two modes more comprehensively.
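The two early-fusion modes can be stated in a few lines of NumPy; the arrays below are placeholders standing in for the DCA-projected feature views of equal dimension.

```python
# Contrast of the two early-fusion modes on two projected feature views of equal
# dimension (placeholders here; in the paper these are the DCA-transformed features).
import numpy as np

rng = np.random.default_rng(0)
Z1 = rng.normal(size=(100, 64))   # view 1 after cross-modal projection
Z2 = rng.normal(size=(100, 64))   # view 2 after cross-modal projection

fused_sum = Z1 + Z2                               # summation mode: stays 64-D
fused_concat = np.concatenate([Z1, Z2], axis=1)   # concatenation mode: 128-D
print(fused_sum.shape, fused_concat.shape)        # (100, 64) (100, 128)
```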
The proposed feature combinations are denoted as SV19, SV16, SL, SR152, SR50, SG, and so on. For example, SV19 represents the cross-modal sentimental semantics between S and V19; the naming patterns of the other combinations are similar. We used nine general classifiers, namely KNN, LR, RF, DT, NB, AdaBoost, GBDT, XGBoost-20, and XGBoost-40 (XGBoost with 20 and 40 weak classifiers, respectively), to implement the final classification. In this subsection, the proposed sample-refinement strategy is not considered. The experimental results on the two datasets are shown in Figure 3. The last column of each sub-figure presents the average accuracy of each feature combination (Avg_Feat), and the last row of each sub-figure presents the average accuracy of each classifier (Avg_Clas). These two metrics are very important for evaluating the proposed model statistically.
As shown in Figure 3(a), SV19 delivers the best overall performance among all the feature combinations. The average accuracy of this feature combination (83.79%) surpasses those of the other feature combinations. Meanwhile, SV19 obtains the best accuracy (86.47%) when the LR classifier is employed. Although LR is a linear classifier, it delivers satisfactory performance owing to the application of the cross-modal sentimental semantics. Other feature combinations, such as SV16, SL, and SR152, also deliver satisfactory performances. Noticeably, the S (shape) feature is the most important visual cue for characterizing the Twitter 1 dataset. Furthermore, it complements other heterogeneous image features (e.g., V16, L, and R152) well. We found that, compared to the deep learning-based features, the traditional image features play major roles in the classification procedure. For example, R152R50 achieves the worst average performance owing to its homogeneous network structure; fewer training samples may be another vital reason. Among all the classifiers, XGBoost, NB, and GBDT deliver better performances. The ME2M model should therefore choose a suitable classifier to improve the classification performance. Certainly, the performance gaps among different classifiers are small, which demonstrates the effectiveness of the idea of multidimensional evidence.
In Figure 3(b), V19V16 achieves the best overall performance among all the feature combinations. The average accuracy of this combination is better than those of the other feature combinations. Meanwhile, SV19 achieves the best accuracy (73.71%) when the LR classifier is employed. Other feature combinations, including V19R152, D161V19, and R152V16, also achieve satisfactory performances. Clearly, the deep learning-based feature (V19) is the most important visual cue for characterizing the FI dataset. It complements other features (e.g., R152, D161, and S) well. Evidently, deep learning-based features play significant roles on this dataset. We conclude that the implicit cross-modal sentimental semantics among heterogeneous features (i.e., V19 and R152) is fully mined by the DCA algorithm. Moreover, the FI dataset has more training images, which forms an important data foundation for deep-level feature extraction. Compared to other classifiers, XGBoost, LR, and GBDT achieve better performances. However, compared to the Twitter 1 dataset, the FI dataset is more challenging owing to its higher number of sentiment categories.
In short, the SV19 combination is a good choice for completing image sentiment analysis on the Twitter 1 dataset; the traditional image features have a positive effect on a relatively small dataset. In contrast, the V19V16 combination is a good choice for completing image sentiment analysis on the FI dataset; the deep learning-based features have a positive effect on a relatively large dataset. Hence, the cross-modal sentimental semantics are effective. Moreover, it is better to choose classifiers such as XGBoost, GBDT, and LR to construct the ME2M model.

2) CLASSIFICATION RESULTS OF THE ME2M MODEL WITH CSS AND SR
As described in Section III (C), the sample-refinement strategy is a vital component of the ME2M model. We use the sample-refinement strategy (Algorithm 1) to enrich our training data and then mine the cross-modal sentimental semantics among diverse image features. The other experimental settings are similar to those in the first subsection of Section IV (B). The experimental results on the two datasets are shown in Figure 4.
In Figure 4(a), SV19 demonstrates the best overall performance among all the feature combinations, which is consistent with Figure 3(a). The average accuracy of this combination (85.30%) is better than those of the other feature combinations. Compared to the best Avg_Feat in Figure 3(a), a 1.51% improvement in performance is observed, which indicates that the proposed sample-refinement strategy is effective. Meanwhile, similar to Figure 3(a), the SV19 combination obtains the best accuracy (87.15%) when the LR classifier is employed; compared to the best accuracy in Figure 3(a), an improvement of 0.68% is observed. Other feature combinations, including SR50, SG, and SR152, also achieve satisfactory performance. Evidently, more high-quality samples with definite sentimental semantics are mined to enrich the original dataset and enhance the final performance. Similarly, the traditional SIFT (S) feature plays a more important role in image sentiment analysis. Among all the classifiers, XGBoost, NB, and RF obtain better classification performance (we compare the ME2M model with the first category of baselines in Table 1; the ME2M model outperforms the corresponding baselines). Compared to the best Avg_Clas in Figure 3(a), a 1.65% performance improvement is observed. Finally, the mean average accuracy of Figure 3(a) is 74.90%, whereas the corresponding value of Figure 4(a) is 76.38%, which further establishes that the proposed sample-refinement strategy is effective and robust.
In Figure 4(b), V19V16 achieves the best overall performance among all the feature combinations, which is consistent with Figure 3(b). The average accuracy of the V19V16 combination (70.72%) is better than those of the other feature combinations. Compared to the best Avg_Feat in Figure 3(b), a performance improvement of 2.50% is observed. Meanwhile, SV19 achieves the best accuracy (75.55%) when the LR classifier is employed; compared to the best accuracy in Figure 3(b), a 1.84% performance improvement is observed. Other feature combinations, including V19R50, SV19, and R152V19, also achieve satisfactory performances. Clearly, more high-quality samples with definite sentimental semantics are mined to improve the final classification performance. Similarly, the deep learning-based feature (V19) plays a more crucial role in image sentiment analysis. Among all the classifiers, XGBoost, LR, and GBDT achieve better performances (we compare the ME2M model with the first category of baselines in Table 1; on the Twitter 1 dataset, the ME2M-KNN model achieves the highest performance improvement (7.79%)). Higher performance improvements can be observed on the FI dataset, especially for the AdaBoost, XGBoost, and GBDT classifiers. The FI dataset has more samples with fine-grained sentimental semantics, which builds a foundation for cross-modal sentimental semantics mining. Moreover, the proposed sample-refinement strategy brings us more valuable data. Compared to the best Avg_Clas in Figure 3(b), a 3.91% performance improvement is observed. The mean average accuracies of Figure 3(b) and Figure 4(b) are 59.31% and 63.02%, respectively, which also demonstrates that the proposed sample-refinement strategy is effective and robust. Moreover, larger performance improvements are observed on the FI dataset.
This indicates that the proposed sample-refinement strategy can mine high-quality samples from the fine-grained dataset. In the future, we plan to incorporate a popular GAN-based data-augmentation model subsequent to the proposed sample-refinement strategy. We hope this idea could bring us many more valuable refined samples.
In brief, apart from cross-modal sentimental semantics mining, the proposed sample-refinement strategy is another vital component of the ME2M model. Larger performance improvements can be observed on the fine-grained FI dataset. Meanwhile, XGBoost or LR may be a good choice for constructing the ME2M model. Finally, although new refined samples are chosen to supplement the original dataset, we obtain the same or similar experimental phenomena, demonstrating the robustness of the ME2M model.

3) FINAL COMPARISONS
To further demonstrate the effectiveness of the ME2M model, we compare it with several state-of-the-art baselines. The experimental results are summarized in Table 2. Besides the best accuracy, we add the best average accuracy of the ME2M model to Table 2. The average accuracy has a certain statistical significance because nine classifiers and sixteen feature combinations are utilized in our experiments. For example, ME2M (A) represents the best average accuracy, whereas ME2M (M) represents the best accuracy.
Firstly, the DCA algorithm outperforms the other popular cross-modal analysis models by a large performance margin. Compared to the CCA and gradKCCA models, the DCA algorithm produces features of lower dimensionality, which also promotes the real-time efficiency of the ME2M model. Secondly, the proposed sample-refinement strategy (SR-w-DCA) is effective. Comparing the SR-w-DCA and SR-w/o-DCA models, we found that the DCA algorithm plays an important role in the sample-refinement strategy. This not only demonstrates the effectiveness of the DCA algorithm but also reflects the extensibility of the proposed sample-refinement strategy. High-quality samples are mined and utilized to train a more effective classification model. Thirdly, the ME2M model achieves higher performance improvements on the fine-grained FI dataset, which is similar to the corresponding conclusion of Figure 4(b). Fourthly, the ME2M model outperforms the state-of-the-art baselines in both metrics on the Twitter 1 dataset, with large performance gaps. Similar experimental phenomena can be observed on the FI dataset. The ME2M model is therefore effective and robust for image sentiment analysis. We found that, compared to the Twitter 1 dataset, the ME2M (M) model obtains a larger performance improvement (5.48%) on the FI dataset. This further indicates that the proposed sample-refinement strategy can mine high-quality samples to train a robust classification model. Hence, as expected, the data scarcity problem is suppressed to a certain degree.

TABLE 2. Classification performance comparisons between the ME2M model and several state-of-the-art baselines. (Note: the best value in each sub-part is highlighted, e.g., 87.15. The unit is %.)
Finally, Table 2 also supports an ablation analysis. For example, on the Twitter 1 dataset, a 1.48% performance degradation is obtained for the ablation ''ME2M (M) → SR-w-DCA,'' whereas a 0.68% performance degradation is obtained for the ablation ''ME2M (M) → DCA.'' These results demonstrate that the idea of cross-modal sentimental semantics mining is more important than the proposed sample-refinement strategy on a relatively small dataset. Similar experimental phenomena can be observed on the FI dataset. However, the proposed sample-refinement strategy plays a more important role on a relatively large dataset. A higher number of sentiment categories brings more information that is valuable for image sentiment analysis. This motivates us to use a deep mutual learning [45] strategy to further boost the final performance on a fine-grained dataset. In summary, in terms of both the best accuracy and the best average accuracy, the ME2M model is highly competitive on both coarse-grained and fine-grained image sentiment datasets.

4) QUALITATIVE EXPERIMENTAL ANALYSIS
We demonstrate the effectiveness of the ME2M model with the help of some qualitative examples. The experimental results for Twitter 1 and FI are presented in Figure 5 and Figure 6, respectively.
In Figure 5, most of the correct positive images predicted by the ME2M model have smiling faces, which exhibit a definite sentimental tendency. Meanwhile, some positive images contain a wealth of color information. Conversely, most of the correct negative images predicted by the ME2M model have sad faces or dark colors. Apparently, owing to the absence of faces or the presence of natural scenes in the images, image sentiment analysis is still a huge challenge on the coarse-grained dataset (Twitter 1). In Figure 6, the fine-grained sentimental predictions are also dominated by the emotion of faces, supplemented by the color of images and the comparison of perceptions. Moreover, the negative sentimental tendency is greatly affected by colors: negative sentiment images with relatively bright colors are very hard to predict correctly. In the future, we plan to use state-of-the-art color features such as EMK-Color to depict color information and thus improve the final performance.

1) EVALUATION OF THE FEATURE EARLY-FUSION MODES
We evaluate the two feature early-fusion modes, namely summation (Sum) and concatenation (Concat), here. The experimental settings are similar to those in the second subsection of Section IV (B). The experimental results are shown in Figure 7. ''Top'' means the best accuracy of the corresponding feature early-fusion mode, whereas ''Average'' means the best average accuracy of the corresponding feature early-fusion mode.
In Figure 7, on both datasets, the concatenation mode outperforms the summation mode by a large margin. We found that the concatenation mode can maximally retain the key discriminant information of the transformed features (please refer to Figure 2). In contrast, owing to the low dimensionality, the key discriminant information is diluted after the summation operation, possibly affecting the final classification performance. Hence, we choose the concatenation mode for all experiments.

2) EVALUATION OF THE DATA QUALITY OF ORIGINAL DATASET
For each sentimental polarity, the original datasets can be organized into three subsets: ''Five Agree,'' ''At Least Four Agree,'' and ''At Least Three Agree.'' For example, ''At Least Three Agree'' indicates that at least 3 AMT workers gave the same sentiment label to a given image; the corresponding dataset (D3) is a relatively low-quality dataset. ''Five Agree'' indicates that all 5 AMT workers gave the same sentiment label to a given image; the corresponding dataset (D5) is a high-quality dataset, meaning that each image has a strong or definite sentimental polarity. Hence, theoretically, the ''Five Agree'' dataset is the best original dataset for training an effective classification model and builds a data foundation for image sentiment classification. Based on the ''Five Agree'' dataset, we apply the sample-refinement strategy (Algorithm 1) and use the additional refined samples to train a robust classification model. We demonstrate this in Figure 8.
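A minimal sketch of this agreement-based partitioning, assuming hypothetical (image_id, label, agreement) records rather than the actual AMT annotation files:

```python
# Sketch: partitioning images by annotator agreement into D3, D5, and D3-5.
# The records below are hypothetical placeholders for per-image agreement counts.
records = [
    ("img_001", "positive", 5),
    ("img_002", "negative", 4),
    ("img_003", "positive", 3),
]

d3 = [r for r in records if r[2] >= 3]        # "At Least Three Agree"
d5 = [r for r in records if r[2] == 5]        # "Five Agree" (high quality)
d3_5 = [r for r in records if 3 <= r[2] < 5]  # D3-5 = D3 without D5
```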
''Mix'' means that the best traditional feature (SIFT) and any other image feature introduced above are utilized to complete cross-modal sentimental semantics mining. ''Deep'' implies that only the above-mentioned deep learning-based features are utilized to complete cross-modal sentimental semantics mining. ''5'' represents ''Five Agree,'' ''4'' represents ''At Least Four Agree,'' and ''3'' represents ''At Least Three Agree.'' In this subsection, we do not use any sample-refinement strategy.
Hence, we choose the best feature combination of the ''Mix'' and ''Deep'' modes, respectively, to draw Figure 8. As shown in Figure 8, on both datasets, the ''Five Agree'' data are the best choice for training an effective classification model. This is particularly evident on the FI dataset. The images in the ''Five Agree'' subset have higher quality. Moreover, traditional image features such as ''S'' complement other features well. In contrast, deep learning-based features alone are insufficient for characterizing sentimental semantics. The performance gap between the mixed features and the deep learning-based features is relatively smaller on the FI dataset; we infer that this is due to the different sizes of the datasets. Larger performance improvements are observed on the FI dataset when different original datasets are employed, which indicates that the classification performance is more satisfactory on the fine-grained dataset if the images in the corresponding dataset have few ambiguities. These results also motivate us to recommend the sample-refinement strategy (Algorithm 1).
In conclusion, we choose the ''Five Agree'' data in each dataset to construct the original dataset (D5). Then we use the proposed sample-refinement strategy to mine many more high-quality samples from the D3-5 dataset and supplement the original dataset. Two new datasets, named Twitter 1* and FI*, are obtained, and the ME2M model uses these new datasets to complete all our experiments.

3) EVALUATION OF THE NUMBER OF REFINED SAMPLES
High-quality data are very important for training an effective classification model. However, is more data always better? To check this, we randomly split the refined samples generated by the sample-refinement strategy (Algorithm 1) into five parts. The ME2M model uses different data ratios (20%, 40%, 60%, 80%, and 100%) to complete the following experiments. The experimental results are depicted in Figure 9. ''Top'' means the best accuracy, while ''Average'' means the best average accuracy.
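The ratio experiment can be sketched as follows, assuming placeholder arrays for D5, the refined set Dsr, and a held-out test set, and a single LR classifier instead of the full ME2M pipeline:

```python
# Sketch of the refined-sample ratio experiment: shuffle Dsr once, then merge growing
# fractions of it with D5 before training (all variable names are placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def ratio_experiment(X_d5, y_d5, X_sr, y_sr, X_test, y_test, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y_sr))
    results = {}
    for ratio in (0.2, 0.4, 0.6, 0.8, 1.0):
        take = idx[: int(ratio * len(idx))]
        X_train = np.vstack([X_d5, X_sr[take]])
        y_train = np.concatenate([y_d5, y_sr[take]])
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        results[ratio] = accuracy_score(y_test, clf.predict(X_test))
    return results
```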
In Figure 9(a), a sudden change in the performance of the ''Top'' line chart is observed when the ME2M model uses 40% of the refined samples. We deduce that these refined samples must contain valuable discriminant information. A slight performance degradation is observed when more refined samples are used. Similar experimental phenomena can be observed in the ''Average'' line chart, which demonstrates that the ME2M model is robust.
Similarly, as shown in Figure 9(b), the corresponding classification performance shows a relatively steady rise. The ME2M model achieves the best performance when 100% of the refined samples are used. Similar experimental phenomena can be observed in the ''Average'' line chart. Compared to the Twitter 1 dataset, greater performance improvements are observed when a greater number of refined samples is used. Hence, owing to the greater number of sentiment categories in the fine-grained FI dataset, the ME2M model needs more high-quality samples to improve the final classification performance.
In summary, the proposed sample-refinement strategy is effective. High-quality samples are necessary for training an effective classification model. As a trade-off, we choose 100% of the refined samples to complete our experiments.

4) VISUALIZATION RESULTS OF CROSS-MODAL SENTIMENTAL SEMANTICS
To intuitively demonstrate the effectiveness of cross-modal sentimental semantics mining, we employ the well-known t-SNE [46] tool to visualize the sample distribution after cross-modal sentimental semantics mining. The experimental results are displayed in Figure 10. Owing to space limitations, only the visualization results of the Twitter 1 dataset are provided; similar experimental results can be found on the FI dataset.
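For reference, a minimal t-SNE visualization of a fused feature matrix with scikit-learn and matplotlib looks as follows; `features` and `labels` are random placeholders for the DCA-fused representation and the sentiment labels.

```python
# Sketch: t-SNE visualization of cross-modal sentimental semantics (illustrative data only).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 128))   # placeholder fused features
labels = rng.integers(0, 2, size=300)    # placeholder sentiment labels

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
for c in np.unique(labels):
    plt.scatter(emb[labels == c, 0], emb[labels == c, 1], s=8, label=f"class {c}")
plt.legend()
plt.title("t-SNE of cross-modal sentimental semantics (illustrative)")
plt.show()
```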
In Figures 10(a) and 10(b), the distributions of the original features are chaotic. Samples of different categories congregate, which may pose a huge challenge for classifiers. In Figures 10(c) to 10(f), the boundaries between different categories become clearer when the CCA model is employed, and the aggregation of the cross-modal sentimental semantics is even more obvious when the DCA model is employed. We obtain the best classification performance if strong classifiers, such as XGBoost and GBDT, are used. Hence, Figure 10 supports all the above-discussed experimental results well.

5) EVALUATION OF THE REAL-TIME EFFICIENCY
The ME2M model is not only effective but also efficient. The model is deployed on a relatively high-performance laptop with an Intel(R) Core(TM) i7-8550 CPU, an MX150 GPU, and 8 GB of RAM. The real-time efficiency of the ME2M model is computed and listed in Table 3. (Owing to space limitations, we only use the Twitter 1 dataset, which contains 882 images, to exhibit our experimental results.) For each metric, we compute the average elapsed time per image. For the cross-modal sentimental semantics, we choose the top 3 and the worst 3 feature combinations among all 16 combinations to complete the comparisons.
As shown in Table 3, feature extraction needs more time, especially for the traditional image features such as SIFT and GIST. In contrast, the forward propagation procedures of the pretrained deep learning models are faster. Cross-modal sentimental semantics mining needs relatively little time. Compared to the traditional classifiers such as KNN, LR, RF, and DT, the boosting-based algorithms (i.e., GBDT and XGBoost) need more time owing to the use of a large number of weak classifiers. However, the boosting-based classifiers obtain higher classification performance; hence, there is a trade-off between efficiency and effectiveness. XGBoost-20, NB, and KNN are both effective (please refer to Figure 3(a)) and efficient. For example, XGBoost-20 needs 2.17 s per image to complete training and 7.52e-5 s per image to complete testing. Moreover, some entries of ''0'' can be observed in Table 3 because the test time is extremely small, especially for some linear classifiers.
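Per-image timing of the kind reported in Table 3 can be measured with a small helper such as the one below; the synthetic data and the choice of GBDT are illustrative only.

```python
# Sketch: measuring average per-image training and testing time for a classifier
# with time.perf_counter (hardware and classifier choices follow Table 3 in spirit only).
import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def per_image_times(clf, X_train, y_train, X_test):
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)
    train_time = (time.perf_counter() - t0) / len(X_train)

    t0 = time.perf_counter()
    clf.predict(X_test)
    test_time = (time.perf_counter() - t0) / len(X_test)
    return train_time, test_time

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 64)), rng.integers(0, 2, size=500)
print(per_image_times(GradientBoostingClassifier(), X[:400], y[:400], X[400:]))
```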

V. CONCLUSION AND FUTURE WORK
Image sentiment analysis is highly significant these days. On one hand, the corresponding models can capture people's emotions and analyze their mental states, which can help them obtain timely psychological treatment if needed. On the other hand, popularity predictions can be made based on image sentiment analysis, which can bring more social or economic benefits. In this paper, we proposed a simple but effective model called ME2M for image sentiment analysis. This model focuses on cross-modal sentimental semantics mining and sample-refinement. Experimental results demonstrated the effectiveness and robustness of the ME2M model.
In the future, we plan to use other state-of-the-art cross-modal analysis models, such as CCL [47] and DCCA [35], to complete cross-modal sentimental semantics mining. We also intend to incorporate a popular GAN-based data-augmentation model, such as DCGAN [48] or ACGAN [49], subsequent to the proposed sample-refinement strategy, which may offer us many more high-quality samples. Finally, we plan to use state-of-the-art color features, such as EMK-Color [50], to depict color information and thus improve the final performance.