MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

Multimodal sentiment analysis is a challenging task in the field of natural language processing (NLP). It draws on the multimodal signals in videos (natural language, facial gestures, and acoustic behavior) to infer emotional understanding. However, the contribution of any single modality to the emotional outcome is not static: as the sequence unfolds in time, the emotional attributes of a given natural-language utterance are affected by non-natural-language data, producing a vector shift in the feature space. At the same time, long-term dependencies within a single modality and long-term dependencies between multiple ''unaligned'' modalities must both be considered. To address these problems, this paper proposes the Multimodal Encoding-Decoding Network with Transformer. The model encodes multimodal data with a Bidirectional Encoder Representations from Transformers (BERT) network and a Transformer encoder to resolve long-term dependencies within modalities, and it reconstructs the Transformer decoder to resolve the weighting of multimodal data in an iterative way. The network fully accounts for the long-term dependencies between modalities and the shifting effect of non-natural-language data on natural-language data. Under identical experimental conditions, we validated our model on standard multimodal sentiment analysis datasets. Compared with state-of-the-art models, the network achieves clear improvements and strong stability.


I. INTRODUCTION
Sentiment analysis has always been a popular research direction in the field of NLP. In the early days, most of the work focused on unimodal research, mainly plain-text sentiment analysis [1], [2], in which the investigations were limited to determining the usage of words in positive and negative scenarios [3] and obtaining emotional results by analyzing the meaning of specific word combinations. Further analysis of human behavior shows that humans transmit information not only through natural language but also through non-natural language (visual and acoustic) [4]. This rich behavioral information can better help us understand human emotional intentions [5] and is considered to be the multimodal language of human beings. With the rapid development of online media, more and more people tend to use video to record their comments and opinions on products or movies. This calls for a multi-dimensional analysis of people's opinions and emotions in video to better understand the information it conveys [6]. Moreover, with the maturity of audio and video feature extraction methods [7], progress in multimodal sentiment analysis research has accelerated. Currently, modeling multimodal language for emotional understanding has become a central research direction of NLP and multimodal machine learning [8]-[10]. (The associate editor coordinating the review of this manuscript and approving it for publication was Mostafa Rahimi Azghadi.)
With further research, we found that although multimodal language information is processed simultaneously, it is still natural language that plays the decisive role in the final emotional understanding. It is difficult to analyze an actor's intentions from visual or acoustic behavior alone, because the non-natural-language behaviors of people expressing the same emotion usually differ. Suppose a person shows an emotional reaction to something: from facial expressions alone, it is almost impossible to determine whether the emotion is positive or negative. When we combine the facial behavior with different natural-language descriptions, we can clearly understand the actor's emotional intentions, and the nonverbal behavior will enhance or weaken the emotion expressed by the natural language itself. This leads to another problem. Since multimodal language communication occurs through both natural-language and non-natural-language channels, the meanings of the words and sentences humans transmit through natural language change dynamically in different nonverbal contexts [11], [12]. In other words, for a sentence that expresses positive emotion in the purely natural-language domain, the meanings of its words are fixed in the vector space. When non-natural-language behavior is introduced, the words shift in the original vector space. The change is reflected in both strength and direction, and it can even push a word's meaning toward the opposite pole.
Furthermore, the heterogeneity across modalities typically increases the difficulty of analyzing human language, because the variable sampling rate of each modal sequence leads to inherently misaligned data [13], expressed as an ''unaligned'' multimodal language sequence. The final result of multimodal emotion discrimination is therefore affected not only by the long-term dependencies within a specific modality but also by the long-term dependencies between ''unaligned'' modalities. How to coordinate the long-term dependencies within modalities with the long-term dependencies between ''unaligned'' modalities is thus a very important research topic.
In response to the above problems, we propose the Multimodal Encoding-Decoding Network with Transformer, a model for handling human ''unaligned'' multimodal languages. The main contributions of this paper are:
• A new model for processing multimodal data, which resolves the dynamic change of the weights of multimodal data along the time dimension and updates the cross-modal weight values in an iterative manner.
• A treatment of long-term dependencies within a single modality and across modalities, focusing on the shift in the meaning of words in natural-language data caused by non-natural-language data.
In order to verify the performance of our model on multimodal sentiment analysis, we conducted experiments on the benchmark CMU-MOSI and CMU-MOSEI datasets. We retrained our model and the latest previously proposed models in the same experimental environment and evaluated and compared the final results. In all benchmark tests, our model outperforms the baselines and is more stable than the other models.
The remainder of this article is organized as follows. In Sect. II, we introduce related work on multimodal emotion recognition. In Sect. III, we elaborate on the overall architecture of our model. In Sect. IV, we describe the datasets and baseline models used in the experiments in detail. In Sect. V, we present the experimental results and report the necessary analysis. In Sect. VI, we summarize our model and discuss future work.

II. RELATED WORKS
In this section, we mainly discuss the related work of multimodal sentiment analysis and briefly introduce the two basic models we will use.

A. MULTIMODAL SENTIMENT ANALYSIS
Multimodal sentiment analysis is now a popular research direction. It models natural language and non-natural language to gain emotional understanding. With the emergence of a large number of multimodal datasets (such as CMU-MOSI [14] and CMU-MOSEI [15]), scholars have successively proposed many models for multimodal sentiment analysis. In early work, fusion methods directly connected multiple modal data [16]-[19], and the primary and secondary relationships between the modalities were not studied. For example, in [16], the authors regard multimodal sentiment analysis as dynamic modeling within and between modalities: the unimodal, bimodal, and trimodal dynamics are explicitly modeled by computing the triple Cartesian product of the modality embeddings, yielding a multimodal emotion fusion tensor. In [17], the authors apply an LSTM to each modal view to learn view-specific interactions and reconstruct the LSTM memory network to learn cross-view interaction information. In [18], the authors decompose the fusion problem into multiple stages, each focusing on a subset of the multimodal signals for specialized and effective fusion; the fusion method is then combined with a recurrent neural network to model interactions over time and across modalities. In [19], the authors propose a multimodal attention framework based on recurrent neural networks to learn the joint relationships between multiple modalities and utterances, using contextual information for utterance-level emotion prediction. The multimodal fusion methods mentioned above all place the modalities on an equal footing without emphasizing primary and secondary relationships between them. Our research is closer to the work reported in [12], [20], [21], which confirms that natural-language information occupies an especially important position in multimodal sentiment analysis.
In [12], the authors proposed for the first time that a speaker's intentions usually change dynamically according to different nonverbal contexts: when modeling human language, not only the literal meaning of the words but also the nonverbal context in which they appear must be considered. To this end, the authors propose a gated modal hybrid network that dynamically shifts word representations based on nonverbal cues. In [21], the authors combine the gated modal hybrid network mentioned above with the BERT model; without changing the basic structure of BERT, the resulting model can accept nonverbal information. These studies all use textual information as the primary carrier and introduce nonverbal behavior as auxiliary information to form a multimodal emotional understanding. The core of our work is to use natural-language information as the dominant modality and non-natural-language information as the auxiliary modalities to obtain the fused vector representation in the natural-language vector space. Table 1 summarizes and compares the two multimodal fusion approaches.

B. TRANSFORMER AND BERT
Our experiments mainly involve two basic encoding networks: the transformer [22] and BERT (Bidirectional Encoder Representations from Transformers) [23]. The transformer is a non-recurrent neural architecture designed for modeling sequence data. It discards the RNN and the CNN as the basic models of sequence learning and relies entirely on the attention mechanism; consequently, the architecture cannot capture sequence order by itself. For this reason, the authors use position embeddings to represent time-series information, and the architecture ultimately outperforms recurrent structures in accuracy, speed, and depth. BERT is a successful application of the transformer and a successful language model. Its input embedding is generated by summing the token embedding, segment embedding, and position embedding; multiple encoder layers are then applied on top of these input embeddings. Each encoder layer has a multi-head attention sublayer and a feedforward sublayer, and each sublayer has a residual connection followed by layer normalization. BERT adopts an autoencoding objective, learning vector representations of masked tokens during pre-training.
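As a concrete illustration of the attention mechanism described above, the following is a minimal NumPy sketch of scaled dot-product attention and its multi-head extension. All function names and toy dimensions are ours for illustration; they are not taken from the transformer's or BERT's actual implementations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_q, T_k) attention score matrix
    return softmax(scores, axis=-1) @ V  # (T_q, d_v)

def multi_head_attention(X, weights_q, weights_k, weights_v, W_o):
    # Project X through h different linear maps, attend per head,
    # then concatenate the heads and mix them with W_o.
    heads = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(weights_q, weights_k, weights_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
T, d, h = 5, 8, 2  # toy sequence length, model dim, number of heads
X = rng.standard_normal((T, d))
Wq = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wk = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wv = [rng.standard_normal((d, d // h)) for _ in range(h)]
W_o = rng.standard_normal((d, d))
out = multi_head_attention(X, Wq, Wk, Wv, W_o)  # shape (T, d)
```

Each attention row is a convex combination of the value vectors, which is what lets the model relate any two positions in the sequence regardless of their distance.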

III. PROPOSED APPROACH
In this section, we introduce the Multimodal Encoding-Decoding Network with Transformer (MEDT) in detail. The purpose of MEDT is to address the ''unaligned'' nature of multimodal language sequences, introduce shifted word representations, and finally obtain a multimodal emotion fusion vector representation. Different from previous strategies, we adopt a joint encoding-decoding method with text as the main information and sound and image as auxiliary information to obtain the emotion fusion vector representation. Our model can be divided into two parts: 1) the unimodal encoder, which handles the long-term dependencies within each modality and encodes unimodal information; and 2) the multimodal joint-decoder, which resolves the long-term dependencies between ''unaligned'' modalities, dynamically updates the weight values between different modalities at different times, and finally produces the multimodal fusion feature representation. Fig. 1 shows the overall architecture of the model.
The input of the MEDT is multimodal sequence data. This article mainly handles three types of multimodal sequence data: natural language {Language (l)} and non-natural language {Visual (v), Acoustic (a)}, where Language is the original text data $I_l$ fed into the BERT model (see Sect. III-A1). The initial feature vectors of Visual and Acoustic are expressed as $I_m \in \mathbb{R}^{T_m \times d_m}$ $(m \in \{a, v\})$, where $T_m$ and $d_m$ denote the respective time dimension and feature dimension.

A. UNIMODAL ENCODER
In this part, we explain the unimodal encoder; we use different encoding methods for natural language and non-natural language.

1) NATURAL LANGUAGE ENCODER
We used a pre-trained BERT [23] model, which performs well in the text domain, to encode plain text and extract sentence representations with long-term dependencies. We consider that text information plays a leading role in the final results of sentiment analysis. To ensure that the BERT model can extract sentence representations containing sentiment information, we coarsely fine-tuned the BERT network on a pure-text sentiment classification dataset. The fine-tuned BERT model does not need to achieve the best possible accuracy; it is only used to ensure that a general sentence representation with emotional attributes is obtained. When we apply the model to the text in multimodal data, we also train it jointly. We apply the 12-layer BERT to the IMDB dataset [24], which contains 50,000 positive and negative reviews from the movie database, split into 36,000 training samples, 4,000 validation samples, and 10,000 test samples. The fine-tuned BERT model achieves 89.42% accuracy on the IMDB binary sentiment classification task.
Given the original text data $I_l = [I_1, I_2, \cdots, I_N]$, where $N$ is the number of samples, each sample $I_n$ $(n \le N)$ is a language sequence $I_n = [i_1, i_2, \cdots, i_T] \in \mathbb{R}^{T \times d}$ that carries $T$ word-piece tokens. Two special tokens, [CLS] and [SEP], are added to $I_n$; we later use the former to predict sentiment. We then feed $I_n$ into the input embedder, whose output after adding the token, segment, and position embeddings is the BERT input encoding

$E_n = [e_{CLS}, e_1, e_2, \cdots, e_T, e_{SEP}] \in \mathbb{R}^{T_l \times d},$

where $T_l = T + 2$ accounts for the two special tokens and $d$ is the initial encoding vector dimension. Finally, we feed $E_n$ into the fine-tuned BERT model and take the lexical embedding $X_n$ of the last layer as the text embedding

$X_l = \mathrm{BERT}(E_n) \in \mathbb{R}^{T_l \times d_l},$

where $d_l$ is the feature dimension of the language modality after the BERT network, namely 768.
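The composition of the BERT input encoding described above (token, segment, and position embeddings summed element-wise) can be sketched as follows. The embedding tables and toy sizes here are hypothetical stand-ins for BERT's learned embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, d = 100, 16, 2, 8  # toy sizes, not BERT's real ones
token_table = rng.standard_normal((vocab_size, d))
segment_table = rng.standard_normal((n_segments, d))
position_table = rng.standard_normal((max_len, d))

def bert_input_embedding(token_ids, segment_ids):
    # E_n = token embedding + segment embedding + position embedding
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

# A toy sequence [CLS] w1 w2 w3 [SEP]; the ids are chosen arbitrarily.
token_ids = np.array([1, 42, 17, 99, 2])
segment_ids = np.zeros(5, dtype=int)  # single-sentence input: all segment 0
E = bert_input_embedding(token_ids, segment_ids)  # shape (T_l, d)
```

In real BERT these three tables are learned jointly with the encoder; the sum is what gives every token a position- and segment-aware starting representation.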

2) NON-NATURAL LANGUAGE ENCODER
For the visual and acoustic data $I_m \in \mathbb{R}^{T_m \times d_m}$ $(m \in \{a, v\})$, we emulate the way the transformer encodes text and apply the transformer encoder to non-natural-language data; for convenience, we call it the Non-natural Language Transformer Encoder (NNLE). In any natural language, the position and order of words in a sentence are crucial: they are not only part of the sentence's grammatical structure but also carriers of its semantics. If a word's position or order in a sentence changes, the meaning of the entire sentence may change with it. Similarly, for non-natural-language data with a time dimension, such as continuous changes of facial expressions or voice intonation, a different ordering affects the meaning expressed. Because the transformer model discards the RNN (Recurrent Neural Network) and the CNN (Convolutional Neural Network) as basic sequence-learning models and relies entirely on the attention mechanism, it cannot capture temporal order by itself. To let the sequence carry time information, following [22], we add a positional embedding (PE) to $I_m$ and then apply a Position-wise Feedforward Network (PFN) to obtain the non-natural-language embedding $P_m \in \mathbb{R}^{T_m \times d_m}$ $(m \in \{a, v\})$ with relative position information:

$P_m = \mathrm{PFN}(I_m + \mathrm{PE}(I_m)).$

The network is position-wise because the linear-layer transformation parameters are shared across every position $t$. $\mathrm{PE}(I_m) \in \mathbb{R}^{T_m \times d_m}$ computes a fixed positional embedding for each position index of the non-natural-language data along the time dimension; we leave more details of the positional embedding to I. The NNLE has the same structure as the traditional transformer encoder (see Fig. 1, right): it consists of $N$ identical encoding layers, each layer consists of two sublayers, and each sublayer uses a residual connection followed by layer normalization.
The overall structure is summarized as follows. The two sublayers are the multi-head attention mechanism (MHA) and the position-wise fully connected feedforward network (FFN). The first sublayer, MHA, uses a self-attention block defined as a scaled dot-product function:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$

where $Q$, $K$, and $V$ are input vectors with the same shape, and $\sqrt{d_k}$ is a scaling factor with $d_k$ the feature dimension of the input vectors. Multi-head means projecting $Q$, $K$, and $V$ through $h$ different linear transformations and concatenating the resulting attention outputs:

$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_j = \mathrm{Attention}(QW_j^{Q}, KW_j^{K}, VW_j^{V}).$

The second sublayer, FFN, consists of two linear transformations with a ReLU activation after the first. Like the PFN, the FFN is position-wise because the transformation parameters are shared across all positions $t$:

$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2.$

Assume the input of the 0th layer is $Z_m^{[0]} = P_m$. In the $i$th encoding layer, the output $Z_m^{[i-1]}$ of the previous layer first passes through the multi-head attention block to obtain the intermediate output $\tilde{Z}_m^{[i]}$:

$\tilde{Z}_m^{[i]} = \mathrm{LN}\!\left(Z_m^{[i-1]} + \mathrm{MHA}\!\left(Z_m^{[i-1]}, Z_m^{[i-1]}, Z_m^{[i-1]}\right)\right),$

and then passes through the second sublayer, the feedforward network, to obtain the final output $Z_m^{[i]}$ of the $i$th encoding layer:

$Z_m^{[i]} = \mathrm{LN}\!\left(\tilde{Z}_m^{[i]} + \mathrm{FFN}\!\left(\tilde{Z}_m^{[i]}\right)\right),$

where $\mathrm{LN}$ denotes layer normalization. Finally, after $N$ encoding layers, we obtain the non-natural-language embedding $X_m = Z_m^{[N]} \in \mathbb{R}^{T_m \times d_m}$ $(m \in \{a, v\})$ with timing information.
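The fixed positional embedding PE added to the non-natural-language data can be sketched as below, following the sinusoidal formulation of [22]. The toy dimensions are ours for illustration; the paper does not publish its implementation.

```python
import numpy as np

def positional_encoding(T, d):
    # Fixed sinusoidal embedding from "Attention Is All You Need":
    #   PE[t, 2i]   = sin(t / 10000^(2i/d))
    #   PE[t, 2i+1] = cos(t / 10000^(2i/d))
    pe = np.zeros((T, d))
    pos = np.arange(T)[:, None]                  # time indices as a column
    div = 10000.0 ** (np.arange(0, d, 2) / d)    # per-pair frequency scaling
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

T_m, d_m = 6, 8  # toy time and feature dimensions for one non-language modality
I_m = np.random.default_rng(0).standard_normal((T_m, d_m))
P_m = I_m + positional_encoding(T_m, d_m)  # embedding carrying position information
```

Because the embedding is a deterministic function of the position index, the encoder can distinguish re-orderings of otherwise identical frames without any recurrence.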

B. MULTIMODAL JOINT-DECODER
In this part, we reconstruct the decoder of the transformer to obtain the multimodal fusion embedding representation; we call the result the multimodal joint-decoder. First, this network accounts for the way word vectors in the feature space are affected by non-natural-language data; second, it resolves the long-term dependence between modalities (Fig. 2 shows the overall structure). The multimodal joint-decoding layer is composed of two sublayers. The second sublayer adopts a position-wise fully connected feed-forward network, as in the NNLE. The difference between the two networks lies in the first sublayer: the multi-head cross-modal attention mechanism uses a cross-modal attention block (CM) built on a scaled dot-product function [13].
In the cross-modal attention block (CM), two modal vectors $X_\beta \in \mathbb{R}^{T_\beta \times d_\beta}$ and $X_\alpha \in \mathbb{R}^{T_\alpha \times d_\alpha}$ are given. We define the queries as $Q_\beta = X_\beta W_{Q_\beta}$, the keys as $K_\alpha = X_\alpha W_{K_\alpha}$, and the values as $V_\alpha = X_\alpha W_{V_\alpha}$, where $W_{Q_\beta} \in \mathbb{R}^{d_\beta \times d_k}$, $W_{K_\alpha} \in \mathbb{R}^{d_\alpha \times d_k}$, and $W_{V_\alpha} \in \mathbb{R}^{d_\alpha \times d_k}$ are the weights of the linear transformations. The cross-modal embedding $Y_\beta \in \mathbb{R}^{T_\beta \times d_k}$ from $\alpha$ to $\beta$ can then be obtained:

$Y_\beta = \mathrm{CM}_{\alpha \to \beta}(X_\beta, X_\alpha) = \mathrm{softmax}\!\left(\frac{Q_\beta K_\alpha^{\top}}{\sqrt{d_k}}\right)V_\alpha.$

Here, $Y_\beta$ has the same length as $Q_\beta$ (that is, $T_\beta$), but its information comes from the feature space of $V_\alpha$; $\sqrt{d_k}$ is a scaling factor. In particular, the softmax computes the attention score matrix $S \in \mathbb{R}^{T_\beta \times T_\alpha}$ from modality $\beta$ to modality $\alpha$, where the $(i, j)$th score indicates how strongly the information at the $i$th time step of modality $\beta$ correlates with the information at the $j$th time step of modality $\alpha$. Hence, the $i$th time step of $Y_\beta$ is a weighted summary of $V_\alpha$, with weights given by the $i$th row of $S$. However, $\mathrm{CM}_{\alpha \to \beta}$ is only single-head cross-modal attention. The multi-head version projects $X_\beta$ and $X_\alpha$ through $h$ different linear transformations and concatenates the attention outputs to obtain the cross-modal embedding $X_{\alpha \to \beta} \in \mathbb{R}^{T_\beta \times d_\beta}$ of modality $\alpha$ into modality $\beta$:

$X_{\alpha \to \beta} = \mathrm{Concat}\!\left(\mathrm{CM}^{1}_{\alpha \to \beta}, \cdots, \mathrm{CM}^{h}_{\alpha \to \beta}\right)W^{O}.$

For convenience in the following description, we summarize the multi-head cross-modal attention mechanism as $\mathrm{MCM}_{\alpha \to \beta}(X_\beta, X_\alpha)$. The input of the multimodal joint-decoder consists of the three encoded multimodal embeddings $X_l \in \mathbb{R}^{T_l \times d_l}$, $X_a \in \mathbb{R}^{T_a \times d_a}$, and $X_v \in \mathbb{R}^{T_v \times d_v}$, where $X_l$ is the text embedding produced by the BERT encoder, and $X_a$ and $X_v$ are the acoustic and visual embeddings with time information produced by the NNLE. In this paper, we always use the natural-language data $X_l$ as the query vector and $X_a$ and $X_v$ as the key and value vectors to obtain the cross-modal fusion embedding after the text embedding has been shifted under the influence of non-natural-language data.
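A minimal single-head sketch of the cross-modal attention block CM follows, with toy dimensions of our choosing. Note that the query modality (language) and the key/value modality (audio) may have different sequence lengths; this is exactly the ''unaligned'' case the block is designed to handle.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(X_beta, X_alpha, W_q, W_k, W_v):
    # Queries come from modality beta; keys and values from modality alpha.
    Q = X_beta @ W_q                              # (T_beta, d_k)
    K = X_alpha @ W_k                             # (T_alpha, d_k)
    V = X_alpha @ W_v                             # (T_alpha, d_k)
    S = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T_beta, T_alpha) score matrix
    return S @ V                                  # (T_beta, d_k): beta-length output

rng = np.random.default_rng(0)
T_l, d_l = 7, 8    # language side (queries)
T_a, d_a = 12, 6   # audio side (keys/values); a different length is fine
d_k = 8
X_l = rng.standard_normal((T_l, d_l))
X_a = rng.standard_normal((T_a, d_a))
W_q = rng.standard_normal((d_l, d_k))
W_k = rng.standard_normal((d_a, d_k))
W_v = rng.standard_normal((d_a, d_k))
Y_l = cross_modal_attention(X_l, X_a, W_q, W_k, W_v)  # shape (T_l, d_k)
```

The output keeps the text sequence length while its content is a weighted summary of the audio stream, so no explicit word-level alignment between the modalities is required.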
The multimodal decoder is composed of $N$ multimodal decoding layers. Assuming that the text embedding of the 0th layer is $Z_l^{[0]} = X_l \in \mathbb{R}^{T_l \times d_l}$, the $i$th decoding layer first uses the audio embedding $X_a$ as the initial keys and values to obtain the cross-modal text embedding $Z_{a \to l}^{*[i]} \in \mathbb{R}^{T_l \times d_l}$:

$Z_{a \to l}^{*[i]} = \mathrm{MCM}_{a \to l}\!\left(Z_l^{[i-1]}, X_a\right),$

where $\mathrm{MCM}$ denotes the multi-head cross-modal attention mechanism. At this point, the data in $Z_{a \to l}^{*[i]}$ still belong to the text feature space, but compared with the input they have shifted in direction under the influence of the audio information. Then, $X_v$ is passed into the cross-modal attention mechanism as the new keys and values, with $Z_{a \to l}^{*[i]}$ as the query vector, to further obtain the new cross-modal text embedding $Z_{v \to l}^{*[i]} \in \mathbb{R}^{T_l \times d_l}$:

$Z_{v \to l}^{*[i]} = \mathrm{MCM}_{v \to l}\!\left(Z_{a \to l}^{*[i]}, X_v\right).$

Here, $Z_{v \to l}^{*[i]}$ is a text embedding that contains both types of non-natural-language information. Finally, after a position-wise feed-forward network, the cross-modal text embedding $Z_l^{[i]}$ of the $i$th layer is obtained.
Finally, after $N$ decoding layers, we obtain the final cross-modal text embedding $X_{(a,v) \to l} = Z_l^{[N]} \in \mathbb{R}^{T_l \times d_l}$. We then take the feature vector of the special token [CLS] in the text embedding as the multimodal fusion representation $X_f \in \mathbb{R}^{d_l}$ and use it for sentiment analysis.
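Putting the pieces together, one decoding layer can be sketched as follows: the audio stream first shifts the text representation, the visual stream then shifts the result, and a position-wise feed-forward network completes the layer. Residual connections and layer normalization are omitted for brevity, and all weights and dimensions are illustrative, not taken from the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cm(Xq, Xkv, W):
    # Minimal single-head cross-modal attention: queries from Xq, keys/values from Xkv.
    Q, K, V = Xq @ W["q"], Xkv @ W["k"], Xkv @ W["v"]
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def decoding_layer(Z_l, X_a, X_v, Wa, Wv, W1, W2):
    # Step 1: shift the text representation with audio information (a -> l).
    Z_al = cm(Z_l, X_a, Wa)
    # Step 2: shift the result again with visual information (v -> l).
    Z_avl = cm(Z_al, X_v, Wv)
    # Step 3: position-wise feed-forward network with ReLU.
    return np.maximum(Z_avl @ W1, 0) @ W2

rng = np.random.default_rng(0)
d = 8  # toy text feature dimension; queries stay in this space throughout
mk = lambda din: {"q": rng.standard_normal((d, d)),
                  "k": rng.standard_normal((din, d)),
                  "v": rng.standard_normal((din, d))}
Z_l = rng.standard_normal((7, d))   # text embedding, T_l = 7
X_a = rng.standard_normal((12, 6))  # audio embedding, T_a = 12, d_a = 6
X_v = rng.standard_normal((9, 5))   # visual embedding, T_v = 9, d_v = 5
Wa, Wv = mk(6), mk(5)
W1 = rng.standard_normal((d, 2 * d))
W2 = rng.standard_normal((2 * d, d))
Z_out = decoding_layer(Z_l, X_a, X_v, Wa, Wv, W1, W2)  # shape (7, d): text space
cls_vec = Z_out[0]  # the [CLS] position is what feeds the sentiment head
```

Stacking this layer N times is what iteratively re-weights the non-language streams against the text, which is the core of the joint-decoder.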

IV. EXPERIMENTAL SETTINGS
In this section, we introduce our experimental settings, including the experimental datasets, baselines, and evaluations.

A. DATASETS
In this work, we use two public multimodal sentiment analysis datasets: MOSI and MOSEI. Here, we give a brief introduction to the above datasets.

1) MOSI
The CMU-MOSI [14] dataset is one of the most popular benchmark datasets for multimodal sentiment analysis. It comprises 2,199 short monologue video clips taken from 93 YouTube movie review videos. Human annotators label each sample with a sentiment score from -3 (strongly negative) to 3 (strongly positive). We further processed the dataset and divided it into a training set containing 1,284 samples, a validation set containing 229 samples, and a test set containing 686 samples.

2) MOSEI
The CMU-MOSEI [15] dataset expands its data with a higher number of utterances and greater variety in samples, speakers, and topics than CMU-MOSI. The dataset contains 22,856 annotated video segments (utterances) from 5,000 videos, 1,000 distinct speakers and 250 different topics. We also processed the dataset further and divided it into a training set containing 16,326 samples, a validation set containing 1,871 samples, and a test set containing 4,659 samples.

B. BASELINES
VOLUME 10, 2022
TABLE 2. Results for multimodal sentiment analysis on CMU-MOSI and CMU-MOSEI with aligned and unaligned multimodal sequences. For the performance indicators, ↑ means higher is better and ↓ means lower is better. (SD) is the standard deviation of the results of five experiments. For the models, (B) means that the language features are based on BERT. In Acc-2 and F1-Score, the left of the ''/'' is calculated as ''neg./non-neg.'' and the right is calculated as ''neg./pos.''.

In order to verify the performance of the MEDT, we conducted a fair comparison with the following state-of-the-art models for multimodal language analysis. These models are trained using extracted BERT word embeddings as their language input.

TFN: The Tensor Fusion Network [16] creates a multidimensional tensor to capture unimodal, bimodal, and trimodal interactions and explicitly models intramodal and intermodal dynamics.
LMF: Low-rank Multimodal Fusion [25] is an improvement of the TFN that uses a low-rank tensor to perform multimodal fusion to improve efficiency.
MulT: The Multimodal Transformer [13] adopts directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another.
MISA: Modality-Invariant and -Specific Representations [20] project each modality to two subspaces of modal invariant and specific modalities to capture cross-modal commonality and unimodal private features for task prediction fusion.
Self-MM: The Self-Supervised Multitask Multimodal sentiment analysis network [21] obtains informative unimodal representations by jointly learning a multimodal task and three unimodal subtasks. The labels for the unimodal subtasks are produced by a label generation module based on a self-supervised learning strategy. The multimodal and unimodal tasks are then trained jointly to learn consistency and difference, respectively.

C. EXPERIMENTAL DESIGN
1) EXPERIMENTAL DETAILS
We use Adam as the optimizer, with an initial learning rate of 5e-5 for the BERT natural-language encoder. The learning rate of the two non-natural-language encoders is 0.001, and the learning rate of the multimodal decoder and the other networks is 0.0001. For a fair comparison, we ran our model and the models mentioned above under the same experimental conditions. We ran each model five times and report the average performance.

2) EVALUATION METRICS
Following previous work [15], [20], emotional intensity prediction on the MOSI and MOSEI datasets is a regression task, with the mean absolute error (MAE) and Pearson correlation (Corr) as the performance indicators. Additionally, the benchmark also involves classification scores, including seven-class accuracy (Acc-7) and five-class accuracy (Acc-5) over scores ranging from -3 to 3, binary accuracy (Acc-2), and the F1 score. For the binary accuracy score, we chose two different evaluation methods. The first is negative/non-negative classification, where non-negative labels are based on scores ≥ 0 [26]. The second is the more precise negative/positive classification, where the negative and positive classes are assigned to sentiment scores < 0 and > 0, respectively [13]. We use the separator ''/'' to report the results of these two indicators, where the score on the left represents neg./non-neg. and the score on the right represents neg./pos. Furthermore, we calculate the standard deviation (SD) of the five experimental runs for each of the above evaluation indexes and use it as a stability index of the model.
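Under the conventions just stated, the evaluation metrics can be sketched as below (F1 is omitted; in practice it is typically computed with a library such as scikit-learn). The threshold choices follow the definitions above, while the rounding rule for Acc-7 is our assumption, since the text does not state one.

```python
import numpy as np

def eval_metrics(y_pred, y_true):
    # Regression metrics over continuous sentiment scores in [-3, 3].
    mae = np.mean(np.abs(y_pred - y_true))
    corr = np.corrcoef(y_pred, y_true)[0, 1]  # Pearson correlation
    # Acc-2, variant 1: negative vs non-negative (non-negative means score >= 0).
    acc2_nonneg = np.mean((y_pred >= 0) == (y_true >= 0))
    # Acc-2, variant 2: negative vs positive, dropping exactly-neutral samples.
    keep = y_true != 0
    acc2_pos = np.mean((y_pred[keep] > 0) == (y_true[keep] > 0))
    # Acc-7 (assumed rule): round to the nearest integer class in [-3, 3].
    acc7 = np.mean(np.clip(np.round(y_pred), -3, 3) == np.round(y_true))
    return {"MAE": mae, "Corr": corr,
            "Acc-2 (neg/non-neg)": acc2_nonneg,
            "Acc-2 (neg/pos)": acc2_pos, "Acc-7": acc7}

# Toy predictions against toy ground-truth labels.
y_true = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
y_pred = np.array([-1.8, -0.2, 0.4, 1.0, 2.6])
m = eval_metrics(y_pred, y_true)
```

Note that the two Acc-2 variants differ only in how the score 0 is treated, which is why papers report both sides of the ''/''.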

V. RESULTS AND DISCUSSION
In this section, we present a detailed analysis and discussion of the experimental results on the CMU-MOSI and CMU-MOSEI datasets.
A. RESULTS
TABLE 3. Examples from the CMU-MOSI dataset. The true emotional label lies between strongly negative (−3) and strongly positive (+3). According to the different ''data settings'', we performed fitting experiments on the ''aligned'' and ''unaligned'' data.

Table 2 shows the comparison results on the MOSI and MOSEI datasets. For a fair comparison, we experimented with our model and the benchmark models under the same experimental conditions. Following previous work [20], [21], we tested our model (MEDT) on ''aligned'' data and ''unaligned'' data according to the different ''data settings'' and compared its results with those of the benchmark models. First, we applied our model and the benchmark models to ''unaligned'' data. Compared with the benchmark models, our model achieved significant improvements on all evaluation indicators. As mentioned earlier, when the TFN and LMF networks perform multimodal sentiment analysis, each modality has an equal effect on the final sentiment result. Our model, MEDT, takes natural-language data as the dominant information, iteratively updates the weight ratio of non-natural-language data to natural-language data, and dynamically obtains multimodal fusion feature representations to produce the emotional results. As can be seen from Table 2, our method substantially improves on both networks in the classification and regression metrics. For example, on the ''unaligned'' MOSI data, especially in the regression task, the mean absolute error (MAE) dropped directly by 23.61 points, and the binary classification accuracy and F1 score also rose steadily by 5 points. Moreover, compared with the state-of-the-art model Self-MM, the accuracy also improved. Then, for the ''aligned'' data, we applied the MISA and Self-MM models, which performed well on the ''unaligned'' data, to the ''aligned'' data, and our model still obtained the best results.
In addition to the basic evaluation indicators, we also recorded the results of the five experiments and calculated the standard deviation to show the stability of the model. The results show that the standard deviation of our model on all evaluation indicators is low, with fluctuations generally between 0.2 and 0.9. Compared with other models, our model shows strong stability and resistance to randomness, meaning that it can produce relatively stable outputs under different conditions. In Fig. 3, we show the results of the five experiments on the Pearson correlation (Corr) indicator for the regression task on the two datasets. Our model maintains a high Corr while also ensuring stability.

B. DISCUSSION
Weighting each modality's data and enhancing the modality most relevant to a particular task can better achieve the desired results, and our model further validates this idea. However, our model is less flexible than methods that automatically determine the dominant modality by building a weight distribution model. Taking the three modalities in this paper as an example, natural language (text) and non-natural language (audio and visual), our model assumes in advance that natural language is dominant for emotion acquisition and treats non-natural language as the factor that shifts natural language, dynamically adjusting the weight values between them. The difference from the automatic approach is that the dominant modality is fixed by an artificial assumption rather than learned by the model itself, although the final results are still strong. This suggests a clear direction for our next step: letting the model learn and determine the dominant modality autonomously. At the same time, Table 2 shows that the evaluation results of our model on aligned and unaligned data differ slightly, with the indicators on unaligned data generally better than those on aligned data. The reason is that early multimodal feature fusion was based on the concatenation of feature vectors, which requires the modal sequences to be consistent in the time dimension for the convenience of model building; this inevitably loses some information. Our model does not need to account for this factor, and its access to more information helps explain its advantage.

C. QUALITATIVE ANALYSIS
To give a more intuitive view of our model's performance on the regression task, in Table 3 we selected multimodal samples from the MOSI dataset to display the results. We applied the model to ''aligned'' data and ''unaligned'' data and fitted the true values separately. Our model fits samples expressing strong emotions better: for both strongly positive and strongly negative samples, the fitting error of our two experimental results against the true value stays within ±0.1, while for neutral samples the fitting error is approximately ±0.25. Overall, our model shows an excellent emotional fitting effect.

VI. CONCLUSION
In this article, we introduce a Multimodal Encoding-Decoding Network with Transformer (MEDT) for multimodal sentiment analysis. We use different encoding methods for the three modalities: for language data, we use a pre-trained BERT model to obtain lexical embeddings; for visual and acoustic data, we use the Transformer encoder to obtain embedding representations of the non-natural language data. Finally, we reconstructed the decoder to obtain cross-modal multimodal embedding representations. Our model resolves long-term dependencies both within a specific modality and across modalities, and accounts for the offset of text embeddings under the influence of non-natural language information. Our experiments have proven the superior performance of MEDT. However, our model was designed around the idea of taking natural language as dominant and non-natural language as auxiliary information, which greatly limits the flexibility to switch dominance between modalities in the model. Next, we will focus on flexibly determining the dominant information among multiple modalities and propose a better model for multimodal sentiment analysis. In addition, we will envision better multimodal data alignment methods to better fit our models.

DISCLOSURES
The authors declare no conflict of interest.

APPENDIX I. POSITIONAL EMBEDDING
Because the Transformer model abandons RNNs and CNNs as the basic model of sequence learning and adopts the attention mechanism entirely, it has no inherent ability to capture sequence order: permuting the input sequence does not change the Transformer's behavior or its output. To solve this problem, following the work of [13], we use sine and cosine functions to encode the position information of a sequence of length T, with the frequency determined by the feature dimension index. In particular, for the sequence X ∈ R^(T×d) (where T is the length) we define the positional embedding PE ∈ R^(T×d) as a matrix, where:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos is the position index in the time dimension and i ∈ [0, d/2) indexes the feature dimension. Therefore, each feature dimension of PE traces a sinusoidal pattern over positions. After calculation, the positional embedding is added directly to the sequence, so that X + PE encodes the position information of the element at each time step.
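The positional embedding above can be sketched in a few lines of NumPy; this is a minimal illustration of the standard sinusoidal scheme of [13] as described in this appendix, assuming the feature dimension d is even:

```python
import numpy as np

def positional_embedding(T: int, d: int) -> np.ndarray:
    """Sinusoidal positional embedding PE of shape (T, d).

    PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    Assumes d is even.
    """
    pe = np.zeros((T, d))
    pos = np.arange(T)[:, None]        # position index in the time dimension
    i = np.arange(0, d, 2)[None, :]    # even feature-dimension indices 2i
    angle = pos / np.power(10000.0, i / d)
    pe[:, 0::2] = np.sin(angle)        # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)        # odd dimensions use cosine
    return pe

# X + positional_embedding(T, d) then encodes the position at each time step.
```

Because the embedding is simply added to X, the input projections must map each modality's features to the same dimension d before the encoder is applied.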
QINGFU QI was born in Liaocheng, Shandong, in 1997. He received the bachelor's degree from the Qilu University of Technology, in 2019. In 2019, he began studying for a master's degree in control engineering at the Tianjin University of Science and Technology. He is currently pursuing the joint master's degree with the Tianjin University of Science and Technology and the Tianjin Sino-German University of Applied Sciences. He has published one academic paper as the first author. His research interest includes multimodal sentiment analysis.

He is a member of the Tianjin ''131'' Innovative Talent Team. In the past few years, he has published more than ten high-level papers, including two SCI-indexed and six EI-indexed papers as the first or corresponding author, and holds four national invention patents. His main research projects consist of four horizontal projects with funding of more than 1.5 million and three provincial and ministerial-level projects with funding of more than ten million. His research interests include machine learning, multimodal analysis, water-air communication, and heterogeneous detection.
CHENGRONG XUE was born in Tongcheng, Anhui, in 1996. He received the bachelor's degree from Huaibei Normal University, in 2019. He is currently pursuing the master's degree in electronic information with the Tianjin University of Science and Technology, with a focus on natural language processing and machine vision.

VOLUME 10, 2022