TA-SBERT: Token Attention Sentence-BERT for Improving Sentence Representation

A sentence embedding vector can be obtained by attaching global average pooling (GAP) to a pre-trained language model. The problem with such a sentence embedding vector is that the GAP assigns the same weight to every word appearing in the sentence. We propose a novel sentence embedding model, Token Attention SentenceBERT (TA-SBERT), to address this problem. The rationale of TA-SBERT is to enhance the performance of sentence embedding through three strategies. First, we convert words to their base form while preprocessing the input sentence to reduce ambiguity. Second, we propose a novel Token Attention (TA) technique that distinguishes important words to produce more informative sentence vectors. Third, we increase the stability of fine-tuning and avoid catastrophic forgetting by adding a reconstruction loss on the word embedding vectors. Extensive ablation studies demonstrate that our TA-SBERT outperforms the original Sentence-BERT (SBERT) in sentence vector evaluation on semantic textual similarity (STS) tasks and the SentEval toolkit.


I. INTRODUCTION
Natural language processing (NLP) is a field of machine learning that enables computers to recognize and analyze human language. Performance in this field varies greatly depending on how input natural language data are vectorized through embedding. Conventional static word embeddings such as Word to Vector (Word2Vec) [1], FastText [2], and Global Vectors for Word Representation (GloVe) [3] cannot produce word embedding vectors that reflect the meaning of the context. This problem has been alleviated by the emergence of contextualized word embeddings such as Embeddings from Language Models (ELMo) [4], Bidirectional Encoder Representations from Transformers (BERT) [5], Generative Pre-Training (GPT) [6], and Robustly Optimized BERT Pre-Training Approach (RoBERTa) [7], which produce word vectors that reflect the surrounding context.
Transformer [8] is the base model used to extract contextualized word embeddings. The main idea of the Transformer is that multi-head attention processes data in parallel while the temporal order of time-series data is preserved by adding positional embeddings to the embedding layer.
Unlike the recurrent neural network (RNN), which cannot capture long-term dependencies due to vanishing gradients, the Transformer solves this problem by processing input data simultaneously, without sequential processing, using a self-attention mechanism and fully connected layers. The convolutional neural network (CNN), on the other hand, has limited receptive fields due to the kernel size, whereas the Transformer uses an attention module that computes the interaction among all inputs. Recent pre-trained language models (e.g., ELMo, BERT, GPT, and RoBERTa) are trained on large corpora using only the encoder or decoder of the Transformer, dramatically improving the performance of downstream tasks in NLP.
The limitation of a pre-trained language model is that it requires extensive computation time for sentence-pair regression tasks such as clustering and sentence similarity analysis. This problem can be solved by embedding the sentence. Recently, various sentence embedding methods have been proposed with different generating mechanisms, such as averaging word embedding vectors or using special tokens from a pre-trained language model. To our knowledge, Sentence-BERT (SBERT) [9], which performs supervised learning on the average of the output vectors of BERT, is an efficient sentence embedding approach.
Even the same word has different meanings in different sentences. Therefore, when interpreting a sentence, the influence of a word depends on its role and context. However, the traditional sentence embedding method used in SBERT does not take these characteristics into account because it assumes that all words appearing in the sentence carry the same weight. To address this issue, we introduce Token Attention-SBERT (TA-SBERT), which produces a more informative sentence vector by assigning higher weights to important words. Consequently, our method improves the performance of downstream tasks thanks to the higher quality of the sentence vectors.
Our method has three main contributions. First, when an apostrophe (') appears in a sentence, existing tokenizers merely split the text into token units without considering the base form of the input sentence. In our method, abbreviations are converted to their base form before being divided into token units and then used as input to the neural network. Second, when we understand a sentence, we do not interpret it by giving the same weight to every word. Our Token Attention (TA) technique reflects this behavior by finding important words and assigning them high weights when generating the sentence vector. Third, BERT and RoBERTa perform a masked-language modeling (MLM) task when pre-training on large datasets. The MLM task randomly selects some tokens, modifies them in one of three ways (mask, replace, or keep), and then predicts the original tokens by adding a linear projection layer to the output vectors. However, when only the sentence vector is trained, the pre-trained weights change significantly and the stability of fine-tuning decreases. We address this issue by adding a reconstruction loss on the word embedding vectors to avoid catastrophic forgetting.

II. RELATED WORK
In NLP, model performance on various tasks depends primarily on how input data are transformed into vectors. To vectorize the input data, a tokenization operation is performed first. Tokenization includes subword tokenization methods such as byte pair encoding (BPE) [10], WordPiece encoding [11], the Unigram language model [12], and SentencePiece [13].
Subword tokenization solves the out-of-vocabulary (OOV) problem that affected earlier tokenizers: it generates tokens that convey the meaning of words and allows the vocabulary size to be adjusted. The BPE method splits the text into character units and merges the most frequent combinations of characters based on their frequency in a given dataset. In contrast, the WordPiece tokenizer merges the characters with the highest likelihood.
Even if the input word is not in the vocabulary, tokenization can proceed by composing the word from other tokens in the vocabulary.
In contrast to BPE and WordPiece, the Unigram algorithm defines a loss over the training data given the current vocabulary and a unigram language model. For each symbol in the vocabulary, the algorithm calculates how much the overall loss would increase if that symbol were removed, and it gradually removes symbols from the current vocabulary. The tokenization methods mentioned so far assume that the input text uses spaces as delimiters. In contrast, SentencePiece can train subword models directly from raw sentences without pre-tokenizing them into word sequences.
The task of encoding the meaning of tokens and converting them into vectors for text analysis is referred to as word embedding. Static word embedding methods include techniques such as Word2Vec [1], FastText [2], and GloVe [3]. Word2Vec has two models: Continuous Bag of Words (CBOW) predicts the middle word from its surrounding words, whereas Skip-Gram predicts the surrounding words from the middle word. FastText is an extension of Word2Vec and trains by treating each word as a collection of subwords. GloVe trains by reflecting the statistics of the whole corpus and the relationships between surrounding words.
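As a concrete illustration of the two Word2Vec training modes, the following gensim sketch trains both variants; the toy corpus and hyperparameters are placeholders of our own, not values from this paper:

```python
from gensim.models import Word2Vec

# toy corpus; in practice this would be a large tokenized text collection
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "chased", "the", "cat"]]

# sg=0 -> CBOW: predict the middle word from its surrounding words
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the surrounding words from the middle word
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)   # (50,) static vector, identical in every context
```

The last line highlights the limitation discussed next: the stored vector for "cat" is the same regardless of the sentence in which the word appears.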
These static word embedding methods store the trained vectors and use the stored vectors to vectorize input data. Because the same word can have different meanings in different contexts, the stored vector may fail to express the corresponding input. This problem can be addressed using a pre-trained language model that extracts contextualized word embeddings, such as ELMo [4], BERT [5], GPT [6], and RoBERTa [7].
The pre-trained language models are constructed using only the encoder or decoder of the Transformer and are pre-trained on large corpora. These models extract context-aware word vectors by taking the tokenized and vectorized input data as the input of the language model. Contextualized word embedding vectors that consider the context of the input data outperform static word embedding vectors in various downstream tasks.
Most NLP models extract sentence vectors from word embedding vectors by adding a global average pooling (GAP) at the end of the model. Sentence to Vector (Sent2Vec) [14] extracts a sentence vector using a GAP. The training method of Sent2Vec is the same as that of Word2Vec; the difference is that Sent2Vec dynamically adjusts the window size to fit the length of the sentence instead of using a static window size when creating the training dataset.
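To make the GAP operation concrete, the sketch below (PyTorch, with tensor names of our own choosing) computes a sentence vector as a padding-aware mean of token embeddings. Note that every real token receives exactly the same weight, which is the limitation our TA method addresses later:

```python
import torch

def global_average_pooling(token_embeddings: torch.Tensor,
                           attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings into one sentence vector per input.

    token_embeddings: (batch, seq_len, dim) output of a language model
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)     # ignore padding positions
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts                            # (batch, dim)
```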
Skip-Thought [15] has an encoder-decoder RNN configuration and trains the neural network by generating the preceding and following sentences from the decoder using the hidden state of the last layer of the encoder. The final hidden state of the trained encoder is used as the sentence vector. InferSent [16] generates sentence vectors through max-pooling over the word vectors obtained with a bidirectional long short-term memory (BiLSTM) model. InferSent consists of Siamese networks and uses NLI datasets (the Stanford Natural Language Inference (SNLI) corpus [17] and the Multi-Genre NLI (MNLI) corpus [18]) for training. The neural machine translation framework with multiple encoders and decoders [19] trains them simultaneously to achieve universal multilingual and multimodal representations. Multi-task dual encoder training [20] embeds text from 16 languages into a single semantic space using a multi-task trained dual encoder that learns tied representations via translation-based bridge tasks.
Universal Sentence Encoder (USE) [21] converts the Transformer encoder output vectors into sentence vectors using mean-pooling. USE is trained with SNLI, Wiki, and News data in a Siamese network configuration. Sentence-BERT (SBERT) [9] uses mean-pooling over the word vectors obtained with BERT as the sentence vector and trains this model using Siamese networks and NLI data. SBERT-WK [22] re-weights the output vectors by using the fluctuation trend of the word representations across the layers of the SBERT encoder and extracts the final sentence vector through a weighted sum.
CNN-SBERT [23] extracts a sentence vector by adding a CNN module. The input length is fixed to match the matrix dimension of the CNN module, and padding is added to enforce the same length for all data. CF-SBERT [24] uses the Siamese BERT and extracts the sentence vector using a GAP. A new sentence is generated from the important components obtained through Part-Of-Speech (POS) tagging of the input. When training and running inference with CF-SBERT, the original and generated sentences are paired and used as input.
This paper demonstrates that performance can be improved without increasing the size of datasets or the number of trainable parameters by preprocessing the input data. Furthermore, an attention module and reconstruction layer are added to SBERT to enable the model to train important words from the input data and avoid catastrophic forgetting.

III. PROPOSED METHOD AND MODEL ARCHITECTURE
A. BASE FORM CONVERSION METHOD
During the tokenization of each language model, even the same input sentence may have different tokens according to the vocabulary of the tokenizer. The tokenizer vocabularies used by the pre-trained BERT and RoBERTa have 30,522 and 50,257 words, respectively. An example of the results using the BERT and RoBERTa tokenizers is depicted in Fig. 1(a) and 1(b). The BERT tokenizer with a relatively small vocabulary decomposes the apostrophe, whereas the RoBERTa tokenizer does not decompose the apostrophe (').
In contrast, both tokenizers generate a "won" token. Unfortunately, the word "won" has various meanings, such as the past tense of "win" or the Korean monetary unit. When we see the word "won't," we intuitively understand it as "will not," but a machine cannot. We address this issue by preprocessing the input sentence with a base form conversion step that removes apostrophes. The apostrophe has various uses; for example, it forms possessive nouns and marks the omission of letters. If an input sentence containing abbreviated words with apostrophes is fed into a natural language model, the model may not interpret the words as intended. We reduce this uncertainty by converting words to their base form and removing apostrophes during preprocessing with the spaCy Python library, as depicted in Fig. 1(c). Our base form conversion method improves performance by helping the neural network analyze the contextual meaning of words during fine-tuning (Section IV-B).
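A minimal sketch of this preprocessing step is shown below. It relies on spaCy's lemmas to expand contractions (spaCy splits "won't" into "wo" + "n't", whose lemmas are "will" and "not"); the exact rules and apostrophe handling in our pipeline may differ, so treat this only as an illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline with a lemmatizer

def to_base_form(sentence: str) -> str:
    """Replace every token with its lemma so that abbreviated forms such as
    "won't" become "will not"; any remaining apostrophes are then stripped."""
    doc = nlp(sentence)
    converted = " ".join(token.lemma_ for token in doc)
    return converted.replace("'", "")

print(to_base_form("He won't eat the apple."))
# e.g. "he will not eat the apple ." with a spaCy v3 English model
```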

B. TOKEN ATTENTION
FIGURE 2: Our proposed Token Attention architecture
In interpreting the meaning of a sentence, the importance of each word differs. For example, when a sentence is classified as positive or negative, specific words of affirmation control the overall context of the sentence, while the remaining words play an auxiliary role and provide additional explanation. Therefore, while generating the sentence vector, it is more reasonable to adjust the weights dynamically than to treat all words equally as in a GAP. We propose a novel TA method that finds dynamic weights for important words in the sentence. Fig. 2 illustrates the proposed TA architecture. The word embedding vector E obtained from the pre-trained language model is used as the TA input. The query, key, and token matrices are generated from the dot product of the matrix E and the corresponding trainable weight matrices, as shown in (1):

Q = E W_q, K = E W_k, T = E W_t     (1)

In (1), Q, K, and T are the query, key, and token matrices (Q ∈ R^{s×d}, K ∈ R^{s×d}), where s is the sentence length and d the embedding dimension. The token matrix has the form T ∈ R^{s×1} and activates only the important features of each token.
The attention map A is calculated according to (2). It contains information on how closely the input word embedding vectors E are related to each other and is computed in the same way as the query-key interaction of single-head attention:

A = softmax(Q K^T / √d)     (2)

Equation (3) then calculates the influence O of each token in the sentence; s_max in (3) denotes the temperature scaler of the TA, which is used to control the variance of O based on the input sentence length.
The sentence embedding vector V is calculated as shown in (4) by summing, over the tokens, the element-wise product between the token influence obtained from the TA and the word embedding vector E:

V = Σ_{i=1}^{s} (O′ ⊙ E)_i     (4)

In (4), O′ denotes the influence vector O expanded to match the dimensions of E so that the element-wise multiplication can be performed. Using Token Attention, we thus obtain a sentence embedding vector weighted according to the importance of each input token.
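A minimal PyTorch sketch of the TA computation described by (1)-(4) follows. The bias-free linear projections, the placement of the softmax in (3), and the use of the sentence length as the temperature s_max are our reading of the text, so treat this as an illustration rather than the reference implementation:

```python
import math
import torch
import torch.nn as nn

class TokenAttention(nn.Module):
    """Weights each token embedding by a learned importance score
    before summing into a sentence vector (cf. Eqs. (1)-(4))."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # query projection
        self.w_k = nn.Linear(dim, dim, bias=False)   # key projection
        self.w_t = nn.Linear(dim, 1, bias=False)     # token (importance) projection

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (batch, s, dim) contextualized word embeddings
        Q, K, T = self.w_q(E), self.w_k(E), self.w_t(E)                       # Eq. (1)
        A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(E.size(-1)),
                          dim=-1)                                             # Eq. (2)
        s_max = E.size(1)                           # temperature tied to length (assumption)
        O = torch.softmax((A @ T).squeeze(-1) / s_max, dim=-1)                # Eq. (3)
        V = (O.unsqueeze(-1) * E).sum(dim=1)                                  # Eq. (4)
        return V                                    # (batch, dim) sentence vector
```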

C. RECONSTRUCTION LOSS FOR AVOIDING CATASTROPHIC FORGETTING
Language models are pre-trained primarily with the MLM task. The task randomly selects some tokens (15%) from the tokens of the input sentence. Of the selected tokens, 80% are replaced by [MASK] tokens, 10% are replaced by other random words, and the remaining 10% are left unchanged. The model is then trained to predict the actual IDs of the randomly selected tokens. Our reconstruction loss is constructed differently from the MLM training objective. In the MLM task, a random subset of the tokens is masked, and the objective is to predict the correct identities of the masked tokens. However, we must consider all words in an input sentence, without masking any of them, to estimate the importance of every word. Therefore, the reconstruction loss is set to predict the IDs of all tokens. The purpose of this reconstruction loss is to increase the stability of fine-tuning and thereby avoid the catastrophic forgetting problem.
In (5), a linear projection (W_r ∈ R^{C×d}, where C is the vocabulary size of the tokenizer) maps the word embedding vector E onto the vocabulary of the tokenizer:

r = E W_r^T     (5)

We use the cross-entropy loss to optimize the reconstruction objective function between the input token IDs and r.
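A minimal sketch of this reconstruction head is given below. The logit shape follows the stated W_r ∈ R^{C×d}; the handling of padding positions via an ignore index is our own addition:

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Projects every word embedding back onto the tokenizer vocabulary
    and asks the model to recover the original token IDs (cf. Eq. (5))."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.w_r = nn.Linear(dim, vocab_size, bias=False)   # W_r in R^{C x d}
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)    # set padding IDs to -100

    def forward(self, E: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # E: (batch, s, dim), input_ids: (batch, s); unlike MLM, no token is masked
        logits = self.w_r(E)                                 # (batch, s, C)
        return self.ce(logits.view(-1, logits.size(-1)), input_ids.view(-1))
```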

D. TRAINING AND INFERENCE MODEL
SBERT, based on BERT and RoBERTa, uses Siamese and triplet network structures. Similar to SBERT, our TA-SBERT uses the Siamese network structure, as illustrated in Fig. 3. The proposed TA-SBERT adds the reconstruction loss and therefore has two objective functions: (i) the sentence objective function, which trains the sentence vector, and (ii) the reconstruction objective function, which trains the reconstruction loss. The sentence objective function is calculated from the sentence vectors of the sentence pair as described in (6).
In (6), u and v are the sentence embedding vectors of the sentence pair obtained by the pre-trained language model and Token Attention through (1)-(4). We concatenate the sentence embedding vectors u and v with their element-wise difference |u−v| and multiply the result with a trainable weight matrix W_s ∈ R^{k×3d}, where k is the number of label classes in the dataset:

s = softmax(W_s [u; v; |u−v|])     (6)

We use the cross-entropy loss to optimize the sentence objective function between the target labels in the dataset and s in (6). The reconstruction objective function is calculated as described in Section III-C. When our model is trained with a 1:1 ratio between the sentence objective function and the reconstruction objective function, it tends to overfit the reconstruction task.
Therefore, we introduce a temperature scaler to match the convergence rate between the sentence objective function and the reconstruction objective function, as shown in (7):

L_total = L_sen(u, v) + 0.017 × (L_recon(r_1) + L_recon(r_2))     (7)

Here, r_1 and r_2 are generated from the word embedding vectors of the sentence pair as described in (5). Based on our extensive experiments, the optimal ratio of the temperature scaler between the sentence objective function and the reconstruction objective function is 1:0.017. Fig. 4 illustrates the structure for inference with the trained model. First, the input sentence is converted to its base form (Section III-A) and a contextualized word embedding vector is obtained through the language model. Second, the contextualized word embedding vector is passed to the TA to obtain the importance of the input words. Finally, the sentence vector is extracted based on (4) using both the contextualized word embedding vector and the importance values from the TA.
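Putting the two objectives together, a sketch of one training step is shown below. The feature concatenation and classifier shape follow (6), and the loss combination follows (7); the function and variable names are our own:

```python
import torch
import torch.nn as nn

def training_step(u, v, labels, recon_loss_1, recon_loss_2, classifier: nn.Linear):
    """One TA-SBERT training step combining the sentence objective (6)
    with the down-weighted reconstruction objective (7).

    u, v:          (batch, d) sentence vectors of the pair
    labels:        (batch,) NLI labels (entailment / contradiction / neutral)
    recon_loss_*:  scalar reconstruction losses for each branch (Eq. (5))
    classifier:    nn.Linear(3 * d, k), i.e. W_s in R^{k x 3d}
    """
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)          # (batch, 3d)
    sentence_loss = nn.functional.cross_entropy(classifier(features), labels)
    total = sentence_loss + 0.017 * (recon_loss_1 + recon_loss_2)   # Eq. (7)
    return total
```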

IV. EXPERIMENT
In this section, we analyze the efficacy of our three main techniques for improving the quality of the sentence vector. We evaluate the performance of TA-SBERT on semantic textual similarity (STS) and classification tasks. For the STS tasks, we use the cosine similarity to compare two sentence vectors; the Pearson's and Spearman's rank correlation coefficients are then calculated between the cosine similarities of the sentence vectors and the gold labels. For the classification tasks, we use the SentEval [25] toolkit to measure the quality of the sentence vector. We use pre-trained BERT and RoBERTa models from Hugging Face (https://huggingface.co/) [26].
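The STS scoring procedure described above can be sketched as follows; the sentence vectors may come from any encoder, and the function names are placeholders of our own:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(vectors_a: np.ndarray, vectors_b: np.ndarray, gold: np.ndarray):
    """Score sentence-pair similarity against gold labels.

    vectors_a, vectors_b: (n, d) sentence vectors of the two sides of each pair
    gold:                 (n,) human similarity scores (e.g. 0-5)
    """
    # cosine similarity per pair
    a = vectors_a / np.linalg.norm(vectors_a, axis=1, keepdims=True)
    b = vectors_b / np.linalg.norm(vectors_b, axis=1, keepdims=True)
    cosine = (a * b).sum(axis=1)
    return pearsonr(cosine, gold)[0], spearmanr(cosine, gold)[0]
```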

A. TRAINING DETAILS
In all experiments, we train TA-SBERT on the NLI dataset that combines SNLI and MultiNLI. SNLI is a collection of 570k sentence pairs labeled as entailment, contradiction, or neutral. The MultiNLI corpus is a collection of 433k sentence pairs annotated with textual entailment information. We fine-tune TA-SBERT with one sentence objective function on the NLI dataset and two reconstruction objective functions. We train all our models with a batch size of 16, one epoch, and the AdamW optimizer with a linear learning-rate warm-up over the first 10% of the training data. We use a learning rate of 2e-5 for the pre-trained language models (BERT and RoBERTa) and 3e-5 for the TA and the other parameters. All parameters of the TA are initialized from a uniform distribution with a mean of 0 and a variance of 0.02.
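The optimizer setup above can be reproduced roughly as below; this is a sketch using the Hugging Face transformers scheduler, and the parameter grouping, step count, and TokenAttention module (from the earlier sketch) are our simplifications:

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

encoder = AutoModel.from_pretrained("bert-base-uncased")
ta_module = TokenAttention(dim=encoder.config.hidden_size)   # sketch defined above

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},       # pre-trained language model
    {"params": ta_module.parameters(), "lr": 3e-5},      # TA and other new parameters
])

total_pairs = 570_000 + 433_000          # approximate SNLI + MultiNLI sizes from the text
batch_size = 16
num_training_steps = total_pairs // batch_size           # one epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),      # 10% linear warm-up
    num_training_steps=num_training_steps,
)
```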

B. BASE FORM CONVERSION EFFECT
The first set of experiments evaluates the effectiveness of converting input words to their base forms. This experiment is conducted by fine-tuning the pre-trained BERT and RoBERTa models. As presented in Table 3, the results show that the base form conversion method does not contribute to a performance improvement when used alone.
In contrast, using both the TA and the base form conversion methods produces a meaningful effect, as depicted in Fig. 5, on the STS benchmark (STS-B) [27] dataset. The average performance improves in all four models: the Pearson's and Spearman's rank correlation coefficients increase on average by 0.15 and 0.175 points, respectively, when the base form conversion method is applied to the input sentence. This result demonstrates that the base form conversion method is not effective when applied only to a pre-trained language model but is effective when training the TA.

C. TOKEN ATTENTION VISUALIZATION
We design our TA technique to focus on important words in the sentence. We illustrate the effectiveness of our TA module in Tables 1 and 2. Table 1 presents the output of the trained TA in a model without the reconstruction loss; the TA module emphasizes the more informative words. In contrast, Table 2 presents the output of the TA trained with the reconstruction loss; in this case, the TA module focuses not only on important words but also on words influenced by surrounding words.

D. FINDING THE APPROPRIATE LEARNING RATIO BETWEEN SENTENCE LOSS AND RECONSTRUCTION LOSS
In this section, we conduct experiments to find the optimal temperature scaler between the sentence objective function and the reconstruction objective function. TA-SBERT is trained on the NLI dataset and evaluated on the STS-B dataset. The experiment is conducted by varying the temperature scaler of the reconstruction loss from 0.001 to 0.02. Fig. 6 illustrates the experimental results obtained under these settings; the red line denotes the average value obtained in each case. The x-axis represents the value of the temperature scaler of the reconstruction loss, while the y-axis of Fig. 6(a) represents the Pearson's correlation coefficient and the y-axis of Fig. 6(b) represents the Spearman's rank correlation coefficient.
The result of the experiment without the reconstruction loss is presented in Fig. 5(a). The Pearson's and Spearman's rank correlation coefficients are 73.31 and 73.7, respectively, when the base form conversion method is not used; these values increase to 73.55 and 73.84, respectively, when the base form conversion method is used, as depicted in Fig. 5(a). In contrast, if we add the reconstruction loss, the Pearson's and Spearman's rank correlation coefficients increase to 76.81 and 77.18, respectively, as depicted in Fig. 6.
In this experiment, the reconstruction loss performs optimally when the value of the temperature scaler is 0.017. Based on this result, we set the temperature scaler between the sentence objective function and the reconstruction objective function to 0.017, as shown in (7).

E. EVOLVING WORD REPRESENTATION
We verify the efficacy of the reconstruction loss on the word representations of the encoder layers in Figs. 7-8 by measuring the cosine similarity between hidden states of consecutive encoder layers. Figs. 7-8 show the word representations for the sentence "A man is feeding a mouse to a snake." The x-axis represents the encoder layer of the models, and the y-axis represents the cosine similarity between the current and previous layers. Fig. 7(a) illustrates that the word representations change only slightly inside the pre-trained BERT-base-uncased. Fig. 7(b) shows the word representations when only the sentence vector is trained using SBERT: the word representations are maintained up to the eighth encoder layer, after which the similarity decreases rapidly, meaning that the existing parameter weights change drastically while training the sentence vector. Fig. 8(a) reveals that SBERT with the base form conversion method behaves similarly to Fig. 7(b). If we add the TA to this model, the cosine similarity increases for important words and decreases for the other words, as depicted in Fig. 8(b). Moreover, as depicted in Fig. 8(c), when the reconstruction loss is added to the model of Fig. 8(b), the cosine similarity is the highest among all models. In particular, the cosine similarity of "snake," "feeding," and "mouse," which are important words for interpreting the input sentence, increases. These results show that a model trained with our methods activates on important words.
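The layer-wise analysis can be reproduced along the following lines; this sketch uses the plain pre-trained encoder, and the printed values are per-token similarities between consecutive layers:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sentence = "A man is feeding a mouse to a snake."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + 12 encoder layers

# cosine similarity of each token between consecutive layers
for layer in range(1, len(hidden_states)):
    prev, curr = hidden_states[layer - 1][0], hidden_states[layer][0]   # (seq_len, dim)
    sims = torch.nn.functional.cosine_similarity(prev, curr, dim=-1)
    print(f"layer {layer:2d}: " + " ".join(f"{s:.2f}" for s in sims.tolist()))
```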

F. CATASTROPHIC FORGETTING
Catastrophic forgetting [28] refers to the tendency of a neural network to forget pre-trained knowledge upon training on new knowledge. Many works propose methods to avoid catastrophic forgetting: STILTs [29] adds supervised learning before fine-tuning to avoid model overfitting and catastrophic forgetting; BERT's text classification fine-tuning method [30] adjusts the learning rate to avoid catastrophic forgetting; Mixout [31] introduces a new regularization that prevents catastrophic forgetting by increasing the stability of fine-tuning. To show the effect of the reconstruction loss on catastrophic forgetting, we examine the stability of fine-tuning at various learning rates based on the SBERT model. We fine-tune SBERT with and without the reconstruction loss using the NLI training dataset and use the NLI test dataset for evaluation. The hyperparameter settings are the same as in Section IV-A, except that the learning rate changes. Fine-tuning is performed five times with different random seeds for each learning rate. Fig. 9 shows the cross-entropy loss and accuracy for the various learning rates. In Fig. 9(a) and (b), a solid line represents the train loss and a dotted line represents the test loss; in Fig. 9(c) and (d), a red line indicates the average accuracy. Even when the learning rate increases, the stability of fine-tuning with the reconstruction loss is maintained, whereas fine-tuning without the reconstruction loss becomes unstable. Based on these results, the reconstruction loss has the effect of preventing catastrophic forgetting by increasing the stability of fine-tuning.
TABLE 3: Pearson's and Spearman's rank correlation coefficients (×100) between the cosine similarity of the sentence vectors obtained with each trained model and the gold labels. Higher values indicate a stronger correlation, meaning the predictions obtained from the model's sentence vectors are closer to the gold labels.

G. UNSUPERVISED STS TASKS
We evaluate the performance on the STS tasks using TA-SBERT fine-tuned on the NLI training dataset. The datasets used for evaluation are the STS tasks 2012-2016 [32]-[36], the STS benchmark, and the SICK-Relatedness [37] dataset. These datasets express the semantic relevance of a sentence pair as a value between 0 and 5. We determine the similarity between sentence vectors using the cosine similarity of the vectors of the input sentence pair obtained with TA-SBERT. The Pearson's and Spearman's rank correlation coefficients between the sentence-vector similarities and the gold labels are presented in Table 3: the sentence vector generated by TA-SBERT is superior to that of SBERT on all datasets. In Table 3, "+form" indicates the base form conversion method, "+att" the TA, and "+recon" the reconstruction loss. The base form conversion method (+form) alone does not affect performance (Section IV-B). However, when our base form conversion method is used together with the TA technique, the Pearson's and Spearman's rank correlation coefficients increase on average by 0.62 and 0.57, respectively, compared with SBERT.
Moreover, when the reconstruction loss is additionally used, the Pearson's and Spearman's rank correlation coefficients increase on average by 1.5 and 1.38, respectively, compared with SBERT. Consequently, the three proposed techniques (the base form conversion method, the TA, and the reconstruction loss) contribute significantly to generating the sentence vector by reducing the uncertainty of abbreviations, focusing on important words, and avoiding catastrophic forgetting.
BERT-base-uncased has 768-dimensional word vectors and 12 encoder layers, whereas BERT-large-uncased has 1,024-dimensional word vectors and 24 encoder layers. Even so, the sentence vector generated by BERT-base-uncased with our proposed techniques shows performance similar to that of the sentence vector generated by BERT-large-uncased on the STS tasks.

H. SENTENCE VECTOR PERFORMANCE
SentEval is a toolkit used to evaluate the quality of sentence vectors. It uses sentence vectors as features to train logistic regression classifiers. When training the logistic regression classifier, we set all parameters to their defaults and evaluate on the following seven datasets:
• MR: Movie Review dataset for the sentiment classification task [38]
• CR: Customer Reviews dataset for the sentiment classification task [39]
• SUBJ: Subjectivity prediction from movie reviews [40]
• MPQA: Phrase-level opinion polarity dataset for the sentiment classification task [41]
• SST: Stanford Sentiment Treebank dataset for the sentiment classification task [42]
• TREC: Question-type dataset for the multi-class classification task [43]
• MRPC: Microsoft Research Paraphrase Corpus from newswire articles for the binary classification task [44]
Table 4 presents the accuracies of the logistic regression classifier on these seven datasets, demonstrating that our TA-SBERT achieves the highest average accuracy.
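Conceptually, the SentEval protocol trains a simple classifier on frozen sentence vectors. The scikit-learn sketch below illustrates the idea only; SentEval itself uses its own nested cross-validation or fixed splits depending on the task, so this is not the toolkit's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_sentence_vectors(vectors: np.ndarray, labels: np.ndarray) -> float:
    """Train a logistic regression classifier on frozen sentence vectors,
    as SentEval does for tasks such as MR, CR, SUBJ, MPQA, SST, TREC, and MRPC.

    vectors: (n, d) sentence embeddings produced by the model under test
    labels:  (n,) task labels
    """
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, vectors, labels, cv=10, scoring="accuracy")
    return float(scores.mean())
```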

V. CONCLUSION
We propose a novel sentence embedding model, TA-SBERT, with three original contributions. First, we preprocess the input data by converting words to their base form and removing apostrophes to reduce the uncertainty of a word's meaning. Second, we introduce the TA method to assign weights dynamically to important words in a sentence. Third, we propose the reconstruction loss to avoid catastrophic forgetting by increasing the stability of fine-tuning.
We conduct experiments on STS tasks and classification tasks for seven datasets using TA-SBERT. TA-SBERT exhibits average increases of 2.04% and 1.86% for the Pearson's and Spearman's rank correlation coefficients in STS tasks compared with SBERT. In classification tasks using the SentEval toolkit, the average accuracy increases by 0.53%.
As future work, we plan to apply the Token Attention and the reconstruction loss to other downstream tasks including the task of summarizing using sentence vectors or the task of rating each sentence by importance in a long document.
SANGWON LEE received a B.S. degree in Mechanical Engineering from Inha University, Incheon, Republic of Korea, in 2013 and 2020. He is currently pursuing an M.S. in Electrical and Computer Engineering at Inha University under the guidance of Professor Choi. His research interests are deep learning, natural language processing, recommendation systems, and big data.
LING LIU is a Professor in the School of Computer Science at the Georgia Institute of Technology. She directs the research programs in the Distributed Data Intensive Systems Lab (DiSL). Prof. Liu is an IEEE Fellow and a recipient of the IEEE Computer Society Technical Achievement Award in 2012. She has published over 300 international journal and conference articles and is a recipient of best paper awards from numerous top venues. Prof. Liu served as the EIC of IEEE Transactions on Services Computing (2013-2016) and serves on the editorial boards of half a dozen international journals.
WONIK CHOI is a Professor in the School of Information and Communication Engineering at Inha University, where he runs the Data Intelligence Lab. He received his PhD in Computer Engineering from Seoul National University, Korea. His research interests include spatio-temporal databases, sensor network topology, telematics, and GIS/LBS. He was a visiting scholar in the School of Computer Science at Georgia Institute of Technology in 2012.