Leveraging Concept-Enhanced Pre-Training Model and Masked-Entity Language Model for Named Entity Disambiguation

Named Entity Disambiguation (NED) refers to the task of resolving multiple named entity mentions in an input-text sequence to their correct references in a knowledge graph. We tackle the NED problem by leveraging two novel objectives for the pre-training framework and propose a novel pre-training NED model. Specifically, the proposed pre-training NED model consists of: (i) concept-enhanced pre-training, which aims at identifying valid lexical semantic relations using concept semantic constraints derived from the external resource Probase; and (ii) a masked-entity language model, which trains the contextualized embeddings by predicting randomly masked entities based on the words and non-masked entities in the given input-text. The proposed pre-training NED model therefore merges the advantage of the pre-training mechanism for generating contextualized embeddings with the superiority of lexical knowledge (here, concept knowledge) for understanding language semantics. We conduct experiments on the CoNLL and TAC datasets, as well as on the various datasets provided by the GERBIL platform. The experimental results demonstrate that the proposed model achieves significantly higher performance than previous models.


I. INTRODUCTION
Named Entity Disambiguation (NED) is important for various Natural Language Processing (NLP) tasks such as question answering and dialog systems [1]-[3]. Although current neural-network-based approaches have advanced the state-of-the-art results on the NED task [4]-[8], they fail to model complex semantic relationships, and multiple signals (i.e., words, entities, etc.) cannot fully interplay in their architectures. On the other hand, language model pre-training has been shown to be effective for improving many NLP tasks [9]-[11], owing to its ability to represent complex context. Hence, this study tests the effectiveness of pre-trained contextualized embeddings for the NED task. In this paper, we describe a novel unsupervised pre-training model for words and entities towards the NED task.
Conventional unsupervised pre-training models have been shown to facilitate a wide range of downstream applications, yet they still encode only distributional knowledge, incorporated through masked language modeling objectives. To enhance the representation ability of the pre-training mechanism, this paper introduces extra lexical knowledge, which has been shown to help semantic understanding in many NLP tasks [12]-[14] and can easily be combined with distributional knowledge. Specifically, we couple BERT's Masked Language Model (MLM) objective [9] with a novel Concept Correlation Prediction (CCP) objective derived from the lexical knowledge graph Probase [12], [15], through which we inject prior lexical knowledge (i.e., concept knowledge from Probase [12], [16]) into the basic BERT. CCP can be viewed as an additional classification task that aims at identifying valid lexical semantic relations using concept semantic constraints derived from the external resource Probase.
Furthermore, we leverage another objective on top of the aforementioned concept-enhanced BERT to obtain our final pre-training model for the NED task. Concretely, inspired by the Masked Language Model (MLM) objective [9], we propose the Masked-Entity Language Model (MELM), a novel classification task that trains the embedding model by predicting randomly masked entities based on the words and non-masked entities in the given input-text sequence.
In conclusion, the proposed pre-training model for the NED task feeds two novel objectives, the Concept Correlation Prediction objective and the Masked-Entity Language Model objective, into the traditional pre-training framework (BERT [9] in this paper). The proposed pre-training NED model therefore merges the advantage of the pre-training mechanism for generating contextualized embeddings with the superiority of lexical knowledge (here, concept knowledge) for understanding language semantics.
For evaluation, we train the proposed pre-training NED model using texts and their entity annotations retrieved from Wikipedia and news articles. We then test the proposed model on two standard NED datasets and on the GERBIL platform, a benchmark platform that includes the results of many state-of-the-art NED models as well as quite a few datasets. The experimental results reveal that our model outperforms state-of-the-art models on both datasets. The contributions of this paper are summarized as follows: (i) We leverage extra lexical knowledge (concept knowledge in this paper) to supplement unsupervised pre-training models, by coupling the masked language model objective with a novel concept correlation prediction (CCP) objective. To the best of our knowledge, this is the first work to adapt concept relation semantics (e.g., comparative concept correlation) to the pre-training framework. (ii) A novel masked-entity language model (MELM) is introduced, which trains the contextualized embedding model by predicting randomly masked entities based on the words and non-masked entities in the given input-text. (iii) With the efforts above, this paper proposes a novel unsupervised pre-training model for words and entities towards the named entity disambiguation (NED) task, combining the advantages of both the pre-training mechanism and prior lexical knowledge. (iv) Finally, the experimental results demonstrate that not only the overall proposed pre-training NED model but also the concept-enhanced BERT alone yields performance gains over current state-of-the-art models on the NED task.

II. RELATED WORK AND MOTIVATION
A. UNSUPERVISED PRE-TRAINING APPROACHES
Language model pre-training, such as ELMo [10], GPT/GPT-2 [11], [17] and BERT [9], has been shown to be effective for improving many Natural Language Processing (NLP) tasks. ELMo [10] generalized traditional word embedding research along a different dimension by extracting context-sensitive features from a language model. GPT [17] enhanced context-sensitive embeddings by adopting the Transformer [18]. Like traditional static embedding models, unsupervised pre-training models still rely only on large text corpora. E.g., the basic BERT model [9], which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, consumes only distributional information and ignores extra lexical knowledge. Some research has tried to introduce prior semantic knowledge into the pre-training scheme. E.g., [19] introduced an entity-level masking strategy to ensure that all of the words in the same entity are masked during word representation training, instead of only one word or character being masked; [20] updated contextual word representations via a form of word-to-entity attention, inserting prior knowledge into a deep neural model. On the other hand, previous work has demonstrated that leveraging extra lexical knowledge (e.g., concepts) can significantly boost the effectiveness of contextualized embeddings for words [15], entities [21], relations [22], sentences [23], and so on. Overall, the lexicon-enhanced contextualized representations produced by these models have yielded substantial gains on a number of downstream NLP tasks. SHORTAGE: Almost all current pre-training models still encode only distributional knowledge (provided by word surface forms), ignoring the beneficial lexical semantics (e.g., concept information) from extra high-quality knowledge sources for understanding language.
MOTIVATION: Therefore, this paper investigates whether supplementing unsupervised pre-training models with extra lexical knowledge can also produce substantial improvements. The goal is to inform the pre-training model about the relation of comparative concept correlation among words: according to prior work on static word embeddings, sets of extra semantic information (i.e., concepts) are useful for boosting a model's ability to capture true semantic relations, which in turn has a positive effect on downstream language understanding applications. Specifically, we leverage concept information to enhance the representation ability of the conventional pre-training model, and propose concept-enhanced pre-training (details in Section IV-B; experimental comparison in Section V-D2).

B. NAMED ENTITY DISAMBIGUATION
Disambiguating among similar entities is a crucial step in extracting relevant information from texts: ambiguous references need to be resolved, requiring additional steps that go beyond grammatical parsing. Early Named Entity Disambiguation (NED) models addressed the problem as a well-studied word sense disambiguation problem [24]-[26], primarily modeling the similarity of textual context, or modeling the coherence among disambiguated entities in the same context [27]-[29].
More recently, neural-network-based approaches, which focus on learning representations of entities and words, have advanced the state-of-the-art results on the NED task. Reference [30] used random walks (RW) on knowledge graphs to construct vector representations of entities and documents to address the problem of NED. Reference [27] used an iterative heuristic to remove unpromising mention-entity edges. Reference [31] presented a graph-based disambiguation approach based on Personalized PageRank (PPR) [32] that combines local and global evidence for disambiguation. Reference [5] combined the benefits of deep learning with more traditional approaches such as graphical models [33] and probabilistic mention-entity maps. Reference [34] proposed a method to map entities into the word embedding space [35] using entity descriptions in the knowledge graph and applied it to the NED task. Reference [4] created a state-of-the-art NED system that models entity-context similarity with word embeddings and entity embeddings trained using the Skip-Gram model [36]. Similarly, deep neural networks were introduced to compute representations of entities and mention contexts directly from the knowledge graph in [37], and to model representations of entity-mentions, contexts of entity-mentions, and entities in [38]. Reference [39] also leveraged deep neural networks to learn entity representations such that the resulting pairwise entity relatedness was better suited to the NED task than a standard method (i.e., the Wikipedia Link-based Measure, which measures entity relatedness based on link structure). Similarly, [40] utilized an extra Wikipedia Link-based dataset to improve the performance of NED. Furthermore, to model coherence, [39] utilized hierarchical information in the knowledge graph to generate embeddings, while [29] introduced a novel coherence model with an attention-like mechanism, wherein the score for each entity candidate depends only on a small subset of entity-mentions. Reference [6] proposed a comprehensive approach for short-text entity recognition and linking, using the concepts of entities as fine-grained topics to explicitly represent the context and model topic coherence. Recently, much research has investigated end-to-end strategies for the NED task; e.g., [7] exploited the effectiveness of both knowledge-based and corpus-based semantic similarity methods for NED, and [8] treated relations as latent variables and induced them without any supervision while optimizing the NED method in an end-to-end manner.
SHORTAGE: In brief, previous neural-network-based methods have recently achieved strong results on NED. Generally, the key component of these models is an embedding model of words and entities trained using a large knowledge base. These models are typically based on conventional word embedding models that assign a fixed embedding to each word and entity. Although neural-network-based approaches have advanced the state-of-the-art results on the NED task, they fail to model and represent complex relationships and context, and multiple signals (i.e., words, entities, context, etc.) cannot fully interplay in their architectures.
MOTIVATION: Hence, we introduce a novel concept-enhanced pre-training mechanism for the NED task, because: (i) language model pre-training has been shown to be effective for improving many NLP tasks, owing to its ability to model and represent complex semantic relationships and context; in this study, we therefore test the effectiveness of pre-trained contextualized embeddings for the NED task. (ii) Prior work on concept-enhanced contextualized embeddings (for words, entities, sentences, etc.) has shown that steering distributional models towards robust semantic representations has a positive impact on natural language understanding. Accordingly, unlike these NED methods, our proposed model is a novel concept-enhanced pre-training NED model for words and entities, including a novel masked-entity language model (MELM) objective (details in Section IV-C; comparison in Section V-D1) and a novel concept-enhanced pre-training mechanism (details in Section IV-B; experimental comparison in Section V-D2), and hence enables accurate representations for the NED task. Being based on pre-training, the proposed model enables the signals (i.e., the words, the entities and the concepts) to fully interplay to derive the NED results for a given text.

III. NOTATION AND DEFINITION
This paper represents vectors with lowercase letters and matrices with uppercase letters. Let $x \in \mathbb{R}^k$ be a vector of length $k$, i.e., the embedding dimensionality is $k$. In this paper, the term ''token'' refers to a word (denoted as $w$) or an entity (denoted as $e$) occurring in the input-text sequence. $V$ denotes the 30K-WordPiece [41] word vocabulary, and accordingly $|V|$ is the number of words in the vocabulary. In this paper, a ''word'' can refer to a linguistic word or a sub-word. Let $E$ and $R$ represent the set of entities and the set of relations, respectively, so that $|E|$ is the number of entities and $|R|$ is the number of relation types. In this sense, $E$ can be viewed as the entity vocabulary of this paper, and $E$ is in fact equivalent to the entity set KB [29] described later in Section V-C1. The embedding matrix is denoted as $M$, and the projection (weight) matrix is denoted as $W$. Therefore, the matrices of the 30K-WordPiece word token embeddings and the entity token embeddings are represented as $M_V \in \mathbb{R}^{|V| \times k}$ and $M_E \in \mathbb{R}^{|E| \times k}$, respectively. A knowledge graph (KG) is denoted as $G$ and consists of a large number of ''facts''. Generally, a ''fact'' takes the form $(h, r, t)$ and acts as the basic unit of the KG, wherein $h \in E$ is the head entity, $r \in R$ is the relation, and $t \in E$ is the tail entity, meaning that relation $r$ holds between head entity $h$ and tail entity $t$ [42]. Given a knowledge graph $G$, it contains $|E|$ entities and $|R|$ types of relations.

IV. METHODOLOGY
In this section, we introduce the proposed pre-training model for the NED task in detail. First, we sketch the overall general framework (Section IV-A). Then, we describe the details of the proposed pre-training NED model, which consists of: (i) concept-enhanced pre-training (Section IV-B), aiming at identifying valid lexical semantic relations with the concept semantic constraints derived from the external resource Probase [12], [15]; and (ii) the masked-entity language model (Section IV-C), aiming to train the contextualized embeddings by predicting randomly masked entities based on the words and non-masked entities in the given input-text.
In conclusion, the proposed pre-training model for the NED task feeds two novel objectives, the Concept Correlation Prediction objective (Section IV-B) and the Masked-Entity Language Model objective (Section IV-C), into the traditional pre-training framework (BERT [9] in this paper). The proposed pre-training NED model therefore merges the advantage of the pre-training mechanism for generating contextualized embeddings with the superiority of lexical knowledge (here, prior knowledge about concepts) for understanding language semantics.

A. GENERAL FRAMEWORK
Figure 1 sketches the architecture of the proposed model, which takes a sequence of words, constraints (derived directly from Probase) and entities contained in the input-text, and generates an embedding representation for each word and entity. Following [18], a multi-layer bidirectional transformer encoder is adopted: given a sequence of tokens (words/entities), the proposed model first represents each token in the sequence as an input embedding (i.e., the sum of the token-self embedding (yellow rectangle in Figure 1), the token-type embedding (green rectangle in Figure 1) and the token-position embedding (gray rectangle in Figure 1)), and then generates an output embedding for each token (blue rectangle in Figure 1). Following [9], the proposed model inserts the special indication token [CLS] at the beginning of the input-text sequence as the first token, and inserts [SEP] at the end of the input-text sequence as the separator token.
The reason why we choose Probase [12] as our extra lexical knowledge source is as follows. As described above, this paper investigates language understanding aided by a lexical knowledge source that provides concept information. DBpedia [43] and similar resources are encyclopedic knowledge graphs, while Probase and WordNet [44] are lexical knowledge graphs. Prior work has demonstrated that lexical knowledge graphs are essential for helping a machine understand world facts and semantics; that is, knowledge of the language itself should be used. Encyclopedic knowledge graphs contain facts such as Barack Obama's birthday and birthplace, while lexical knowledge graphs can state that birthplace and birthday are properties of a person. Besides, we choose Probase rather than WordNet because Probase is widely used in recent research on text understanding, text conceptualization, text representation, information retrieval and knowledge graph completion [13], [14], [23], [45]. Probase uses an automatic and iterative procedure to extract concept knowledge from 1.68 billion Web pages. It contains 2.36 million open-domain terms, and each term is a concept, an instance (corresponding to an entity-mention occurring in a given text in this study), or both. Meanwhile, it provides around 14 million relationships covering two kinds of important prior knowledge related to concepts: concept-attribute co-occurrence (isAttributeOf) and concept-instance co-occurrence (isA). Moreover, Probase provides a huge number of high-quality and robust concepts [12]. Therefore, the lexical knowledge graph Probase is utilized in this paper to leverage lexical semantics for boosting NED performance.

B. CONCEPT-ENHANCED PRE-TRAINING MODEL
1) BASIC PRE-TRAINING MODEL
Reference [9] proposed Bidirectional Encoder Representations from Transformers (BERT) and adopted a Masked Language Model (MLM) pre-training objective, which randomly masks some of the tokens in the input-text; the objective is to predict the original vocabulary ID of each masked word based only on its context. Moreover, BERT learns to predict whether two sentences are adjacent; this task models the relationship between two sentences, which is not captured by traditional language models. There are two main steps in the BERT framework: (i) pre-training, wherein the model is trained on unlabeled data over different pre-training tasks; and (ii) fine-tuning, wherein all of the parameters are fine-tuned using labeled data from the downstream tasks. This pre-training scheme helps BERT outperform state-of-the-art techniques by a large margin on various key NLP datasets.
Overall, BERT's model architecture is a multi-layer bidirectional transformer encoder based on the original implementation described in [18]. The input-text is usually tokenized into word tokens using BERT's sub-word tokenizer (30K-WordPiece [41]). In particular, in order to train a deep bidirectional representation, BERT simply masks a percentage of the input word tokens at random and then predicts those masked tokens; the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary V. This procedure is called the ''Masked Language Model (MLM)'' in BERT. However, we argue that current techniques restrict the power of the pre-trained representations. The major limitation is that standard masked language models rely only on distributional knowledge, ignoring the beneficial supplement of extra lexical knowledge (e.g., concepts), which has been shown to help semantic understanding in many NLP tasks [12]-[14], [22].
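For illustration, the following minimal PyTorch sketch shows this mask-and-predict step, i.e., projecting the hidden vector of each masked position onto a softmax over the vocabulary V; the module and its names are illustrative and not the exact implementation of [9].

```python
# A minimal sketch of the Masked Language Model (MLM) prediction head; names
# and sizes are illustrative, not the implementation of [9].
import torch.nn as nn

class MaskedLMHead(nn.Module):
    """Predicts the original word-piece id of each masked position."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)   # output softmax over V

    def forward(self, hidden_states, labels, masked_positions):
        # hidden_states: (batch, seq_len, hidden_size); masked_positions: boolean mask
        logits = self.proj(hidden_states)                # (batch, seq_len, |V|)
        return nn.functional.cross_entropy(
            logits[masked_positions], labels[masked_positions]  # masked tokens only
        )

# usage sketch: loss = MaskedLMHead(1024, 30000)(encoder_output, input_ids, mask)
```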

2) CONCEPT-ENHANCED PRE-TRAINING
To enhance the traditional BERT with extra lexical knowledge, we propose a novel pre-training model, Concept-Enhanced BERT, which leverages concept constraints to strengthen language understanding. The goal is to enhance the BERT model with concept constraints, following prior work on short-text conceptualization, text representation learning, knowledge graph completion and natural language understanding [15], [21], [23]. The sets of concept constraints utilized here are based on the comparative concept correlation relation, and are useful for boosting the basic pre-training model's ability to capture optimal semantic correlations and representations.
Concretely, as mentioned above, the basic BERT feeds input token representations to the Masked Language Model (MLM) classifier, i.e., predicting the masked tokens. We introduce another pre-training classification objective: it predicts whether an encoded word triple represents a valid lexical relation (here, the comparative concept correlation relation: a positive triple consists of three words wherein the first word is more related to the second word than to the third word at the concept level). Note that other kinds of lexical relation constraints (e.g., synonymy, hypernym-hyponym relations, and so on) could be adopted here thanks to the generality and flexibility of our model.
We derive the aforementioned word triples as concept constraints from the high-quality lexical knowledge graph Probase [12], with the form $T = \{(w_1, w_2, w_3)\}$. Following successful work on short-text understanding and disambiguation [14], [15], [23], this paper selects the above-mentioned comparative concept correlation as the constraint for strengthening the original pre-training mechanism: given a triple $t = (w_1, w_2, w_3) \in T$, word $w_1$ shares more concepts with word $w_2$ than with word $w_3$, wherein an instance conceptualization algorithm [16], [46] is adopted to generate concepts for $w_1$, $w_2$ and $w_3$ from Probase. We extract such lexical-semantic relations to construct the concept constraint set $T$. Concretely, we introduce a concept-based semantic similarity [21] to measure the comparative concept correlation: (i) Given word $w_1$, we denote its concept-set as $C_{w_1}$, consisting of the concepts derived from Probase by the instance conceptualization algorithm [16], [46]. E.g., given the word ''microsoft'', the instance conceptualization algorithm returns its concept distribution with the corresponding scores, from which its concept-set is generated as follows: $C_{microsoft}$ = {COMPANY, VENDOR, CLIENT, FIRM, ORGANIZATION, CORPORATION, BRAND, ...}. The concept-sets of the other words in the triple ($w_2$ and $w_3$) are constructed in the same way. (ii) The semantic similarity between $w_1$ and $w_2$, which measures how close the two words are in concept semantics, is defined over their concept-sets $C_{w_1}$ and $C_{w_2}$ following [21]. Intuitively, each triple in the aforementioned $T$ can be viewed as a ''comparative concept correlation'' constraint.
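For illustration, a minimal sketch of how such concept-sets and comparative-concept-correlation triples could be derived from Probase-style isA pairs is given below; the Jaccard-style overlap is only an illustrative stand-in for the concept-based similarity of [21], and the data layout and helper names are assumptions, not our released code.

```python
# A minimal sketch of concept-set construction and comparative concept
# correlation; the similarity is an illustrative Jaccard-style overlap.
from collections import defaultdict

def load_concept_sets(isa_pairs):
    """isa_pairs: iterable of (instance, concept) pairs, e.g. ('microsoft', 'company')."""
    concept_sets = defaultdict(set)
    for instance, concept in isa_pairs:
        concept_sets[instance].add(concept)
    return concept_sets

def concept_similarity(concept_sets, w1, w2):
    """Overlap of the two words' concept-sets (illustrative stand-in for [21])."""
    c1, c2 = concept_sets.get(w1, set()), concept_sets.get(w2, set())
    return len(c1 & c2) / len(c1 | c2) if (c1 or c2) else 0.0

def is_positive_triple(concept_sets, w1, w2, w3):
    """(w1, w2, w3) is positive if w1 shares more concepts with w2 than with w3."""
    return concept_similarity(concept_sets, w1, w2) > concept_similarity(concept_sets, w1, w3)
```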
Since each constraint $t = (w_1, w_2, w_3)$ corresponds to a true lexical semantic relation, it can be viewed as a positive training example for the model, and its corresponding negative examples are created as follows: (i) We first group the positive constraints from $T$ into mini-batches $B_p$; (ii) For each positive example $t = (w_1, w_2, w_3)$, we create two negative instances $\tilde{t}_1 = (w_1, \tilde{w}_2, w_3)$ and $\tilde{t}_2 = (w_1, w_2, \tilde{w}_3)$, such that $\tilde{w}_2$ is the word sampled from batch $B_p$ (other than $w_2$) closest to $w_2$ (i.e., with $sim(w_2, \tilde{w}_2)$ below a preset threshold), and similarly $\tilde{w}_3$ is the word (other than $w_3$) closest to $w_3$, in terms of the concept correlation of their concepts in Probase.
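The in-batch negative sampling above can be sketched as follows; the similarity function, the threshold value and the helper names are illustrative assumptions.

```python
# A minimal sketch of in-batch negative creation for CCP. `sim` is any
# word-pair similarity (e.g. the concept_similarity sketched earlier).
def make_negatives(batch, sim, threshold=0.5):
    """batch: list of positive triples (w1, w2, w3) from the mini-batch B_p."""
    negatives = []
    for (w1, w2, w3) in batch:
        pool = {w for triple in batch for w in triple} - {w1, w2, w3}

        def closest_below(target):
            # closest in-batch word whose similarity to `target` stays below the threshold
            candidates = [w for w in pool if sim(target, w) < threshold]
            return max(candidates, key=lambda w: sim(target, w)) if candidates else None

        tilde_w2, tilde_w3 = closest_below(w2), closest_below(w3)
        if tilde_w2 is not None:
            negatives.append((w1, tilde_w2, w3))   # corrupt the second word
        if tilde_w3 is not None:
            negatives.append((w1, w2, tilde_w3))   # corrupt the third word
    return negatives
```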
Next, given a text $s = \{w_1, w_2, \cdots, w_n\}$, we form the potential triple-wise combinations among all these tokens to generate triples. If a triple belongs to $T$ (constructed above), we select and retain it in the constraint-set $T_s$ of the current text. Finally, we transform each constraint $t = (w_1, w_2, w_3) \in T_s$ into a ''BERT-compatible'' input format, as follows: We insert the special beginning token [CLS] before $w_1$, and insert the special end token [SEP] after the tokens of each of $\{w_1, w_2, w_3\}$, yielding the sequence ''[CLS] $w_1$ [SEP] $w_2$ [SEP] $w_3$ [SEP]'' for the constraint $t = (w_1, w_2, w_3)$. We then transform each instance into a ''BERT-compatible'' format, i.e., into a sequence of the widely-used 30K-WordPiece [9], [41] tokens: we split $w_1$, $w_2$ and $w_3$ into 30K-WordPiece tokens. To generate the input representation of the multi-layer bidirectional transformer, we also sum the token-self embedding, token-type embedding and token-position embedding of each token, similar to [9], as shown in Figure 1. In particular, (i) we assign the token-type ID of 1 to all $w_1$ tokens, as well as to the [CLS] token and the [SEP] token before and after the $w_1$ tokens; (ii) we assign the token-type ID of 2 to all $w_2$ tokens and the [SEP] token after them; and (iii) we assign the token-type ID of 3 to all $w_3$ tokens and the [SEP] token after them, as shown in Figure 1. Taking Figure 1 as an example, given the text ''microsoft unveils office for apple ipad'', the constraint triple (apple, microsoft, ipad) exists in the current context, and we transform this constraint into the BERT-compatible input format ''[CLS] apple [SEP] microsoft [SEP] ipad [SEP]'', as shown in the middle part of Figure 1. Note that the other constraints in $T_s$ are all processed in the same manner and then sequentially appended to the sequence; the ellipsis in Figure 1 therefore represents the other constraints in BERT-compatible format for this context.
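A minimal sketch of packing one constraint into this ''BERT-compatible'' format, with the token-type IDs 1, 2 and 3, is given below; the `tokenize` argument stands in for the 30K-WordPiece tokenizer and is an assumption.

```python
# A minimal sketch of constraint packing: [CLS] w1 [SEP] w2 [SEP] w3 [SEP],
# with token-type ids 1/2/3 as described in the text.
def encode_constraint(w1, w2, w3, tokenize):
    tokens, type_ids = ["[CLS]"], [1]            # [CLS] shares w1's type id
    for type_id, word in ((1, w1), (2, w2), (3, w3)):
        pieces = tokenize(word)                  # split into WordPiece tokens
        tokens += pieces + ["[SEP]"]
        type_ids += [type_id] * (len(pieces) + 1)
    return tokens, type_ids

# e.g. encode_constraint("apple", "microsoft", "ipad", str.split) ->
# (['[CLS]', 'apple', '[SEP]', 'microsoft', '[SEP]', 'ipad', '[SEP]'], [1, 1, 1, 2, 2, 3, 3])
```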
Then, we feed these prepared token input representations into a novel objective, Concept Correlation Prediction (abbreviated as CCP), which is introduced as an additional classification objective alongside the original BERT objectives. Let $c \in \mathbb{R}^k$ be the transformed vector representation of the concept-constraint sequence's beginning token [CLS], which encodes the whole comparative concept correlation $(w_1, w_2, w_3)$ (as shown in Figure 1). Our Concept Correlation Prediction (CCP) objective is then defined with the widely-used softmax strategy:

$p_{CCP} = \mathrm{softmax}(W_{CCP}\, c + b_{CCP}) \qquad (1)$

wherein $W_{CCP} \in \mathbb{R}^{2 \times k}$ and $b_{CCP} \in \mathbb{R}^{2}$ denote the weight matrix and the output bias of the Concept Correlation Prediction objective, respectively, and are parameters to be trained. With the efforts above, the training loss function of the proposed concept-enhanced pre-training model is the sum of the log-likelihood of the original Masked Language Model (MLM) objective [9] and the log-likelihood of the proposed Concept Correlation Prediction objective (Eq. (1)). Overall, we train all the parameters of the proposed model by optimizing this training loss function.
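For illustration, the CCP classifier of Eq. (1) can be sketched as a binary softmax head over the [CLS] vector; the module and the way the two log-likelihoods are summed are illustrative rather than the exact implementation.

```python
# A minimal sketch of the Concept Correlation Prediction (CCP) head of Eq. (1).
import torch.nn as nn

class CCPHead(nn.Module):
    """Classifies whether a constraint sequence encodes a valid comparative concept correlation."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 2)    # W_CCP in R^{2 x k}, b_CCP in R^2

    def forward(self, cls_vector, label):
        logits = self.proj(cls_vector)           # softmax over {valid, invalid}
        return nn.functional.cross_entropy(logits, label)

# joint pre-training loss (sketch): loss = mlm_loss + ccp_loss
```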

C. MASKED-ENTITY LANGUAGE MODEL
As discussed at the beginning of this section, we add another objective on top of the aforementioned concept-enhanced pre-training (Section IV-B): the Masked-Entity Language Model (MELM) objective, which specializes in the Named Entity Disambiguation (NED) task and combines easily with the concept-enhanced pre-training described above (details in Section IV-C2). This novel objective is described in detail in this section. Note that our Masked-Entity Language Model also inherits the multi-layer bidirectional transformer encoder of both BERT and our concept-enhanced BERT, and can be viewed as a supplement to either of them; this is why the proposed Masked-Entity Language Model connects seamlessly with each of them (as shown in Figure 2 (a)). The concept-enhanced BERT (Section IV-B) and this Masked-Entity Language Model together constitute our full pre-training NED model, shown as the green part of Figure 2 (a).
The input-text is also tokenized into words using the same 30K-WordPiece [41] vocabulary as BERT and other variant pre-training models (such as [19]). Given a sequence consisting of words and entities, our Masked-Entity Language Model first represents this sequence as a sequence of input embeddings (Section IV-C1), and then generates a contextualized output embedding for each token. Next, the model randomly masks some entities in the input-text (shown in Figure 2 (b)), and trains the embeddings and parameters to predict these masked entities based on the words and the other non-masked entities (Section IV-C2). Lastly, the inference procedure for the NED task is described in Section IV-C3.

1) INPUT REPRESENTATION GENERATION
As shown in Figure 1, in the NED task we represent the words and entities of the input-text as a single packed sequence. Therefore, the proposed model generates the input embedding representation for: (i) words, (ii) ''comparative concept correlation'' constraints (defined in Section IV-B), (iii) entity-mentions consisting of a single word, and (iv) entity-mentions consisting of multiple words, respectively.
Firstly, following [9] and [19], for the word in the i-th position of the current input-text sequence, denoted as $w_i$, its input representation is constructed by summing the following embeddings:
- Token-self embedding: the word embedding of the corresponding word, denoted as $w_i \in \mathbb{R}^k$ (a row of the 30K-WordPiece word token embedding matrix $M_V$);
- Token-type embedding: represents the type of token, namely ''word type''; we assign the token-type ID of 0 to each such token;
- Token-position embedding: represents the position of $w_i$, denoted as $p^V_i \in \mathbb{R}^k$.
Secondly, for each concept constraint triple (i.e., a ''comparative concept correlation'' constraint described in Section IV-B), the word in the i-th position of the current sequence is denoted as $w_i$, and its input representation is constructed by summing the following embeddings:
- Token-self embedding: also the 30K-WordPiece embedding of the corresponding word, denoted as $w_i \in \mathbb{R}^k$ (a row of $M_V$);
- Token-type embedding: represents the ''concept-constraint type'', set as discussed in Section IV-B2: we assign the token-type IDs of 1, 2 and 3 to the tokens of the constraint triple, respectively;
- Token-position embedding: represents the position of $w_i$, denoted as $p^V_i \in \mathbb{R}^k$.
Thirdly, for an entity-mention in the i-th position consisting of a single word, denoted as $e_i$, its input representation is constructed by summing the following embeddings:
- Token-self embedding: the entity embedding of the corresponding entity, denoted as $e_i \in \mathbb{R}^k$ (a row of the entity token embedding matrix $M_E$);
- Token-type embedding: represents the type of token, namely ''entity-mention type''; we assign the token-type ID of 4 to each such token;
- Token-position embedding: represents the position of $e_i$, denoted as $p^E_i \in \mathbb{R}^k$, with $p^E_i = p^V_i$.
Lastly, for an entity-mention consisting of multiple words spanning from the i-th position to the j-th position, denoted as $e_i$, its input representation is constructed by summing the following embeddings:
- Token-self embedding: the entity embedding of the corresponding entity, denoted as $e_i \in \mathbb{R}^k$ (a row of $M_E$);
- Token-type embedding: represents the ''entity-mention type''; we assign the token-type ID of 4 to each such token;
- Token-position embedding: represents the position of $e_i$, denoted as $p^E_i \in \mathbb{R}^k$, computed by averaging the position embeddings over the span, i.e., $p^E_i = \frac{1}{j-i+1}\sum_{m=i}^{j} p^E_m$.
Hence, all of the words in the same entity are considered together, like [19].
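For illustration, the input-representation step above can be sketched as follows: each token embedding is the sum of its token-self, token-type and token-position embeddings, and a multi-word entity-mention averages the position embeddings over its span; the module, sizes and argument conventions are illustrative assumptions.

```python
# A minimal sketch of input-embedding construction; M_V, M_E and the type ids
# follow the text, but the module itself is illustrative.
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, n_words, n_entities, n_types=5, max_pos=256, k=1024):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, k)       # M_V
        self.entity_emb = nn.Embedding(n_entities, k)  # M_E
        self.type_emb = nn.Embedding(n_types, k)       # token-type ids 0..4
        self.pos_emb = nn.Embedding(max_pos, k)

    def word_input(self, word_id, type_id, position):
        """All arguments are scalar LongTensors."""
        return self.word_emb(word_id) + self.type_emb(type_id) + self.pos_emb(position)

    def entity_input(self, entity_id, first_pos, last_pos):
        """Entity-mention spanning positions first_pos..last_pos (Python ints)."""
        span = torch.arange(first_pos, last_pos + 1)
        pos = self.pos_emb(span).mean(dim=0)            # average over the span
        return self.entity_emb(entity_id) + self.type_emb(torch.tensor(4)) + pos
```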
In conclusion, for a given word or entity in the input-text, its input representation for the proposed pre-training NED model is constructed by summing the following embeddings [9], [19]: (i) the token-self embedding, which is the embedding of the corresponding word or entity (yellow rectangle in Figure 1); (ii) the token-type embedding, which represents the type of token (green rectangle in Figure 1); and (iii) the token-position embedding, which represents the position of the given token in the sequence (gray rectangle in Figure 1).

2) TRAINING PROCEDURE
This section describes the procedure of the proposed Masked-Entity Language Model for training the embeddings of words and entities. The proposed model randomly masks some entities in the input-text sequence (shown in Figure 2 (b)), and then trains the embeddings to predict these masked entities based on the words and the other non-masked entities, similar to the strategy used in [15], [35], [36] and [9], wherein some word(s) are masked (shown and compared with the proposed Masked-Entity Language Model in Figure 2 (b)). Note that the masked entities are referred to as ''truth-entities'' in this paper, and are represented by the special indication token [MASK] (the [MASK] token does not appear during the fine-tuning stage). Let $o_{MELM-ME}$ denote the output embedding of the masked entity. The mask vector $m_{MELM} \in \mathbb{R}^k$ is then defined as follows:

$m_{MELM} = \tau(\eta(W_{MELM}\, o_{MELM-ME} + b_{MELM})) \qquad (2)$

wherein $W_{MELM} \in \mathbb{R}^{k \times k}$ denotes the weight matrix and $b_{MELM} \in \mathbb{R}^k$ the bias vector of the proposed Masked-Entity Language Model objective; $\{W_{MELM}, b_{MELM}\}$ are part of the parameters to be trained. Besides, $\eta(\cdot)$ represents the activation function, for which the GELU (Gaussian Error Linear Unit) activation [47] is adopted, and $\tau(\cdot)$ indicates the layer normalization function. Hence, we predict the ''truth-entity'' of the masked entity with a softmax over all entities in $E$ defined in the knowledge graph $G$, in a form similar to Eq. (1):

$p_{MELM} = \mathrm{softmax}(W_{E}\, m_{MELM} + b_{E}) \qquad (3)$

wherein $W_E \in \mathbb{R}^{|E| \times k}$ and $b_E \in \mathbb{R}^{|E|}$ are the output weight matrix and bias over the entity vocabulary.
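For illustration, Eqs. (2)-(3) can be sketched as the following PyTorch head; the output projection over the entity vocabulary and all names are illustrative rather than the exact implementation.

```python
# A minimal sketch of the Masked-Entity Language Model head of Eqs. (2)-(3).
import torch
import torch.nn as nn

class MELMHead(nn.Module):
    def __init__(self, k, n_entities):
        super().__init__()
        self.proj = nn.Linear(k, k)           # W_MELM, b_MELM in Eq. (2)
        self.act = nn.GELU()                  # activation eta(.)
        self.norm = nn.LayerNorm(k)           # layer normalization tau(.)
        self.out = nn.Linear(k, n_entities)   # W_E, b_E in Eq. (3)

    def forward(self, masked_entity_output):
        m = self.norm(self.act(self.proj(masked_entity_output)))   # Eq. (2)
        return torch.log_softmax(self.out(m), dim=-1)              # Eq. (3): scores over E
```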

3) INFERENCE PROCEDURE FOR NAMED ENTITY DISAMBIGUATION
This section sketches the procedure of Named Entity Disambiguation (NED) based on the aforementioned pre-training NED model. For each entity-mention u with its Top-N entity candidates occurring in the given input-text sequence, the proposed model creates an input sequence consisting of: (i) a masked entity corresponding to this entity-mention u, (ii) the words occurring in the given input-text, and (iii) for each of the other entity-mentions, one entity candidate drawn randomly from its corresponding Top-N entity candidates. The proposed model takes this input sequence and generates the mask vector $m_u \in \mathbb{R}^k$ corresponding to the entity-mention u according to Eq. (2). Then, the proposed model predicts the optimal entity for this entity-mention u by applying the softmax function (Eq. (3)) restricted to its N entity candidates; a minimal sketch of this step is given below.
Note that [19] introduced an entity-level masking strategy to ensure that all of the words in the same entity are masked during word representation training, instead of only one word or character being masked. There are essential differences between our model and [19]: (i) the entity-level masking in [19] is still token-level, while ours is genuinely entity-level; (ii) the entity-level masking in [19] can be viewed as a constraint and inherits the objective of BERT, while our masked-entity model is a novel objective (as shown in Figure 2). A detailed discussion is provided in the experimental section.
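As referenced above, a minimal sketch of the candidate-restricted inference step is as follows; the helper names are assumptions.

```python
# A minimal sketch of inference for one mention u: score with the MELM head of
# Eqs. (2)-(3) and restrict the softmax to its Top-N candidates.
import torch

def disambiguate(melm_head, masked_entity_output, candidate_ids):
    """candidate_ids: LongTensor holding the Top-N candidate entity ids of mention u."""
    log_probs = melm_head(masked_entity_output)   # log-probabilities over all entities in E
    scores = log_probs[candidate_ids]             # keep only the N candidates
    return candidate_ids[scores.argmax()].item()  # predicted entity id
```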

V. EXPERIMENTS
We test the proposed model using two standard NED datasets and the benchmark GERBIL platform. Firstly, the datasets and comparative models are listed in Section V-A and Section V-B. Then we describe the experimental settings, which are widely used in the NED domain, in Section V-C, and the experimental results in Section V-D. Finally, the performance on the GERBIL platform, compared with 8 baselines on 10 datasets, is discussed in Section V-E.

A. DATASETS
To evaluate the efficiency of the proposed pre-training NED model, we conduct experiments on the CoNLL dataset [27] and the TAC dataset [48], which are commonly used by [4], [5], [8], [29], [49]. The CoNLL dataset [27] is a popular news-based NED dataset. The TAC dataset [48] is also a news-based NED dataset, constructed for the Text Analysis Conference (TAC); it is based on news articles from various agencies and Web log data, and consists of a training set and a test set containing 1,043 and 1,013 documents, respectively. In addition, we use several data sources to train our model: (i) We use a Wikipedia snapshot, hypothesizing that retrieving from a large and high-fidelity corpus will provide cleaner language. We therefore construct a Wikipedia dataset (denoted as the Wiki dataset) for training the proposed model. We preprocess the Wikipedia articles with the following rules: first, we remove articles with fewer than 100 words, as well as articles with fewer than 10 links; then we remove all category pages and disambiguation pages; moreover, we move the content to the corresponding redirection pages. Finally we obtain about 3.74 billion Wikipedia articles and 21.3 million entity annotations for indexing and training. (ii) We also consider training our model by performing retrieval on large auxiliary corpora, and news documents are used for this purpose. We therefore construct a news dataset (denoted as the News dataset) for training the proposed pre-training NED model. The news articles are extracted from a large news corpus, which contains approximately 2.6 billion articles and 13.5 million entity annotations collected from Reuters, the New York Times, and other sources.
For all datasets, this paper utilizes the standard entity candidate set, i.e., KB+YAGO (details in Section V-C1), with the associated prior probabilities P(entity|entity-mention) following [5], and only the Top-30 entity candidates ranked by P(entity|entity-mention) are considered (i.e., N = 30 in Section IV-C3). For each data source, we generate input embeddings (Section IV-C1) by splitting the content of each article into sequences of at most 256 words together with their entity annotations. We train the proposed model using the texts and their corresponding entity annotations retrieved from the Wiki dataset and the News dataset. Note that only mentions referring to valid entities in the Wiki dataset and the News dataset are considered.
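For illustration, the candidate filtering and sequence splitting above can be sketched as follows; the data layout and names are illustrative assumptions.

```python
# A minimal sketch of data preparation: Top-30 candidates by the prior
# P(entity|mention) and splitting an article into sequences of <= 256 words.
def top_candidates(prior, mention, n=30):
    """prior: dict mention -> dict entity -> P(entity|mention)."""
    ranked = sorted(prior.get(mention, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [entity for entity, _ in ranked[:n]]

def split_article(words, max_len=256):
    """Split the article's word list into chunks of at most max_len words."""
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]
```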

B. BASELINES
For evaluating the NED task, we compare the proposed model with the following comparative models:
(Hoffart et al., 2011): [27] is a graph-based model aiming at finding a dense sub-graph of entities in an input-text to address the problem of NED.
(Cai et al., 2013): [37] utilizes deep neural networks to derive the representations of entities and entity-mention contexts, and then applies them to NED.
(Chisholm et al., 2015): [40] utilizes an extra Wiki-links dataset to improve the performance of NED.
(Pershina et al., 2015): [31] presents a graph-based disambiguation approach based on Personalized PageRank (PPR) [32] that combines local and global evidence for disambiguation.
(Globerson et al., 2016): [29] proposes a novel coherence model with an attention-like mechanism, where the score for each candidate depends only on a small subset of entity-mentions.
(Yamada et al., 2016): [4] creates a state-of-the-art NED system that models entity-context similarity with word embeddings and entity embeddings trained using the Skip-Gram model [36].
(Chen et al., 2018): [6] proposes a comprehensive approach for short-text entity recognition and linking, using the concepts of entities as fine-grained topics to explicitly represent the context and model topic coherence.
(Ganea et al., 2017): [5] combines the benefits of deep learning with more traditional approaches such as graphical models [33] and probabilistic mention-entity maps.
(Zhu et al., 2018): [7] exploits the effectiveness of both knowledge-based and corpus-based semantic similarity methods for NED; besides, a joint learning framework for word and category embeddings is proposed in this work.
(Le et al., 2018): [8] treats relations as latent variables and induces the relations without any supervision while optimizing the NED method in an end-to-end manner.

C. EXPERIMENT SETTINGS
1) KB AND ALIAS-ENTITY MAPPING
As discussed above, for all datasets this paper utilizes the standard entity candidate set, denoted as KB+YAGO, which is widely used in NED research [5], [29]. This standard entity candidate set KB+YAGO consists of two components: (i) an entity set KB, and (ii) a mapping scheme YAGO. Following previous work [29], the referent KB is derived from the Wikipedia subset of Freebase [51]. We also collect alias counts from Wikipedia page titles (including redirects and disambiguation pages), Freebase aliases, and Wikipedia anchor text. Besides, we optionally use the mapping from aliases to candidate entities released by [27], obtained by extending the ''means'' relations of YAGO [52]. Every annotated entity-mention can be mapped to at least one entity, and the set of entities includes the ''gold'' entity. However, changes in canonical Wikipedia URLs, accented characters and unicode usually result in mention losses over time, as not all URLs can be mapped to the KB [53]. Table 1 summarizes the statistics of the KB+YAGO alias-entity mapping on the CoNLL test-b dataset and the TAC dataset (restricted to non-NIL mentions), as reported in [29]. ''Mention-Recall'' indicates the percentage of entity-mentions with at least one known entity, ''Gold-Recall'' represents the percentage of entity-mentions whose gold entity is included in the entity candidates, ''Unique'' is the percentage of aliases that map to exactly one entity, and ''Avg.'' is the number of entity candidates averaged over entity-mentions.

2) ENTITY CANDIDATE SELECTION
We make use of a mention-entity prior P(entity|entity-mention) [5] for entity candidate selection (Section IV-C3), which can be viewed as the context-independent probability of selecting an entity conditioned only on the entity-mention. This probability is computed by averaging probabilities from two indexes built from mention-entity hyperlink statistics from Wikipedia and a large Web corpus [54], plus the YAGO index of [27] (with uniform prior) [5], [8]. As discussed before, for all datasets this paper utilizes KB+YAGO (Section V-C1), with the associated prior probabilities P(entity|entity-mention) following [5], to generate the entity candidates used in Section IV-C3, and only the Top-30 entity candidates ranked by P(entity|entity-mention) are considered (i.e., N = 30). Note that, similar to previous work, we refrain from annotating mentions without any entity candidate, implying that precision and recall can differ.
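For illustration, the averaging of the prior sources described above can be sketched as follows; the exact combination used by [5], [27], [54] may differ, and the data layout is an assumption.

```python
# A minimal sketch of a mention-entity prior obtained by averaging several
# per-source probability indexes (Wikipedia, Web corpus, uniform YAGO prior).
def mention_entity_prior(indexes, mention):
    """indexes: list of dicts mention -> {entity: P(entity|mention)}, one per source."""
    entities = {e for index in indexes for e in index.get(mention, {})}
    return {
        e: sum(index.get(mention, {}).get(e, 0.0) for index in indexes) / len(indexes)
        for e in entities
    }
```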

3) PARAMETER SETTINGS
For the CoNLL dataset, we also use its development set (test-a) for the parameter tuning described above. Following [27], we only use the 27,816 entity-mentions with valid entries in KB+YAGO and report the standard micro accuracy (aggregated over all entity-mentions) of the top-ranked candidate entities to assess disambiguation performance. For the TAC dataset, following [37], [40], this section utilizes only entity-mentions with a valid entry in KB+YAGO, and reports the micro-accuracy score of the top-ranked candidate entities. Consequently, we evaluate our model on the 1,020 entity-mentions contained in the test set.
For the proposed model, the standard configuration of the BERT model [9] is adopted, although many variants exist: (i) the vector dimension k is 1024, and the maximum length of an input-text is set to 256 words; (ii) for the bidirectional transformer encoder, the number of layers (i.e., transformer blocks) is 24, and the number of self-attention heads is 16; (iii) the feed-forward/filter size is 4096; (iv) the dropout probability is set to 0.1 for all layers; (v) the learning rate is 2e-5 (chosen among 5e-5, 4e-5, 3e-5, and 2e-5), and the batch size is set to 252; (vi) the masked percentage is set to 35%, which yields the best experimental result (a detailed comparative evaluation of the masked percentage is discussed in Section V-D3); and (vii) only the Top-30 entity candidates ranked by P(entity|entity-mention) are fed into our model (N = 30 is widely adopted in current work on the same datasets; different choices of Top-N are compared in Section V-D3). Other hyper-parameters are set to the values reported by [9].
During the masked-entity language model's training procedure (Section IV-C2), if the i-th entity-mention is chosen to be masked, we replace it with (i) the [MASK] token 80% of the time (e.g., in Figure 2 (b)), (ii) a random entity-mention 10% of the time, and (iii) the unchanged i-th entity-mention 10% of the time. Besides, the same 30K-WordPiece vocabulary [41] and the sub-word tokenizer of BERT are adopted, with |V| = 30,000, for sub-word tokenization. The reason why 30K-WordPiece is used here is that we want to transform each instance into a ''BERT-compatible'' format, i.e., into a sequence of WordPiece tokens, since the main architecture used here is a pre-training model. We believe that sharing 30K-WordPieces helps our task because lexical-semantic relationships are similar for words composed of the same morphemes. The GELU (Gaussian Error Linear Unit) activation function is adopted in Eq. (2). We obtain the set of concept constraints T for Concept Correlation Prediction (details in Section IV-B) following previous work on instance conceptualization [14], [16], [46]; in total we obtained 154.2 million concept triples from Probase [12].
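For illustration, the 80%/10%/10% corruption scheme above can be sketched as follows; the identifiers are illustrative.

```python
# A minimal sketch of the 80/10/10 corruption scheme for an entity chosen to be masked.
import random

def corrupt_masked_entity(entity_id, mask_id, random_entity):
    r = random.random()
    if r < 0.8:
        return mask_id             # 80%: replace with the [MASK] token
    if r < 0.9:
        return random_entity()     # 10%: replace with a random entity
    return entity_id               # 10%: keep the original entity
```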
Besides, the statistical t-test [55], [56] is employed: to decide whether the improvement of model A over model B is significant, the t-test computes a value p* based on the performance of model A and model B. The smaller p* is, the more significant the improvement is. If p* is small enough (p* < 0.05), we conclude that the improvement is statistically significant.

D. EXPERIMENTAL RESULTS AND ANALYSIS
1) PERFORMANCE SUMMARY
Table 2 reports the results on the CoNLL dataset in terms of micro accuracy (i.e., mention-averaged accuracy). Like most previous work, we report performance on the 231 test-b documents. Besides, similar to past work, we retrieve possible entity-mention surfaces of an entity from: (i) the title of the entity; (ii) the title of another entity redirecting to the entity; and (iii) the names of anchors pointing to the entity [4]. From the experimental results in Table 2, compared with the best previous results (e.g., [8]), the proposed NED+CeBERT improves the micro accuracy score by 5.16%, 4.34% and 3.38%, respectively, and the improvement over (Yamada et al., 2016) [4] is statistically significant (p* < 0.05). This result verifies that the proposed unsupervised pre-training mechanism, enhanced with prior concept semantics (from the extra lexical knowledge graph Probase) in Section IV-B and with the masked-entity language model in Section IV-C, is beneficial for the NED task. It should be noted that all the other models use, at least partially, engineered features. The merit of our proposed model is to show that, with the exception of the P(entity|entity-mention) feature, a pre-training mechanism is able to learn the best (implicit) features for NED without requiring extra expert input.
Table 3 shows our results for the TAC dataset, where we also utilize the YAGO+KB entity-alias mapping mechanism (details in Section V-C1) for all our experiments. Note that the competition evaluation for the TAC dataset includes NIL entities, referring to entities that are not in the knowledge graph, and TAC's participants are required to cluster NIL mentions across documents so that all mentions of each unknown entity are assigned a unique identifier. For entity-mentions that cannot be linked to our KB+YAGO, we simply use the mention string as the [NIL] identifier, similarly to previous work [29]. Following past work [4], [37], [40], we use entity-mentions only with a valid entry in the KB+YAGO, and report the micro-accuracy score of the top-ranked candidate entities in Table 3.

Furthermore, the experimental results show that the proposed pre-training NED model significantly improves performance on the CoNLL dataset, but contributes only slightly on the TAC dataset. For example, the proposed model without concept semantic boosting (denoted as NED+BERT in Table 3) only narrowly beats the best previous result (Le et al., 2018) [8]. We argue that this is due to the difference in the density of entity-mentions between the datasets: the CoNLL dataset contains approximately 20 entity-mentions per document, whereas the TAC dataset contains approximately one mention per document. Notably, we attain results comparable with those of some state-of-the-art models on both datasets by using only the original pre-training mechanism (i.e., NED+BERT), and we observe a significant improvement when adding the proposed concept-enhanced mechanism, i.e., NED+CeBERT. This further validates the intuition that deep contextualized embeddings trained with unsupervised language modeling (e.g., BERT [9]) are successful in a wide range of NLP tasks. Moreover, (Ganea et al., 2017) [5], (Chen et al., 2018) [6] and the proposed model all utilize concept information, but (Ganea et al., 2017) [5] introduces implicit concepts, whereas the others use explicit concept information defined in the high-quality lexical knowledge base Probase. The experimental results in Table 2 and Table 3 show that (Ganea et al., 2017) falls behind the other two, which highlights the importance of extra prior knowledge and corroborates our motivation. Furthermore, compared with (Chen et al., 2018) [6], the proposed NED+CeBERT improves the micro accuracy score by 4.11% and 1.88% on the CoNLL dataset and the TAC dataset, respectively, even though both models utilize the same extra lexical knowledge resource. We attribute this success to the context-aware representation ability of the pre-training scheme.
Moreover, [19] also introduced an entity-level masking strategy, masking entity-mentions (essentially, words) that are usually composed of multiple words, and ensuring that all of the words in the same entity-mention are masked during word representation training, instead of only one word or character being masked. The proposed NED+BERT and NED+CeBERT also have this characteristic. To sum up, there are essential differences between our model and [19] (as shown in Table 4): (i) [19] proposed three masking strategies, i.e., basic-level masking, phrase-level masking and entity-level masking. However, all of them mask words (tokenized by 30K-WordPiece [41]) and train only word embeddings; e.g., in its entity-level masking strategy, [19] masks entity-mentions (i.e., words) but not entities. Therefore, what [19] predicts are still words, and what [19] trains are only word embeddings. In contrast, our masked-entity language model (MELM) masks entities and trains entity embeddings. Hence, the ''entity-level'' masking in [19] is still word-level while ours is entity-level. (ii) The entity-level masking in [19] can be viewed as a constraint and [19] inherits the conventional objective of BERT, while our masked-entity language model is a novel objective (as discussed in Section IV-C). (iii) In both [19] and our model, all of the words in the same entity are considered together: an entity is taken as one unit, which is usually composed of several words, and all of the words in the same unit are masked during representation training, instead of only one word. For example, in the proposed masked-entity language model (MELM), if an entity contains multiple words, we compute its position embedding by averaging the embeddings of the corresponding positions.

2) EFFECTS OF CONCEPT CORRELATION PREDICTION
Current unsupervised pre-training models mostly rely on (masked) language modeling objectives that exploit the knowledge encoded in large corpora. This paper complements the encoded distributional knowledge with external lexical knowledge (i.e., Probase), and proposes a novel classification objective, Concept Correlation Prediction (denoted as CCP; details in Section IV-B2). To evaluate the effect of injecting prior lexical knowledge, this section trains the same model from scratch with our CCP (i.e., the concept-enhanced BERT, with the ''-CeBERT'' suffix, shown as the blue part in Figure 2 (a)) and without it (i.e., the original BERT, with the ''-BERT'' suffix, shown as the green part in Figure 2 (a)); the results are reported in Table 2 and Table 3, respectively. The experimental results show an additional performance improvement when leveraging concept constraints. In particular, the relative micro accuracy improvements over NED+BERT on the CoNLL dataset and the TAC dataset are 2.34% and 1.33%, respectively. This demonstrates that supplementing unsupervised pre-training models (such as BERT [9]) with beneficial extra lexical knowledge, i.e., clean linguistic information from structured external resources, can improve NED performance. Note that, as described in Section IV-B2, although the ''comparative concept correlation'' constraint is utilized here, other kinds of lexical relation constraints could also be adapted thanks to the generality and flexibility of our model. Moreover, this further validates the intuition from prior work on concept-enhanced contextualized embeddings (for words [15], entities [21], sentences [23], etc.) that leveraging concept priors to capture robust semantic representations has a positive impact on natural language understanding.

TABLE 4. Comparison analysis between the proposed model (NED+CeBERT/NED+BERT) and [19], which also introduced an entity-masking strategy.
Moreover, some output results produced by the different comparative algorithms are shown in Table 5, wherein the underlined words in the ''text context'' column are the entity-mentions. Comparing NED+BERT, which removes Concept Correlation Prediction, with NED+CeBERT demonstrates the effectiveness of Concept Correlation Prediction, especially in text context IV, where the contextual word ''watch'' strongly indicates that the entity-mention ''Harry Potter'' refers to a movie, with the help of the instance conceptualization algorithm used to construct the concept correlation constraints.

3) OPTIMIZATION OF PARAMETERS
In this section, we analyze the robustness of the parameter settings that could affect the overall performance of the proposed model: (i) the masked percentage, and (ii) different choices of Top-N. All experiments in this section are run on the CoNLL dataset and the TAC dataset.
The masked percentage is the ratio of masked entities to all the entities occurring in the given input-text sequence. When optimizing our Masked-Entity Language Model (MELM) objective in Section IV-C, we randomly mask some entities in the input-text sequence and then train the embeddings and parameters to predict these masked entities based on the words and the other non-masked entities. We evaluate increasing values of this parameter, and the results are reported in Figure 3, where the suffixes ''-CoNLL'' and ''-TAC'' indicate the experimental results on the CoNLL dataset and the TAC dataset, respectively. Overall, we find that the masked percentage considerably affects performance. With the masked percentage around 35%, both the proposed NED+CeBERT and NED+BERT reach their optimal micro accuracy scores, while larger values result in a performance drop, as too many beneficial semantic signals are lost, making entity prediction harder. Moreover, from the experimental results in Figure 3, we observe that the proposed NED+CeBERT is almost insensitive to masked-percentage fluctuations, which demonstrates the effective use of the concept correlation prediction objective (Section IV-B). Figure 4 and Figure 5 show the sensitivity of our model and the comparative models to the parameter N, i.e., the number of candidate entities, on the CoNLL dataset and the TAC dataset, respectively. We obtain very good prediction results with a small number of candidate entities, compared with models such as (Le et al., 2018), which invariably select only 7 candidate entities from the Top-30 candidates based on an extra heuristic. Interestingly, the proposed NED+CeBERT and NED+BERT achieve similar stability although we do not introduce any extra heuristic for a second filtering step.
E. PERFORMANCE ON GERBIL PLATFORM
Specifically, we compare the proposed model with 8 other state-of-the-art NED models over 10 different datasets, by implementing a web-based API for our model that is compatible with GERBIL. Details about the aforementioned baselines and datasets can be found in [64]. The strong annotation task (D2KB) is considered in this section, wherein the comparative models are given an input-text containing a number of marked tokens and must return the entity the model associates with each marked token. We evaluate the proposed model using the aforementioned datasets available by default in GERBIL, which are primarily based on news articles, RSS feeds and tweets. Table 6 and Table 7 show the micro-F1 scores (aggregated across mentions) and macro-F1 scores (aggregated across texts), respectively. The experimental results show that the proposed model (denoted as NED+CeBERT) beats the state-of-the-art baselines provided by GERBIL in most cases, and provide convincing evidence that the proposed model works well on several types of datasets.

VI. DISCUSSION
Previous neural-network-based methods have recently achieved strong results on the NED task. Generally, the key component of these methods is an embedding model of words or entities trained using a large knowledge base. These models are typically based on conventional word embedding models that assign a fixed embedding to each word and entity. Although these methods have advanced the state-of-the-art results on the NED task, they fail to model and represent complex relationships and context, and multiple signals (i.e., words, entities, context, etc.) cannot fully interplay in their architectures. To overcome these problems, in this study we test the effectiveness of pre-trained contextualized embeddings for the NED task, because language model pre-training has been shown to be effective for improving many NLP tasks, owing to its ability to model complex context. Moreover, previous pre-training work relied only on distributional knowledge while ignoring lexical semantic information for understanding language. Therefore, this paper also investigates how to enhance the pre-training model's ability by leveraging prior concept information.

TABLE 6. Micro-F1 scores of the proposed model (denoted as NED+CeBERT) and the state-of-the-art models provided by GERBIL platform.
TABLE 7. Macro-F1 scores of the proposed model (denoted as NED+CeBERT) and the state-of-the-art models provided by GERBIL platform.
Overall, this paper tackles the NED problem by leveraging two novel objectives within a pre-training framework, and proposes a novel pre-training NED model. Specifically, the proposed pre-training NED model consists of: (i) concept-enhanced pre-training, aiming at identifying valid lexical semantic relations with the concept semantic constraints derived from the external resource Probase (details in Section IV-B); and (ii) the masked-entity language model, aiming to train the contextualized embeddings by predicting randomly masked entities based on the words and non-masked entities in the given input-text (details in Section IV-C).
(i) Concept-Enhanced Pre-Training: The proposed concept-enhanced pre-training model successfully blends extra lexical knowledge with distributional learning signals. Moreover, other kinds of linguistic information, such as synonymy, could be adopted thanks to the generality of the proposed model. The experimental results in Section V-D2 suggest that available extra lexical knowledge can supplement unsupervised pre-training models with useful information that cannot be fully captured from plain text alone, even when the corpus is very large.
(ii) Masked-Entity Language Model: The intuition behind the proposed masked-entity language model is that, following the strategy of the original BERT in which a certain percentage of words is masked and must be predicted, we mask a certain percentage of entities in a similar way to pre-train the model for the NED task, since both operations are at the token level.

VII. CONCLUSION
For the task of Named Entity Disambiguation (NED), this paper proposes a novel unsupervised pre-training model for words and entities, which is enhanced by external lexical knowledge (i.e., concept knowledge from Probase). In particular, a novel masked-entity language model is introduced, aiming to train the contextualized embedding model by predicting randomly masked entities based on the words and non-masked entities in the given input-text sequence. Experimental results show that our pre-training NED model achieves enhanced performance on both datasets, as well as on the benchmark GERBIL platform.