
Quantifying Domain Knowledge in Large Language Models


Abstract:

Transformer-based large language models such as BERT have demonstrated the ability to derive contextual information from the words surrounding a target word. However, when these models are applied in specialized domains such as medicine, insurance, or scientific disciplines, publicly available models trained on general knowledge sources such as Wikipedia may not infer the appropriate context as effectively as domain-specific models trained on specialized corpora. Given the limited availability of training data for specific domains, pre-trained models can be fine-tuned via transfer learning using relatively small domain-specific corpora. However, there is currently no standardized method for quantifying how effectively these domain-specific models acquire the necessary domain knowledge. To address this issue, we explore hidden-layer embeddings and introduce domain_gain, a measure that quantifies a model's ability to infer the correct context. In this paper, we show how our measure can be used to determine whether words with multiple meanings are more likely to be associated with their domain-related meanings rather than their colloquial meanings.
Date of Conference: 05-06 June 2023
Date Added to IEEE Xplore: 02 August 2023
Conference Location: Santa Clara, CA, USA

I. Introduction

Contextual Large Language Models (LLMs), such as BERT [1], represent a significant breakthrough in Natural Language Processing (NLP) because they generate contextualized word representations that disambiguate words with multiple meanings. However, training these models is time-consuming and computationally intensive. The largest publicly available models are trained on general knowledge sources such as Wikipedia and BookCorpus. Some of these models have been fine-tuned on domain-specific corpora that include specialized terminology. For instance, the bert-base-cased model was fine-tuned on medical insurance text documents to produce the bert-fine-tuned-medical-insurance-ner model [2].
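The domain_gain formula itself is not reproduced in this excerpt, so the following is only a minimal sketch of the underlying idea: probing a model's hidden-layer embeddings to see whether an ambiguous word (here "claim") sits closer to a domain-related reference sentence than to a colloquial one. It assumes the Hugging Face transformers API and bert-base-cased; the probe sentences, the target word, and the use of the final hidden layer are illustrative choices, not the authors' experimental setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def word_embedding(model, tokenizer, sentence, word, layer=-1):
    """Return the hidden-state embedding of `word`'s token in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[layer][0]  # (seq_len, hidden_dim)
    # Locate the word's token position (assumes the word is a single subword).
    word_id = tokenizer.encode(word, add_special_tokens=False)[0]
    position = (inputs["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")  # swap in a fine-tuned model to compare

# Illustrative sentences (assumptions, not from the paper):
domain_sent = "The policyholder filed a claim after the accident."       # insurance sense
colloquial_sent = "She made a bold claim during the debate."             # colloquial sense
probe_sent = "Please submit the claim form to the insurer."              # ambiguous probe

domain_ref = word_embedding(model, tokenizer, domain_sent, "claim")
colloquial_ref = word_embedding(model, tokenizer, colloquial_sent, "claim")
probe = word_embedding(model, tokenizer, probe_sent, "claim")

cos = torch.nn.functional.cosine_similarity
print(f"similarity to domain sense:     {cos(probe, domain_ref, dim=0).item():.3f}")
print(f"similarity to colloquial sense: {cos(probe, colloquial_ref, dim=0).item():.3f}")
```

Running the same probe with a general-purpose model and with a domain fine-tuned model (e.g., bert-fine-tuned-medical-insurance-ner) and comparing how strongly each pulls the ambiguous word toward the domain sense is one plausible way such a gain could be measured; the paper's actual definition of domain_gain may differ.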

