Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers

With the accelerating development of science and technology, the number of academic papers published across various fields is increasing rapidly. Academic papers, especially in science and technology, are a crucial medium for researchers who identify the latest technological trends, develop new technologies, and conduct derivative studies. Therefore, the continual collection of extensive academic papers, structuring of metadata, and construction of databases are significant tasks. However, research on automatic metadata extraction from Korean papers is currently not being actively conducted owing to insufficient Korean training data. We automatically constructed the largest labeled corpus in South Korea to date from 315,320 PDF papers belonging to 503 Korean academic journals; this corpus can be used to train models that automatically extract 12 metadata types from PDF papers. The labeled corpus is available at https://doi.org/10.23057/48. Moreover, we developed an inspection process and guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. The reliability of the inspected data was verified through inter-annotator agreement measurement. Using our corpus, we trained and evaluated a BERT-based transfer learning model to verify the corpus's reliability. Furthermore, we proposed new training methods that improve the metadata extraction performance for Korean papers, through which we developed the KorSciBERT-ME-J and KorSciBERT-ME-J+C models. KorSciBERT-ME-J achieved the highest performance, with an F1 score of 99.36%, as well as robust performance in automatic metadata extraction from Korean academic papers in various formats.


I. INTRODUCTION
The metadata of academic papers refers to information such as the title, author names, affiliated organizations, abstract, keywords, DOI, volume, issue, year of publication, pages, and journal name. These metadata constitute one of the most important components of modern information systems [1]. Many researchers need an information system that allows them to search and utilize academic papers in various science and technology fields so that they can identify rapidly developing scientific and technological trends and conduct derivative research. For such a system to provide more intelligent information retrieval and recommendation services to researchers, building high-quality, large-scale metadata for each paper is essential. (The associate editor coordinating the review of this manuscript and approving it for publication was Le Hoang Son.)
In South Korea, information related to science and technology is currently constructed and managed by the Korea Institute of Science and Technology Information (KISTI). KISTI also provides ScienceON, a public service that supports users in conducting R&D activities on one site by linking and providing information, data, and knowledge infrastructure, and by making it easy and convenient to search and utilize various academic information. In particular, ScienceON provides an integrated search service for papers, patents, reports, trend information, and researcher information. For the paper search service, we have been collecting the original document files of Korean academic papers through agreements with various academic societies that publish journals, and building our own metadata database (DB) from those papers. However, as most academic papers are distributed in unstructured formats such as PDF, the metadata DB is constructed manually. Since manual construction is time-consuming and costly, a method for automatically extracting metadata from PDF files of academic papers is required.
Research on automatic metadata extraction from academic papers is being actively conducted. CERMINE [2] is an automatic extraction method for metadata and reference information that uses coordinate and text information from PDFs of academic papers. A rule-based methodology is used for recognizing and extracting metadata from the PDF, whereas a machine learning based support vector machine (SVM) [3] is used for classifying the metadata belonging to each area. GROBID [4] extracts metadata from PDF documents using the conditional random field (CRF) [5] method. Both CERMINE and GROBID, which are open-source software, are representative works on automatic metadata extraction from PDFs of academic papers. However, these methods were developed for English papers. As English and Korean have different language structures, these methods are difficult to apply to the automatic extraction of metadata from Korean papers, and their performance is limited for constructing a real database.
Studies on automatic metadata extraction from Korean academic papers have also been conducted. The authors of [6] proposed a bidirectional gated recurrent unit (GRU) CRF model using the texts and coordinates of each metadata layout. The training data were collected from 10 types of Korean academic journals and an automatic extraction model for 9 metadata fields was developed. In [7], the metadata layouts of Korean papers were extracted and a layout pretraining model, Layout-MetaBERT, was developed, which could perform automatic metadata extraction. Training data for 70 types of Korean academic journals were constructed and an automatic extraction model for 10 metadata fields was proposed.
The format of papers is an important factor in the automatic extraction of metadata from PDF files. The format varies depending on the journal type, and even within the same journal, the format may differ according to the publication year, volume, and issue. Currently, over 500 types of Korean academic journals are collected by KISTI. However, the training data from the two prior studies [6], [7] are not sufficient to cover this number of journal formats. No other studies on metadata extraction from Korean papers have been conducted, mainly because of the lack of the large amount of high-quality training data required to follow the latest research trends using deep learning based models.
Many studies have presented models that exhibit high performance on various natural language processing (NLP) tasks. Nevertheless, deep learning based NLP technologies are difficult to apply in many countries because of the insufficient amount of data for numerous languages. Therefore, research on constructing corpora in various languages for the development of NLP technology has recently been actively conducted [8], [9]. Much research is also being conducted in South Korea to develop datasets and models for applying various NLP technologies to the Korean language [10], [11], [12]. However, the lack of sufficient data remains a problem. Moreover, metadata extraction research on Korean academic papers requires training data covering more than 500 types of Korean academic journals. Therefore, in this study, we automatically constructed a labeled dataset using the PDF documents of Korean academic papers and the metadata DB built by KISTI to date. The dataset consists of papers from 503 types of Korean academic journals and is labeled with 12 metadata fields. Compared with previous studies, the numbers of journal types and metadata fields have increased; in particular, the number of journal types increased significantly, so that an automatic metadata extraction model trained with our data can extract metadata from papers in a much wider variety of formats than before. In addition, we fully inspected a portion of the constructed dataset to obtain perfectly labeled data, and we used this inspected subset as the validation and testing datasets. For the data inspection, we developed an inspection process and guidelines in this study. The resulting dataset, constructed and inspected as described, is available at https://doi.org/10.23057/48 for many researchers to use in various studies as well as in automatic metadata extraction.
In this study, we trained and evaluated an automatic metadata extraction model to verify the reliability of our constructed data. We evaluated the performance with a transfer learning model based on bidirectional encoder representations from transformers (BERT) [13], a pre-trained language model (PLM) that has shown high performance in various NLP tasks. This experiment showed that our constructed data are of high quality and reliability. Furthermore, we conducted additional experiments by developing two automatic metadata extraction models to which new training methods were applied. The first method generates a unique code that encodes information on the journal format and inputs it together with the training data so that the model learns the features of various journal formats. The second method modifies the input structure of the model so that coordinate information for each metadata layout can be learned together with the unique code. We developed these two models as transfer learning models based on KorSciBERT, a PLM specialized in the science and technology domain. In this paper, the first model is referred to as ''KorSciBERT-ME-J'' and the second as ''KorSciBERT-ME-J+C''. Evaluating these two models trained with our corpus, we found that our proposed methods are suitable for automatic metadata extraction from Korean papers.
The contributions of this paper, which are also shown in Fig. 1, are summarized as follows:
• We automatically constructed a labeled corpus for the automatic extraction of 12 metadata fields from 315,320 Korean academic papers belonging to 503 journals. Our corpus is the largest dataset for automatic metadata extraction from Korean papers in South Korea, and it has been made openly available so that many researchers can use it for researching and developing deep learning models.
• We developed an inspection process and annotation guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. By measuring inter-annotator agreement, we verified the high reliability of the inspection process and the inspected data.
• We trained an automatic metadata extraction model using our corpus and evaluated its performance. The evaluation results confirmed the reliability of our corpus. Furthermore, we developed new models, KorSciBERT-ME-J and KorSciBERT-ME-J+C, for automatic metadata extraction, and our proposed models showed high performance.
The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 describes the method of automatically constructing and inspecting the labeled data, as well as our proposed automatic extraction models. Section 4 presents the results of experiments evaluating the dataset we constructed and the performance of the automatic metadata extraction models. Finally, the conclusions of our study and future work are discussed in Section 5.

II. RELATED WORKS
Studies on the automatic extraction of information such as metadata from documents have been conducted using various approaches. Initially, metadata extraction using the rule-based method was performed extensively [14], [15], [16], [17]. Furthermore, metadata extraction using keywords and a pattern matching method was performed in [18]. The authors of [2], [19], and [20] proposed automatic metadata extraction techniques using the SVM [3] as a classifier. Moreover, in [4], [21], and [22], the CRF [5] technique was used and the work in [23] applied the hidden Markov model.
Subsequently, as the excellent performance of deep learning technology was proven, various studies using classifier models that can efficiently classify specific metadata categories were conducted. The method presented in [24] extracted metadata using a convolutional neural network (CNN) [25] and a bidirectional long short-term memory (LSTM) CRF [26]. The authors of [6] suggested a technology for classifying metadata using a bidirectional GRU CRF model.
In recent years, BERT [13], a PLM, has achieved high performance across NLP tasks, and many studies have demonstrated its effectiveness through methodologies that fine-tune the BERT model for text classification tasks [27], [28]. On this basis, many BERT models pre-trained on corpora of English papers have been proposed [29], [30]. Moreover, KorSciBERT, which was pre-trained on a corpus of Korean papers and patents in the science and technology field, was released in South Korea. Recently, research using the BERT model for automatic metadata extraction from Korean papers has been conducted. The authors of [7] proposed a metadata extraction method using Layout-MetaBERT, which was developed by pre-training the BERT model with the metadata layout information of papers. However, that research used training data from only 70 types of Korean academic journals, which is insufficient to cover the full range of journals. Unlike the work of [7], we constructed training data from approximately 500 types of Korean academic journals, so that a new model using our data has higher coverage and can extract metadata from Korean papers in a much wider variety of journal formats. In this paper, we developed new models, KorSciBERT-ME-J and KorSciBERT-ME-J+C, by applying the concept of learning auxiliary sentences with reference to [27].

III. METHODOLOGY
In this section, we first introduce the method of automatically constructing labeled data for the automatic metadata extraction from PDF papers, and thereafter, we explain the method of inspecting the automatically constructed data. Finally, we present two models for automatic metadata extraction proposed in this paper. An overview of our approach is depicted in Fig. 1.

A. AUTOMATIC CONSTRUCTION OF DATA
1) SELECTION OF JOURNALS
At present, KISTI manually constructs the metadata DB from the original PDF documents of Korean papers. In this study, we aimed to construct an annotated corpus from the PDF papers collected by KISTI for developing automatic metadata extraction models. First, among all journals collected up to April 16, 2021, we selected the range of journal types according to the following criteria: journals for which transmission rights have been secured; journals corresponding to thesis journals, academic journals, and English journals; and journals whose final publication year is between 2020 and 2021. A total of 527 journals were selected based on these criteria, and 595,830 papers published in these journals were selected. As our aim was to construct a labeled corpus for automatic metadata extraction from PDF papers, we narrowed the scope to papers having original PDF files among the selected 595,830 papers. Finally, we acquired the original PDF files of 564,153 papers corresponding to a total of 504 journals, together with the metadata DB for these papers, as the target data for automatically constructing the labeled corpus.

2) DATA PREPROCESSING
Data preprocessing extracts and structures text from PDF files for constructing an annotated corpus. In this study, we extracted text from the PDF files using the Python library PDFMiner (https://pypi.org/project/pdfminer) and automatically extracted the layout boxes.
We performed preprocessing to extract the layout boxes of each metadata field using information from the TextBox, TextLine, and Char objects provided by PDFMiner. The process is outlined as follows. First, TextLine objects are extracted from the first page of the PDF file. The TextLine object contains the text information of each line unit in the PDF document along with the x (x0, x1) and y (y0, y1) coordinate information of each line. In some cases, what appears to be one line in the PDF document is not combined into one TextLine object by PDFMiner but is instead separated into two lines. Therefore, a first concatenation process, which concatenates objects with the same y0 value, is performed for each TextLine object to combine these separated lines into one line.
Subsequently, the first-concatenated lines are subdivided into Char objects, which are subordinate objects of TextLine. The Char object is separated into single-character units and contains the text information, the x (x0, x1) and y (y0, y1) coordinate information, the font name, and the font size for each character. Here, the (x0, y0) coordinates are the x-axis and y-axis coordinates of the lower-left point of the layout box, and the (x1, y1) coordinates are the x-axis and y-axis coordinates of the upper-right point. As each metadata field does not necessarily consist of a single line, a second concatenation process is required to bind the lines into box units. An example of this case is the abstract, which spans multiple lines. Each metadata field in a paper is usually written with the same font name and font size. Therefore, we could extract the font name and font size of the Char objects and concatenate lines into one box if they match. However, even if the font name and font size are the same, the metadata may differ. For this case, we concatenated lines into box units subject to a condition on the distance between lines, so that when this distance changes, the line is recognized as belonging to a different box and the lines are not concatenated.
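The two concatenation steps above can be sketched as follows. This is a simplified illustration only: it assumes line fragments have already been extracted (e.g., via PDFMiner) into dictionaries with hypothetical keys (`text`, `x0`, `y0`, `font`, `size`), and the `max_gap` threshold is an illustrative value, not the one used in this work.

```python
from itertools import groupby

def concat_lines(lines):
    """First concatenation: merge fragments that share the same y0 value
    (i.e., fragments that visually form one line), ordered left to right."""
    lines = sorted(lines, key=lambda l: (-l["y0"], l["x0"]))
    merged = []
    for y0, group in groupby(lines, key=lambda l: l["y0"]):
        group = sorted(group, key=lambda l: l["x0"])
        merged.append({
            "text": " ".join(g["text"] for g in group),
            "y0": y0,
            "font": group[0]["font"],
            "size": group[0]["size"],
        })
    return merged

def concat_boxes(lines, max_gap=14.0):
    """Second concatenation: bind consecutive lines into one layout box
    when the font name and font size match and the vertical gap between
    lines stays below a threshold; otherwise start a new box."""
    boxes = []
    for line in lines:
        if (boxes
                and boxes[-1]["font"] == line["font"]
                and boxes[-1]["size"] == line["size"]
                and boxes[-1]["y0"] - line["y0"] <= max_gap):
            boxes[-1]["text"] += " " + line["text"]
            boxes[-1]["y0"] = line["y0"]
        else:
            boxes.append(dict(line))
    return boxes
```

A two-line title in the same font would thus be merged into one box, while an abstract in a smaller font below it would start a new box.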
The layout boxes of each metadata field were extracted from the PDF file of each paper through the above method, as shown in Fig. 2. Each layout box contained eight values of information, including (7) the height of the box (y1 − y0) and (8) the font size of the text. In this study, among the eight values, (2) to (8) were defined as the coordinate information of the layout box.
We performed data preprocessing on the original PDF files of the 564,153 papers from 504 journals selected as per Section 3-A-1). During this process, layout boxes could not be extracted from 248,833 papers owing to problems such as image-only PDF files, PDF files with no text box, completely blank text boxes, or fully corrupted text. Ultimately, we extracted 5,648,620 layout boxes from 315,320 PDF papers belonging to 503 journals through the data preprocessing. Thus, the average number of extracted layout boxes recognized as metadata layouts on the first page of a paper was approximately 18.

3) AUTOMATIC LABELING OF METADATA
In this paper, we selected 12 metadata fields that are mainly required for searching papers and linking information in ScienceON services as the target labels for data construction. In total, there were 13 labels for the metadata fields, including the ''Out of boundary'' field, which indicates that the data do not belong to any of the 12 selected metadata fields. The labels for the metadata fields are defined as indicated in Table 1. In this process, each layout box extracted through the preprocessing was automatically tagged with one of the 13 metadata labels. The automatic labeling method compares the value (ground truth) in the metadata DB, which was already built manually, with the text (tagging target) of each layout box. A separate rule-based algorithm was used for each metadata field, and the data were labeled by calculating the text match rate. In our work, the match rate was based on the Levenshtein distance [31] and was calculated using the Python fuzzywuzzy library. If the calculated match rate between the ground truth and the tagging target for a metadata field was 90% (or 80%) or higher, the data were labeled with the corresponding metadata. The reason for labeling through partial matching rather than exact matching is that the values in the metadata DB were deliberately processed by humans. In other words, the text that appears in the PDF papers is not stored as-is but is intentionally refined by humans. For example, the English journal name in a paper is often written as an abbreviation. In this case, the text extracted from the PDF paper is the abbreviation, whereas the metadata value built by humans is the journal's full name. Here, the layout box should be tagged as ''journal'', but with exact matching, the label would not be tagged.
Therefore, through repeated trial and error, we set an appropriate match-rate threshold for each metadata field and developed an algorithm that tags a label when the match rate meets or exceeds the threshold. The thresholds for the title, author name, author affiliation, keywords, and journal name were set to 90%, and the threshold for the abstract was set to 80%. Exceptionally, in the case of the DOI, there is no threshold: if the ground truth is contained in the tagging target, the box is labeled.
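The thresholded labeling rule can be sketched as below. Note this is a minimal illustration, not the paper's actual algorithm: the paper uses fuzzywuzzy's Levenshtein-based ratio, while this sketch substitutes the standard library's `difflib.SequenceMatcher` ratio as a stand-in, and the field names and the `label_box` helper are hypothetical.

```python
from difflib import SequenceMatcher

# Per-field thresholds from the paper; the DOI is handled by containment.
THRESHOLDS = {
    "title": 90, "author_name": 90, "author_affiliation": 90,
    "keywords": 90, "journal": 90, "abstract": 80,
}

def match_rate(a, b):
    """Text match rate in percent (stdlib stand-in for the
    Levenshtein-based ratio used in the actual work)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def label_box(box_text, ground_truth):
    """Return the first metadata label whose DB value matches the layout
    box text at or above the field threshold, else 'O' (out of boundary)."""
    for field, value in ground_truth.items():
        if field == "doi":
            if value in box_text:          # containment rule, no threshold
                return "doi"
        elif match_rate(box_text, value) >= THRESHOLDS.get(field, 90):
            return field
    return "O"
```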
Prior to calculating the text match rate for automatic metadata labeling, we performed the following preprocessing on the texts of the tagging targets and ground truths so that an accurate comparison could be made. All English alphabet characters were changed to lowercase, and special characters and spaces were removed. Furthermore, character ID forms (e.g., (cid:25) and (cid:18702)), <TEX></TEX> formula tags, and regular-expression patterns for LaTeX symbols were removed. For metadata such as the ''abstract'' and ''keywords'', the layout boxes often contained phrases indicating the abstract and keyword areas (e.g., ''Abstract'', ''Summary'', ''Outline'', ''Keywords'', ''Key terms'', and ''Key phrases''). If these phrases were included in the extracted tagging targets, label tagging was not possible because the match rate with the ground truths would be low. Therefore, these phrases were compiled into a dictionary so that they could be removed prior to the text comparison.
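These normalization steps can be sketched as follows. This is a simplified illustration under stated assumptions: the phrase dictionary here lists only the English examples given above (the actual dictionary would also cover Korean terms), and the LaTeX-symbol patterns are omitted for brevity.

```python
import re

# Header phrases preceding abstract/keyword areas (illustrative subset;
# the actual dictionary in this work is larger and includes Korean terms).
HEADER_PHRASES = ["abstract", "summary", "outline",
                  "keywords", "key terms", "key phrases"]

def normalize(text):
    """Normalize text before computing the match rate: lowercase,
    strip (cid:...) character IDs and <TEX>...</TEX> formula tags,
    drop known header phrases, then remove special characters and spaces."""
    text = text.lower()
    text = re.sub(r"\(cid:\d+\)", "", text)               # e.g. (cid:25)
    text = re.sub(r"<tex>.*?</tex>", "", text, flags=re.S)
    for phrase in HEADER_PHRASES:
        text = text.replace(phrase, "", 1)
    # Keep only digits, Latin letters, and Hangul syllables.
    return re.sub(r"[^0-9a-z가-힣]", "", text)
```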
In summary, metadata labels were automatically tagged for each of the 5,648,620 layout boxes from 315,320 papers belonging to 503 journals using this methodology. All data, labeled at the layout-box level, were constructed in a format including the label and the eight features of each box (see Fig. 2).

B. INSPECTION OF AUTOMATICALLY CONSTRUCTED DATA
1) DEFINITION OF GOAL
Owing to the diversity of journal formats and writing notations of papers, the data preprocessing and automatic labeling processes yielded many errors. Moreover, the metadata DB used as ground truth contained several errors. Therefore, the automatically constructed data contained many tagging errors, which we attempted to correct through manual inspection. Because limited time and resources made it difficult to fully inspect all of the labeled data, only part of the data was inspected in this work. Since validation and testing data must be accurate to be used for evaluation, we composed the inspected part of the data as the validation and testing datasets. Therefore, we set the goal of inspecting about 20,000 papers, using the inspected data as validation and testing data and the remaining approximately 290,000 papers as training data.
As the validation and testing data had to be constructed evenly across all 503 journals, a goal number of papers was set for each journal according to the ratio of the number of papers per journal. Among the automatically constructed 315,320 papers from 503 journals, the highest and lowest numbers of papers per journal were 10,302 and 5, respectively. The ratio of the number of papers per journal to the total number of papers was calculated for each journal, and the goal numbers of validation and testing papers per journal were obtained by applying this ratio.
Calculated goal numbers with a fractional part were rounded up to integers. If the calculated goal number was between 0 and 1, it was adjusted to 2 so that one paper each could be used for the validation and testing data. Finally, the total goal number for the validation and testing data was determined as 20,014 papers from 503 journals.
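The quota calculation described above can be sketched as follows. This is an illustrative reconstruction: the function name and the `total_goal` parameter are hypothetical (the paper states a goal of about 20,000 papers, with 20,014 as the final total after rounding and adjustment).

```python
import math

def goal_counts(papers_per_journal, total_goal=20000):
    """Compute per-journal validation/testing quotas in proportion to each
    journal's share of papers; round fractional quotas up, and raise any
    quota below 2 to 2 so every journal contributes at least one
    validation and one testing paper."""
    total = sum(papers_per_journal.values())
    goals = {}
    for journal, n in papers_per_journal.items():
        g = math.ceil(total_goal * n / total)
        goals[journal] = max(g, 2)
    return goals
```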

2) DESIGN OF INSPECTION PROCESS
In this study, we aimed to acquire 20,014 papers from 503 journals that were free of tagging errors by manually inspecting the papers selected as inspection targets. Before conducting the inspection, we designed an inspection plan and process for constructing highly reliable data.
Based on the expected duration derived from preliminary inspection work, we planned to perform the inspection using six experts as annotators for approximately 60 days. Furthermore, to achieve high-quality goal data, the six annotators were divided into groups of two who inspected the same data, and only data for which both annotators derived the same inspection results were accepted as goal data.
The inspection process is shown in Fig. 3. We divided the inspection work into Task 1 and Task 2 and acquired the data that passed both tasks as goal data. In this section, the unit of data is a layout box. Task 1 verifies that each layout box is extracted cleanly, so that it can be accurately categorized into exactly one of the 12 metadata fields without overlap. In Task 1, if two or more metadata fields existed in one layout box, or if one metadata field was divided across two or more layout boxes, annotators excluded those layout boxes. Excluded layout boxes did not proceed to Task 2 and were considered excluded data. That is, among all the layout boxes extracted from one paper, only those that passed Task 1 proceeded to Task 2.
Task 2 consisted of determining whether the automatically tagged label was correct for every layout box in each paper; if any incorrect label existed, annotators corrected it. We acquired the data corrected in Task 2 as the first inspected data. Then, managers performed an additional inspection on the first inspected data using the same method, and through this process, we acquired the second inspected data as the final goal data. As a result, the papers that passed all the inspection processes consisted only of layout boxes that were cleanly extracted during data preprocessing and accurately labeled.

3) ANNOTATION GUIDELINES
In this study, we developed annotation guidelines that inform the annotators of the structure of the data inspection work file, the purpose of the inspection, and the inspection method, so that consistent inspection improves the quality of the data. The developed annotation guidelines were distributed to the annotators before they performed the inspection work. We regularly updated the guidelines with every exception case discovered during the inspection work, through communication between the manager and annotators. The annotation guidelines are summarized as follows:
• The annotators perform the inspection work on an Excel file with the structure shown in Fig. 4.
• Each cell of column A contains the URL at which the original PDF file of one paper can be viewed. The annotators click the URL in column A of the work file and open the original PDF file of the paper for inspection. The annotators inspect the extracted layout boxes and labels by simultaneously checking the PDF file and the work file of each paper.
• The annotators fill the cell of column B with ''0'' if all layout boxes of a paper should be ''excluded'' and deleted in the Task 1 inspection. The cell is filled with ''1'' to ''pass'' the paper if at least one layout box is correctly extracted.
• The layout boxes extracted from one paper are listed in the cells of column C. For one paper, which occupies one merged cell in column A, the text values of multiple layout boxes exist (approximately 18 per paper on average). Any of the 12 metadata fields may or may not exist in a given paper. Therefore, while viewing the original PDF file of each paper, annotators check only the metadata fields that exist in that paper, and they check only the first page of the PDF.
• The cells of column E should be filled with ''x'' when the layout box must be ''excluded'' in Task 1, which checks whether two or more metadata fields appear in one box or one metadata field is divided across two or more layout boxes. That is, an ''x'' is entered in column E if the layout box is not cleanly separated into one box per metadata field. For example, in Fig. 4, the second layout box should be deleted because two metadata fields, the journal name and the DOI, appear in one layout box.
• The automatically tagged labels for each layout box are listed in the cells of column D. These are the inspection targets for Task 2, which checks whether the labeling was performed correctly. The annotators inspect the label of each box that has not been marked with an ''x'' in column E; that is, they do not perform Task 2 for boxes marked with an ''x'' in column E.
• The annotators should enter the corrected label in the cells of column F if the value of column D is incorrect. That is, the cell of column F is left empty if the value of column D is already correct. For example, in Fig. 4, the sixth layout box corresponds to the author name in English, but the label is ''O''. In this case, annotators should correct the label to ''author_name_en''.
• The cells of column G are check items for exception cases. If two or more layout boxes should be labeled as the same metadata field, a priority must be set among them: ''1'' is entered for the box with the highest priority and ''2'' for boxes with lower priority.

C. AUTOMATIC METADATA EXTRACTION MODELS
In this study, we also proposed two deep learning based models for automatic metadata extraction from PDF papers. The proposed models perform a text classification task, taking each layout box as input and predicting its metadata label. We developed the classification models through transfer learning based on KorSciBERT. Our study is the first to develop a model for automatically extracting metadata from Korean papers using KorSciBERT. The first model is the KorSciBERT-based automatic metadata extraction model with a unique code containing journal format information (KorSciBERT-ME-J), which is trained using only text-type inputs. The second model is the KorSciBERT-based metadata extraction model with the unique code of journal format information and coordinate information (KorSciBERT-ME-J+C), which is additionally trained with numeric inputs, namely the coordinate information of each layout box.

1) KORSCIBERT-ME-J MODEL
The KorSciBERT-ME-J model is an intuitive method that trains additional features, which can improve metadata prediction performance, using auxiliary sentences. In this paper, we generated a unique code for each layout box and used it as the auxiliary sentence. Therefore, the input sequence was composed of two different sentences: the text sentence of the input layout box and its unique code. Among the eight values of the layout boxes (see Fig. 2), only the text sentence was used. The architecture of KorSciBERT-ME-J is presented in Fig. 5.
The way we generate the unique code is the distinctive feature of our proposed methods. We assumed that the journal format of each paper would have a significant impact on automatic metadata extraction. Therefore, we expected that encoding the journal format features of each paper into a unique code and using it as an auxiliary sentence during training would improve metadata extraction performance. Academic papers are written in a variety of formats across journals, and even within the same journal, the format may differ depending on the publication year, volume (Vol.), and issue (No.). If the paper format differs, the number of layout boxes, the order in which each metadata field is extracted, and the patterns of formal phrases in the content all differ. Based on these points, we sought to reflect the format features when training the metadata extraction models.
The unique code of each layout box was a combination of the journal code, year of publication, volume, issue, starting page, and sequence number of each layout box extracted from the paper, with the symbol '_' delimiting the pieces of information. First, the journal code consists of six characters (e.g., CPTSCQ); it is the management code for each journal standardized by KISTI and is assigned to all journals. Second, the volume and issue were indicated by a pattern with 'v' in front of the volume number and 'n' in front of the issue number, respectively. Third, the starting page is the page on which each paper starts within the same publication year, volume, and issue of the same journal, and was indicated by the page number itself. Finally, the extraction sequence number of the layout box was generated by setting the first extracted box to 0 and increasing the number sequentially by 1. For example, the unique code of the first layout box of a PDF paper from the ''Journal of the Korea Society of Computer and Information, 2021, vol. 26, no. 7, pp. 10-18'' was ''CPTSCQ_2021_v26n7_10_0'' (the journal code for the ''Journal of the Korea Society of Computer and Information'' is ''CPTSCQ''). The key concept of the proposed KorSciBERT-ME-J model is to use this unique code, which reflects the journal format features, together with the input text of each layout box in the training process.
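To make the code scheme concrete, the following sketch reproduces the unique-code construction described above; the function name `make_unique_code` is ours for illustration, not from the paper's implementation.

```python
def make_unique_code(journal_code: str, year: int, volume: int,
                     issue: int, start_page: int, box_index: int) -> str:
    """Build the layout-box unique code: journal code, publication year,
    volume/issue (prefixed 'v'/'n'), starting page, and the 0-based
    extraction order of the layout box, joined by '_'."""
    return f"{journal_code}_{year}_v{volume}n{issue}_{start_page}_{box_index}"

# First layout box of a paper in the Journal of the Korea Society of
# Computer and Information, 2021, vol. 26, no. 7, pp. 10-18:
code = make_unique_code("CPTSCQ", 2021, 26, 7, 10, 0)
# → "CPTSCQ_2021_v26n7_10_0"
```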
When the text of an input layout box was composed of n tokens, the input tokens were defined as s_1, s_2, ..., s_i, ..., s_n (1 ≤ i ≤ n); when the unique code generated for the layout box was composed of m tokens, its tokens were defined as u_1, u_2, ..., u_j, ..., u_m (1 ≤ j ≤ m). The [CLS] token was used as the starting token of a sequence and [SEP] was used as the delimiting token that separates sentences in the sequence. As shown in Fig. 5, an input sequence of the KorSciBERT-ME-J model was composed of two sentences in which the input text sentence s and the unique code u of each layout box were delimited by [SEP]. The input sequence was embedded in the input layer using the tokenizer of KorSciBERT, which was developed by combining WordPiece [32] and Mecab-Ko (https://bitbucket.org/eunjeon/mecab-ko). The hidden state vector H_[CLS] was calculated by the KorSciBERT encoder from the embedded input representation. This hidden state vector was passed through a fully connected layer with a Softmax classifier to compute the probability of the metadata label r:

h = W · H_[CLS] + b,  p(r) = softmax(h),

where W is the weight of the output layer, b is the bias, and the metadata label r is one of the 13 metadata fields, including ''Out of boundary.'' The classification loss was calculated using log(softmax(h)).
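As a rough illustration of the two-sentence input described above, the token sequence and segment ids could be composed as follows; the helper `build_input_sequence` and its truncation behavior are our assumptions, not the paper's code.

```python
def build_input_sequence(text_tokens, code_tokens, max_len=512):
    """Compose the two-sentence BERT-style input for KorSciBERT-ME-J:
    [CLS] text-sentence [SEP] unique-code [SEP], with segment id 0 for
    the text sentence and 1 for the unique-code auxiliary sentence."""
    tokens = ["[CLS]"] + list(text_tokens) + ["[SEP]"] \
             + list(code_tokens) + ["[SEP]"]
    segment_ids = [0] * (len(text_tokens) + 2) + [1] * (len(code_tokens) + 1)
    # Truncate to the model's maximum sequence length (512 in our setup).
    return tokens[:max_len], segment_ids[:max_len]
```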

2) KORSCIBERT-ME-J+C MODEL
Furthermore, we developed the KorSciBERT-ME-J+C model with a structure that additionally trains on the coordinate information of each layout box. All eight values of the layout boxes (see Fig. 2) were used. The KorSciBERT-ME-J+C model had the same input sequence, input layer, and KorSciBERT encoder as the KorSciBERT-ME-J model, but a different structure in the final output layer (see Fig. 6). The coordinate information vector of each layout box was defined as c; it included seven values (x_0, y_0, x_1, y_1, width, height, and font size) that were extracted during data preprocessing (see Fig. 2). In this model, a vector that concatenated c to the hidden state vector H_[CLS] from the KorSciBERT encoder was input into the fully connected layer with a Softmax classifier to predict the final metadata label r:

h = W · [H_[CLS]; c] + b,  p(r) = softmax(h),

where W is the weight of the output layer, b is the bias, and the metadata label r is the label for the 13 metadata fields. The classification loss was calculated using log(softmax(h)).
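The concatenation head above can be sketched in plain Python; the function name and the explicit loop form are ours for illustration, under the assumption that the output layer is a single affine map over the concatenated vector followed by a softmax.

```python
import math

def predict_label_with_coords(h_cls, coords, W, b):
    """Concatenate the [CLS] hidden state with the 7-dim coordinate
    vector (x0, y0, x1, y1, width, height, font size), apply the
    output layer h = W·[H_CLS; c] + b, and take the softmax argmax
    over the 13 metadata labels."""
    h = list(h_cls) + list(coords)
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, h)) + b_k
              for row, b_k in zip(W, b)]
    # Numerically stable softmax, then pick the most probable label.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(probs)), key=probs.__getitem__)
```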

IV. EXPERIMENTS
We conducted experiments from both the data and model perspectives. The first experiment measures the reliability of the data inspected in this study. The second measures the performance of the automatic metadata extraction models trained on our corpus; through this experiment, we tried to verify both the reliability of our data and the performance of our proposed models.
In this study, we automatically constructed an annotated corpus of 315,320 papers belonging to 503 journals for metadata extraction. After inspecting 53,699 papers, we finally constructed an inspected dataset consisting of 44,403 papers from 503 journals. Six annotators and one manager participated in the inspection work, which was performed over a total of 59 working days from July 5 to September 29, 2021. The annotators were divided into groups of two who inspected the same data, and the data for which both annotators produced the same inspection results were obtained as the final goal data. As the inspection results of two annotators could differ during the inspection process, the results were classified into two categories. Category A corresponded to the case where both annotators modified the labels identically in Task 2 for all layout boxes that both had passed in Task 1 for a given paper. This was the inspection result that we aimed to secure as the final goal data. Category B corresponded to the case where the two annotators modified the labels of some layout boxes differently in Task 2. The manager then inspected all Category A data from the first inspection, and these second-inspected data were acquired as the final goal data.
Consequently, the inspection result statistics of the papers for which the second inspections were completed over the 59 days are shown in Table 2. A corpus of 44,403 papers free of tagging errors was acquired using the Category A data, which were reliable because the inspection results of the two annotators were completely consistent. We thereby exceeded the goal of 20,014 papers belonging to 503 journals that was set when designing the inspection process. From the 44,403 papers, we randomly selected the goal number of papers for each of the 503 journals. The resulting 20,014 papers were divided into validation and testing data at a ratio of 1:1 for the metadata extraction model. The validation and testing datasets comprised 9,895 and 10,119 papers, respectively, so as to evenly cover all 503 journals. Accordingly, among all of the automatically constructed data, the 295,306 papers belonging to 503 journals that were not in the validation and testing datasets were used as the training dataset. In total, we constructed annotated data of 5,557,300 layout boxes from 315,320 papers. The final dataset is shown in Table 3.

2) INTER-ANNOTATOR AGREEMENT EVALUATION
In this paper, the data inspection process was designed such that two annotators formed one group and inspected the same data. If the annotators had inspected different data, it would not have been possible to evaluate the reliability of the inspected data and the annotation process. Agreement between annotators on the same data measures the consistency, or reproducibility, of the annotation process, and inter-annotator agreement evaluation is part of an iterative method for developing reliable annotation schemes [33]. Thus, to verify the inspected data and annotation process, we measured the inter-annotator agreement using the commonly used Cohen's kappa coefficient [34]. The kappa coefficient takes a value between -1 and 1: ≤ 0 indicates a ''poor match'', 0.00 to 0.20 a ''slight match'', 0.21 to 0.40 a ''fair match'', 0.41 to 0.60 a ''moderate match'', 0.61 to 0.80 a ''substantial match'', and 0.81 to 1.00 an ''almost perfect match'' [35]. The kappa coefficient is calculated as follows:

κ = (p_o − p_e) / (1 − p_e),

where p_o is the proportion of observed agreement, that is, the observed accuracy, and p_e is the proportion of agreement expected by chance, computed from the marginal frequencies. In our work, the kappa coefficient was measured for the finally inspected data over the whole inspection process, including Tasks 1 and 2. It was measured as the agreement of the Task 2 results for the layout boxes that the two annotators in each of Group 1, Group 2, and Group 3 equally passed in Task 1. Since the inspection results of Task 2 were layout boxes tagged with one of 13 metadata fields, we calculated the kappa coefficients for multiple classes. Table 4 presents the kappa coefficient measurement results for the three groups. The kappa coefficients of Groups 1, 2, and 3 were 0.9857, 0.9761, and 0.9834, respectively, showing an ''almost perfect match''.
Moreover, the average kappa coefficient across all groups was 0.9817, also an ''almost perfect match''. These results show that the inter-annotator agreement was at the highest level, ''almost perfect match'', indicating that the inspection process, annotation guidelines, and inspected data developed in our work are reliable.
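The kappa computation described above can be sketched as follows; this is a minimal reference implementation of Cohen's formula for two annotators over the same items, not the paper's evaluation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the chance agreement computed from each
    annotator's marginal label frequencies."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```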

B. MODEL EVALUATION
We conducted experiments to train and evaluate the metadata extraction models to verify the reliability and usability of our dataset. In addition, we performed experiments to evaluate the automatic metadata extraction performance of our proposed models, KorSciBERT-ME-J and KorSciBERT-ME-J+C. In these experiments, we used the training, validation, and testing datasets constructed in our study (Table 3).

1) BASELINE
The automatic metadata extraction model using our data performs a classification task that predicts metadata fields for layout boxes extracted from PDF papers. In this experiment, we applied a transfer-learning model based on BERT [13], which is widely used as a baseline for text classification tasks, as the metadata extraction model. Since it is a metadata extraction model for Korean papers, the BERT-Base Multilingual Cased [13] version of the PLM was used. In this section, we simply refer to this BERT-based metadata extraction model as ''BERT-ME''. We set the BERT-ME model as the baseline against which to compare the two models proposed in this paper.

2) HYPERPARAMETERS AND EXPERIMENTAL SETUP
We selected the optimal hyperparameters through a grid search over maximum sequence lengths {384, 512}, batch sizes {16, 32}, learning rates {3e-5, 5e-5}, and epochs {3, 4}. Based on experiments with the validation set, we set the maximum sequence length to 512, the batch size to 16, the learning rate to 3e-5, and the number of epochs to 3 as the optimal hyperparameters. In addition, we set the dropout to 0.1 in the final output layer when fine-tuning the models, and applied the Adam optimizer. A system with a 48-core Intel(R) Xeon(R) Gold 6226 CPU, 128 GB RAM, and two Nvidia Tesla V100 32 GB GPUs was used in this experiment.
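The grid search can be sketched as an exhaustive sweep over the hyperparameter sets listed above; `train_and_validate` is a hypothetical placeholder for fine-tuning a model and returning its validation F1 score, not an API from the paper.

```python
from itertools import product

# Hyperparameter grid reported in the paper.
GRID = {
    "max_seq_len": [384, 512],
    "batch_size": [16, 32],
    "learning_rate": [3e-5, 5e-5],
    "epochs": [3, 4],
}

def best_config(train_and_validate, grid=GRID):
    """Exhaustively evaluate every combination in the grid and return
    the configuration with the highest validation score."""
    keys = list(grid)
    best, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = train_and_validate(config)  # e.g., validation F1
        if score > best_score:
            best, best_score = config, score
    return best, best_score
```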

3) EXPERIMENTAL RESULTS AND ANALYSIS
In this study, the metadata extraction models were trained with the optimal hyperparameters, and the results of evaluating the metadata classification performance on the testing dataset are shown in Table 5. We used the macro-averaged precision, recall, F1, and accuracy as evaluation metrics. The BERT-ME model, the baseline in our experiment, showed high performance with an F1 score of 98.86%. Even simply fine-tuning the PLM yielded high performance, and through this result we confirmed that our automatically constructed data have high reliability and quality. The models proposed in this paper, KorSciBERT-ME-J and KorSciBERT-ME-J+C, showed F1 scores of 99.36% and 99.09%, respectively. KorSciBERT-ME-J showed the highest performance, 0.5%p higher than the baseline and 0.27%p higher than KorSciBERT-ME-J+C in F1 score. Moreover, both of our models outperformed the baseline on every evaluation metric. Through this result, we confirmed that our training methodology contributes to performance improvement and that, for automatic metadata extraction from Korean papers, a model based on KorSciBERT, the PLM pretrained on a corpus of Korean papers and patents, performs better. In addition, although the automatically constructed training dataset contained some tagging errors because it was not inspected, the overall performance of the models trained on it was high. This result indicates that our automatic labeling methodology was also reliable.
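For reference, the macro-averaged F1 metric used above can be computed as follows; this is a generic sketch of the standard metric, not the paper's evaluation script.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: compute per-label precision, recall, and F1,
    then average the F1 scores with equal weight per label."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```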
We also measured the F1 performance for each metadata field (Table 6). The number of layout boxes for each metadata field is indicated in the ''Count'' column of Table 6, and the total of ''Count'' equals the 159,925 layout boxes extracted from the 10,119 papers of the testing dataset. The KorSciBERT-ME-J model achieved the highest performance for every metadata field except ''Author affiliation (in Korean)'', for which it was slightly lower than KorSciBERT-ME-J+C. In addition, the F1 performance was similarly high across all metadata fields.
According to these results, the KorSciBERT-ME-J model, which uses only the unique code as an auxiliary input, outperforms the KorSciBERT-ME-J+C model, which uses both the unique code and coordinate information for training. Accordingly, we conducted additional experiments comparing performance according to whether the unique code and coordinate information were used to train the models. For this comparative experiment, we trained two additional models. First, we developed a KorSciBERT-based simple metadata classification model in which only the text of each layout box is used as input; compared with KorSciBERT-ME-J, the difference is that it does not use the unique code for training. This model is simply referred to as ''KorSciBERT-ME''. Second, we developed another KorSciBERT-based model that uses the coordinate information and the text of each layout box as input during training; compared with KorSciBERT-ME-J+C, the difference is that the unique code is not used as an auxiliary input. This model is simply referred to as ''KorSciBERT-ME+C''. The results of comparing these models are shown in Table 7. KorSciBERT-ME-J outperformed the other models, achieving the highest performance on all evaluation metrics. KorSciBERT-ME, in which only the text of the layout boxes was input, showed the lowest performance with an F1 score of 99.05%. KorSciBERT-ME+C had an F1 score of 99.06%, almost identical to KorSciBERT-ME but slightly higher. Through this experiment, we confirmed that our proposed method of using a unique code that encodes the journal format features to train the metadata extraction model has a significant effect on performance improvement.
We had also expected that performance would improve as more features were used as input to train the model; however, the model trained with additional coordinate information did not improve performance. Therefore, for metadata extraction models based on language models, we confirmed that using textual features (e.g., the unique code) as additional input contributes to performance improvement, whereas using numerical features (e.g., coordinate information) does not.

V. CONCLUSION AND FUTURE WORK
In this paper, we constructed a labeled corpus for training and evaluating models that automatically extract metadata from PDF files of Korean academic papers. We constructed data for 315,320 papers belonging to 503 journals, covering almost all of the Korean journals collected by KISTI in South Korea. This corpus is divided into training, validation, and testing datasets. The training dataset consisted of 5,241,746 layout boxes extracted from 295,306 papers, the validation dataset of 155,629 layout boxes extracted from 9,895 papers, and the testing dataset of 159,925 layout boxes extracted from 10,119 papers. All layout boxes were automatically tagged with one of the 13 metadata fields through our automatic construction method. In addition, we developed an inspection process and annotation guidelines to correct tagging errors in the automatically labeled data. Through this inspection process, we thoroughly inspected the entire validation and testing datasets to make them free of tagging errors. Furthermore, we conducted an inter-annotator agreement evaluation on the inspected datasets; the agreement was measured as an almost perfect match, verifying the reliability of our inspection process, annotation guidelines, and inspected datasets. The final dataset has been released at http://doi.org/10.23057/48, allowing many researchers to use it for research and development. It is the largest gold standard dataset for metadata classification of Korean papers in South Korea to date. By disclosing our Korean corpus, we hope to help solve the problem of scarce Korean datasets, which is an obstacle to the development of Korean NLP technology.
We trained and evaluated automatic metadata extraction models using our dataset. In our experiment, the baseline model showed excellent metadata prediction performance, confirming that our datasets are reliable and useful and can serve as a gold standard for extracting metadata from Korean papers. Moreover, two metadata extraction models for Korean papers were proposed in this study: KorSciBERT-ME-J and KorSciBERT-ME-J+C. In the performance evaluation, KorSciBERT-ME-J showed the highest performance with an F1 score of 99.36%. Recently, we developed an Application Programming Interface (API) system based on KorSciBERT-ME-J and applied it to the metadata DB construction process that is currently performed manually at KISTI. However, our corpus and the proposed models have some limitations. Since they are specialized for Korean academic papers, it is difficult to extract metadata from foreign journals whose formats differ completely from those of Korean journals. In addition, because the dataset was constructed from the results of the data preprocessing step that extracts the metadata layouts from the PDF file, it is inevitably dependent on the preprocessing results. In other words, if the PDF is an image-type PDF, if PDFMiner, the Python library that extracts text objects from PDF files, has errors of its own, or if the PDF file itself contains errors such as text encoding errors, the metadata layout boxes may be extracted incorrectly.
In the future, we will study a methodology that can extract the desired metadata from documents regardless of format, covering R&D reports and patents as well as a wider variety of foreign academic journals, not limited to the 503 Korean academic journals. To this end, we will develop a corpus and model through an image-based approach that extracts information by recognizing documents as images, so that there are no limitations on file formats such as PDF.
HYESOO KONG received the B.S. and M.S. degrees in information and industrial engineering from Yonsei University, Seoul, South Korea, in 2016 and 2018, respectively. She has been working with the Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea, since 2019. Her research interests include deep learning and machine learning-based natural language processing, text mining, and data analysis.