Extracting and Analyzing Inorganic Material Synthesis Procedures in the Literature

Materials informatics requires large-scale collection and analysis of the material synthesis procedures described in the literature for designing materials using computational methods. However, existing studies have not analyzed the procedures at the paragraph level. Moreover, since most synthesis procedures are described in natural language in articles and technical documents, it is necessary to structure them, through information extraction, into a format that can be handled by computers. Therefore, in this study, we construct a pipeline system that extracts synthesis procedures from text in the form of a flow graph and analyzes each procedure as a flow graph rather than as a set of isolated processes. The extraction system extracts entities with a deep learning model and relations between entities with a rule-based extractor from all paragraphs in the literature, and it selects procedures that include valid structures of entities and relations. Our evaluation on a benchmark dataset gave micro-averaged F-scores of 0.807, 0.830, and 0.609 for the entity extractor, relation extractor, and pipeline extractor, respectively. We applied this system to a large body of literature and extracted approximately 90,000 flow graphs (procedures) containing approximately 4 million entities and 3 million relations. We performed several analyses, including computing statistics over the extracted graphs and mining their frequent subgraphs. Commonly used methods in materials science were confirmed by our analyses; for example, ethanol is often dried by heating at 60 °C, and less-reactive noble gases are rarely included in the products. As a result, we experimentally confirmed that the extracted procedures were reasonable.


I. INTRODUCTION
Materials informatics has been attracting attention for the development of new materials by analyzing existing material properties and synthesis procedures for computational materials design. Materials informatics contributes to the reduction of development costs by reducing the number of experiments on real materials through computational tests and analyses. To enable such an approach, it is necessary to statistically analyze a vast amount of information.
One of the major challenges in computational experimental design is the collection of structured data on material names and synthesis procedures [2]–[6]. However, it is difficult to create a recipe database for materials informatics from real tests conducted in a laboratory because expert knowledge is often implicit and not well documented. Developing actual procedures for materials takes a huge amount of time, on the order of years, through repeated tests by experts based on experience and intuition.
Several studies have attempted to extract materials synthesis procedures from the literature using natural language processing [7]–[13]. Mysore et al. created a corpus for extracting synthesis procedures from the literature in the field of inorganic materials [12]. We previously created a corpus for a specialized field of inorganic materials and studied extraction methods [7]. Mysore et al. also attempted to extract synthesis procedures as flow graphs using a deep learning method [13]. Despite these recent proposals of methods for extracting procedures from the scientific literature, there have been few studies analyzing the materials synthesis procedures extracted from actual papers.
Figure 1. Overview of the extraction pipeline system and illustration of a material synthesis procedure extracted from a paragraph in Zhang et al. [1].
In this study, we analyzed the procedures for materials synthesis extracted from large-scale literature databases by a system that extracts the procedures as flow graphs, and we confirmed the characteristics of the procedures extracted by the system. We built a pipeline extraction system based on the synthesis procedure extraction method of Kuniyoshi et al. [7] and applied this system to a large set of literature to examine the applicability of the extracted procedures. We analyzed the materials synthesis procedures as complete processes, rather than as single units of operation, by collecting statistics over the flow graphs.

II. METHODS
We constructed a pipeline system that extracts synthesis procedures in graph form from raw literature data based on the method of Kuniyoshi et al. [7], as shown in Figure 1. First, the pipeline preprocesses the collected raw data into a text format that can be input into subsequent models and extracts all paragraphs. Then, the system applies neural entity extraction and rule-based relation extraction to all paragraphs. Finally, the procedures are selected from the paragraphs based on simple rules that exclude fragmented procedures, such as those lacking a product.

A. MATERIAL SYNTHESIS PROCEDURES
A synthesis procedure is defined as a flow graph, following the previous study of Mysore et al. [12]. In the flow graph, each node is a synthesis operation or material, and edges connect nodes to represent relations between entities, including the order of operations and the input of materials into operations.
The flow graph is defined in the materials science procedural text corpus to structure the flow of a procedure [12]. In this corpus, synthesis procedures are annotated on the inorganic materials literature, and mentions of entities and the relations between them are directly annotated in the documents. The data consist of 200 documents for training and 15 each for development and evaluation.
In addition, 19 entity labels and 16 relation labels are defined in the aforementioned corpus. An entity is defined as an element in a procedure, which can be categorized into several types, such as an operation or a material. In turn, a relation is defined as a relationship between entities that describes, for example, the order of operations or the input of a material to an operation. The statistics and descriptions of these labels are shown in Tables 1 and 2.

B. SYNTHESIS PROCEDURE EXTRACTION PIPELINE
We built a pipeline that enables the extraction of synthesis procedures from raw data. It can be divided into four parts. First, all paragraphs are extracted from the raw literature data during preprocessing. Second, entities are extracted from the paragraphs using an entity extractor. A bidirectional long short-term memory [14] and conditional random field [15] (Bi-LSTM-CRF) model [16] is used to predict entities using the representation of the pretrained model Mat-ELMo [17], which was trained on literature in the field of inorganic materials science. Third, the synthesis procedure is extracted based on the entities using a rule-based relation extractor. The entity and relation extractors are based on the method of Kuniyoshi et al. [7], which effectively extracted synthesis procedures from the literature. Finally, the paragraphs are classified as to whether they contain a synthesis procedure based on a simple rule over the extracted entities and relations. Only the paragraphs that contain a synthesis procedure are selected.

1) PREPROCESSING
In the preprocessing stage, all paragraphs in the literature were extracted. Although several deep learning methods have been proposed to select paragraphs that mention material synthesis procedures [12], [13], [18], we decided to use all paragraphs to extract procedures because paragraph classification requires additional paragraph-level annotation and model development.

2) ENTITY EXTRACTOR
Entities were extracted by formulating a sequence labeling task with IOB (inside-outside-beginning) tags using a Bi-LSTM-CRF model [16] with Mat-ELMo [17] for the token embeddings. Mat-ELMo is an ELMo (Embeddings from Language Models) [19] model pretrained on materials literature. Although there are several other methods for entity extraction [20], we adopted this model because a previous survey confirmed that it is effective for extracting synthesis procedures [7].
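To illustrate the IOB formulation, the following sketch decodes a predicted tag sequence back into labeled entity spans. The tag names and data shapes are hypothetical examples, not the exact representation used in our implementation:

```python
# Decode IOB tags into (label, start, end) entity spans.
def iob_to_spans(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if label is not None:
                spans.append((label, start, i))
                label = None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
        elif tag.startswith("I-"):
            if label != tag[2:]:  # treat a stray I- tag as a new span start
                if label is not None:
                    spans.append((label, start, i))
                start, label = i, tag[2:]
    return spans

tags = ["O", "B-Material", "I-Material", "O", "B-Operation"]
print(iob_to_spans(tags))  # [('Material', 1, 3), ('Operation', 4, 5)]
```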
The input text is tokenized, and each token is embedded into a dense vector representation using Mat-ELMo:

e = MatELMo(t),

where t = [t_1, t_2, ..., t_L] is a list of tokens with length L and e = [e_1, e_2, ..., e_L] is a list of embeddings for the tokens. The probabilities of classes for the tokens are obtained from the LSTM:

p = LSTM(e),

where p = [p_1, p_2, ..., p_L] is a list of class probabilities for the tokens. The CRF is applied to p to determine the classes from the probabilities:

c = CRF(p),

where c = [c_1, c_2, ..., c_L] is a list of classes for the tokens in the input text. The log-likelihood of the predicted sequences was maximized to train this model. Target-Material entities were automatically induced by relabeling the Material entities connected by the Recipe_Target relation as Target-Material. This is necessary because it is difficult to develop a simple rule to distinguish the Recipe_Target relation from other relations in Section II-B3, whereas neural methods are suitable for making this distinction. By introducing Target-Material entities, we can extract the Recipe_Target relation with a rule, as described in Section II-B3.
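The relabeling step can be sketched as follows; the entity and relation data shapes here are hypothetical illustrations of the idea, not our actual data format:

```python
# Relabel Material entities that head a Recipe_Target relation as
# Target-Material, so the entity extractor can learn to predict them directly.
def induce_target_material(entities, relations):
    # entities: {entity_id: label}; relations: [(relation_label, head_id, tail_id)]
    target_ids = {h for (lbl, h, t) in relations if lbl == "Recipe_Target"}
    return {eid: ("Target-Material" if eid in target_ids and lbl == "Material" else lbl)
            for eid, lbl in entities.items()}

entities = {1: "Material", 2: "Operation", 3: "Material"}
relations = [("Recipe_Target", 1, 2)]
print(induce_target_material(entities, relations))
# {1: 'Target-Material', 2: 'Operation', 3: 'Material'}
```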

3) RULE-BASED RELATION EXTRACTOR
The relation extractor is a rule-based model used to extract the relation between an entity pair. We adapted the rules of Kuniyoshi et al. [7] to the materials science procedural text corpus; the rules depend on the entity labels of a pair, the distance between the entities, and their order of occurrence. According to the combination of entity labels, the rules are divided into three types: Operation-Operation, Operation-Material, and the remaining relations. In the following description of the rules, the starting point of a relation is called the head, the ending point is called the tail, and an edge is denoted as Head-Tail.

a: OPERATION-OPERATION
The relation Operation-Operation takes only the Next_Operation label, which indicates the progression of operations; e.g., if Operations appear in the order A, B, C, and D, the Next_Operation relations are A → B, B → C, and C → D.
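The chaining described above can be sketched as a one-line rule; the data shapes here are illustrative:

```python
# Connect Operation entities in their order of appearance (Next_Operation rule).
def next_operation_edges(operations):
    # operations: Operation surface forms in order of appearance
    return [(a, b, "Next_Operation") for a, b in zip(operations, operations[1:])]

print(next_operation_edges(["mixed", "stirred", "dried", "calcined"]))
# [('mixed', 'stirred', 'Next_Operation'), ('stirred', 'dried', 'Next_Operation'),
#  ('dried', 'calcined', 'Next_Operation')]
```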
Next_Operation: We assumed that Operations are described in the order of the procedure's progression. Therefore, Operation entities are connected in the order in which they appear.

b: OPERATION-MATERIAL
We classified the Solvent_Material, Atmospheric_Material, and Participant_Material labels based on dictionary matches because the words in such Materials are distinctive. A dictionary was prepared for each label. If a Material matches an entry in a dictionary, the relation connects it to the nearest Operation in the sentence, because these relations take specific Material entities. The dictionaries are listed in Tables 3, 4, and 5. For the Recipe_Target label, as the target Material is extracted as Target-Material by the entity extractor in the previous section, the relation extractor connects the Target-Material to the nearest Operation.
Recipe_Precursor connects all Materials, except Target-Material and those matching the dictionaries of Solvent_Material, Atmospheric_Material, and Participant_Material, to the nearest Operation.
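A minimal sketch of the dictionary match plus nearest-Operation rule follows. The dictionaries are small illustrative stand-ins for those in Tables 3, 4, and 5 (the Participant_Material dictionary is omitted for brevity), and the fallback to Recipe_Precursor ignores the Target-Material check for simplicity:

```python
# Classify a Material by dictionary lookup, then connect it to the nearest
# Operation in the sentence. Entities are (surface, token position) pairs.
SOLVENTS = {"water", "ethanol", "methanol"}        # stand-in for Table 3
ATMOSPHERES = {"argon", "nitrogen", "air"}         # stand-in for Table 4

def material_relation(material, position, operations):
    if material.lower() in SOLVENTS:
        label = "Solvent_Material"
    elif material.lower() in ATMOSPHERES:
        label = "Atmospheric_Material"
    else:
        label = "Recipe_Precursor"  # simplified fallback
    # connect to the Operation with the smallest token distance
    nearest = min(operations, key=lambda op: abs(op[1] - position))
    return (material, nearest[0], label)

ops = [("dissolved", 2), ("stirred", 9)]
print(material_relation("ethanol", 4, ops))  # ('ethanol', 'dissolved', 'Solvent_Material')
```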

c: REMAINING RELATIONS
The remaining nine relation labels are defined between the other pairs of entity labels: Property_Of, which indicates the condition of a material; Condition_Of, which indicates the condition of an operation; Number_Of, which indicates the relationship between a number and a unit; Amount_Of, which indicates a condition of a quantity; Type_Of, which indicates the type of a numerical condition; Brand_Of, which indicates the brand of a material or equipment; Apparatus_Of, which indicates the equipment used in an operation; Apparatus_Attr_Of, which indicates a numerical condition of the equipment; and Descriptor_Of, which indicates other conditions. For these labels, the rules are defined based only on the labels of the head and tail entities and the distance between them. We explain the detailed rules in the remainder of this section.
Property_Of: This relation takes Property-Unit or Property-Misc as the head and Material (including Target-Material) or Nonrecipe-Material as the tail. When Property-Unit is the head, it is connected to the nearest Material in the sentence. When Property-Misc is the head, it is connected to the nearest Material or Nonrecipe-Material in the sentence.
Condition_Of: This relation takes Condition-Unit or Condition-Misc as the head and Operation as the tail. Condition-Unit and Condition-Misc are connected to the nearest Operation in the sentence using this relation.
Number_Of: A Number is connected to the nearest Property-Unit, Condition-Unit, or Apparatus-Unit that appears after the Number in the sentence. We assume that, in most cases, the relation between a Number and its unit matches a pattern of ''(Number) (Unit)'', such as ''100 mL''.
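The ''(Number) (Unit)'' pattern can be sketched with a regular expression; the unit list here is a small hypothetical subset of the units that actually occur:

```python
import re

# Pair each Number with the first Unit that follows it, as assumed by the
# Number_Of rule (e.g. "100 mL", "60 °C", "12 h").
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|mL|mmol|h|g)")

def number_unit_pairs(sentence):
    return PATTERN.findall(sentence)

text = "The mixture was stirred with 100 mL ethanol at 60 °C for 12 h."
print(number_unit_pairs(text))  # [('100', 'mL'), ('60', '°C'), ('12', 'h')]
```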
Amount_Of: This relation connects Amount-Unit and Amount-Misc to the nearest Material or Nonrecipe-Material in the sentence.
Descriptor_Of: When Material-Descriptor is the head, it is connected to the nearest Material or Nonrecipe-Material in the sentence. When Apparatus-Descriptor is the head, Synthesis-Apparatus or Characterization-Apparatus can be a tail, but only the nearest Synthesis-Apparatus in the sentence is connected because Characterization-Apparatus is an apparatus for measuring characteristics and detailed descriptions are rarely provided.
Apparatus_Of: This relation connects Synthesis-Apparatus and Characterization-Apparatus to the nearest Operation, with priority given to Operations that appear before the Apparatus in the sentence.
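The nearest-with-priority heuristic can be sketched as follows (positions and data shapes are illustrative):

```python
# Connect an apparatus to the nearest Operation, preferring Operations that
# appear before the apparatus in the sentence; fall back to the nearest
# Operation overall when none precedes it.
def apparatus_of(apparatus_pos, operations):
    # operations: list of (surface, token position) pairs
    before = [op for op in operations if op[1] < apparatus_pos]
    pool = before if before else operations
    return min(pool, key=lambda op: abs(op[1] - apparatus_pos))[0]

ops = [("heated", 3), ("cooled", 12)]
print(apparatus_of(8, ops))   # 'heated' (nearest preceding Operation)
print(apparatus_of(1, ops))   # 'heated' (no preceding Operation; nearest overall)
```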
Type_Of: Property-Type and Apparatus-Property-Type are connected to the nearest Property-Unit and Apparatus-Unit in the sentence with this relation, respectively. When Condition-Type is the head, it is connected to the nearest Condition-Unit that appears before the Condition-Type in the sentence.
Brand_Of: The relation connects Brand to the nearest entities that may have brands (i.e., Material, Nonrecipe-Material, Synthesis-Apparatus, and Characterization-Apparatus) in the sentence.
Apparatus_Attr_Of: Apparatus-Unit is connected to the nearest Synthesis-Apparatus or Characterization-Apparatus.
Coref_Of: This relation is not detected because it is difficult to describe with rules. Although coreference needs to be taken into account to extract more accurate recipes, we leave the extraction of this relation, for example by combining high-performance extractors such as deep learning models, for future work.

4) SELECTING PROCEDURES
Although only a few paragraphs actually contain procedures, our method attempts to extract procedures from all paragraphs. Therefore, we select the extracted flow graphs that actually contain procedures based on the extracted entities and relations.
We applied a simple approach to selecting the procedures, which avoids the additional annotation that would be required to develop a paragraph classifier for recognizing whether a procedure is present. As the target of this research is to extract procedural sequences, the extracted procedures must include sequences wherein target materials are generated from other materials. Therefore, we selected the flow graphs that contain Recipe_Targets synthesized from Recipe_Precursors through some Operations.
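A simplified sketch of this selection rule is shown below; the exact checks in our implementation may differ in detail, and the graph representation is a hypothetical simplification:

```python
# Keep only flow graphs that contain a Target-Material synthesized from
# Recipe_Precursors via Operations.
def is_procedure(entity_labels, relation_labels):
    return ("Target-Material" in entity_labels
            and "Operation" in entity_labels
            and "Recipe_Target" in relation_labels
            and "Recipe_Precursor" in relation_labels)

graph = (["Material", "Operation", "Target-Material"],
         ["Recipe_Precursor", "Next_Operation", "Recipe_Target"])
print(is_procedure(*graph))  # True
```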

III. EVALUATION
In our experiments, we first evaluated our pipeline to determine whether it could be used in practice using a benchmark dataset. In the evaluation of the pipeline, we separately evaluated entity extraction, relation extraction, and the pipeline of entity and relation extraction.

A. EVALUATION SETTINGS FOR ENTITY AND RELATION EXTRACTORS
We used the publicly available materials science procedural text corpus [12] to train the entity extractor and develop the rules of the relation extractor. The corpus consists of 200 documents for training, 15 for development, and 15 for testing. The evaluation of the entity and relation extractors was performed on the testing portion of the corpus, in which precision, recall, micro-F (overall F-score), and macro-F (averaged F-score over individual classes) were used as the evaluation metrics.
Flair [21], a machine learning library for natural language processing, was used to implement the entity extractor. The pretrained Mat-ELMo model [17] was not updated during training because catastrophic forgetting [22] occurred in a preliminary experiment due to the small size of the corpus. We used stochastic gradient descent for training. The learning rate was halved from its initial value of 0.1 when there was no improvement in performance over three epochs, and training was stopped when the learning rate fell below 0.0001. The model that showed the best performance on the development data was used.
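The learning-rate schedule described above can be sketched as follows (a simplified stand-in for the scheduler provided by the training library):

```python
# Halve the learning rate after three epochs without improvement on the
# development score; stop once the rate drops below the floor of 1e-4.
def train_schedule(dev_scores, lr=0.1, patience=3, floor=1e-4):
    best, stale, lrs = float("-inf"), 0, []
    for score in dev_scores:
        if lr < floor:
            break  # training stops when the learning rate is too small
        lrs.append(lr)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                lr, stale = lr / 2, 0
    return lrs

scores = [0.70, 0.72, 0.71, 0.71, 0.71, 0.73]
print(train_schedule(scores))  # [0.1, 0.1, 0.1, 0.1, 0.1, 0.05]
```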
To verify the effectiveness of this setting for the corpus, we compared our entity extractor with BERT (Bidirectional Encoder Representations from Transformers)-base [23] and ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)-base [24], which have shown excellent performance in various natural language processing tasks. BERT is a transformer-based model pretrained by masked language modeling on a large amount of text [25]. ELECTRA is similar to BERT but has shown higher performance due to an improved training method. To evaluate these transformer-based models, the embeddings of our entity extractor were replaced with these models, and training was performed in the same way.
We evaluated the performance of the relation extractor separately from the entity extractor by providing gold entities to verify the validity of the developed rules. This is the same setting as the one in which the rules were developed. We also evaluated the performance of both extractors in the pipeline to determine the performance of the extraction system, i.e., the output of the entity extractor is fed into the relation extractor.

B. EVALUATION OF THE ENTITY EXTRACTOR
The evaluation results of the entity extractors, including the transformer-based models, are shown in Table 6. Our entity extractor achieved the highest performance, showing that it is important to use a model pretrained on literature from the same domain as the target corpus (i.e., the materials science field). From these results, we confirmed that Mat-ELMo is a suitable pretrained model for extracting entities from the materials science literature, which aligns with the finding of Kuniyoshi et al. [7]. As the transformer-based models use subword tokenization obtained from general text such as Wikipedia, they might have difficulties in obtaining representations specific to the materials field.
As detailed results, the evaluation for each class and a confusion matrix are shown in Table 7 and Figure 2, respectively. Overall, classes with little training data (cf. Table 1), classes with many variations in their mentions, and classes that are ambiguous with other classes tended to have lower performance. Entities that form the backbone of procedures (i.e., Operation and Material entities) were extracted with high performance, while Target-Material entities were extracted with low performance. This is because Target-Material is often confused with Material and is written in cumbersome notation; in other words, such entities may contain symbols such as hyphens and parentheses. As for entities that provide supplementary information for procedures, numbers were extracted with high performance since they have few variations. Conversely, .*_Misc and .*_Type entities have many variations in their mentions and were extracted with errors. Moreover, Apparatus-Property-Type was not extracted at all. This is because, from our observation, the mentions of Apparatus-Property-Type are ambiguous, and there are only a few annotations and no examples in the development set; thus, we were not able to check its performance beforehand. Note that Apparatus-Property-Type is unlikely to have a significant impact on the synthesis procedure, and it is not critical in our broad analysis of the procedures in the literature.

C. EVALUATION OF THE RELATION EXTRACTOR
The results for the rule-based relation extractor with gold entities are shown in Table 8. Our extractor with gold entities obtained a micro-F of more than 0.8 without using any sophisticated deep-learning methods. The descriptions of the procedures are written in certain patterns (e.g., step-by-step explanation), and the simple rules were able to capture these patterns. On the other hand, the extraction performance of the full pipeline had a micro-F of 0.609, which shows that entity extraction errors in the input propagate to the relation predictions even for the rule-based extractor.
For the detailed evaluation, the performance of each relation class for both the given entity and pipeline settings is also shown in Table 8 and the confusion matrix for the case where the gold entity is given is shown in Figure 3.
For the results with gold entities, our extractor, which is based only on the class of the entity and the pattern of occurrence, performed extraction with high performance for classes that have few variations and for which we could easily create rules. For instance, Next_Operation and Recipe_Target are classes that describe the general flow of the synthesis procedure. The rules were created based on the observation of instances in the training data, where most procedures are written down in a similar way. For Next_Operation, we constructed a rule that connects the Operations in order, assuming that all Operations are described in processing order. Under this rule, the recall of Next_Operation was 0.990, which indicates that our observation holds across the procedures. In addition, the performance on Condition_Of and Property_Of was high because they have only a limited number of linguistic patterns. In contrast, the performance on classes such as Recipe_Precursor, Participant_Material, Solvent_Material, and Atmospheric_Material, which were extracted by dictionary-based rules, was low because our extractor considers only the text without context and cannot handle a variety of occurrence patterns. Incidentally, Coref_Of was not extracted at all because no rule was established for this relation.
For the pipeline result, the Next_Operation class, which is a key relationship for the sequence of operations in the procedure, had an F-score above 0.6. This indicates that the main flow of the procedure can be extracted to some extent, even if the extraction error has propagated from the previous entity extractor. In contrast, classes such as Property_Of, Brand_Of, and Apparatus_Attr_Of had extremely low performance due to errors in entity extraction. However, since such relation classes have little effect on the analysis in Section IV, these low performances are not considered significant.

IV. LARGE-SCALE EXTRACTION
In this section, we discuss the application of the extraction system to a large set of literature that was collected and analyzed to extract a synthesis procedure graph. In Section IV-A, we discuss large-scale extraction. We analyzed the procedures extracted from a large body of literature and investigated their characteristics in subsequent sections. In this investigation, we aimed to verify the rationality of the extracted procedures and determine whether they can be used to obtain useful information for materials development or other tasks. We performed four analyses: exploring frequently occurring subgraphs to verify typical procedures (Section IV-B); checking the frequency of the elements in Target-Material to obtain the trends in the elements and materials used (Section IV-C); calculating the term frequency-inverse document frequency (TF-IDF), an index based on co-occurrence, to check the relationship between Target-Material and Operation (Section IV-D); and analyzing an extracted procedure as a case study (Section IV-E).

A. APPLICATION OF EXTRACTOR
Our system was applied to extract synthesis procedures from a large set of documents that were collected from the Journal of Materials Chemistry A (JMCA), which is published by The Royal Society of Chemistry and focuses on areas such as batteries, fuel cells, sustainable materials, photovoltaics, supercapacitors, and water splitting. These articles were used as the literature source for large-scale extraction. We purchased all articles from 2015 to 2019 in XML format. The total number of articles was 14,310.
To extract synthesis procedures from the large-scale literature, we performed the process described in Section II. First, as a preprocessing step, we extracted all paragraphs from the articles in XML format. Then, the entity and relation extractors developed on the materials science procedural text corpus were directly applied to extract entities and relations. Finally, we selected flow graphs representing the procedures from all paragraphs. Although this method was developed for the materials science procedural corpus, it can be applied seamlessly because the domain of the corpus matches the domain of JMCA in terms of materials, and the procedures are described similarly based on our observations. From the 14,310 articles, we obtained 347,480 paragraphs in total and extracted 89,578 procedures. Therefore, approximately one-quarter of the paragraphs contained procedures.
Surprisingly, each article mentioned approximately six procedures on average, even considering miscounts due to prediction errors. The statistics of the extracted entities and relations are shown in Tables 9 and 10, respectively.

B. FREQUENT SUBGRAPHS
We extracted subgraphs that frequently appear in the data set using gSpan [26], which is an algorithm for mining frequent subgraphs, and then checked whether they were reasonable. If the frequently occurring subgraphs are reasonable from the human point of view, the flow graphs of the extracted procedures can be said to contain typical procedures.
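gSpan itself mines arbitrary connected subgraphs and is considerably more involved; as a much-simplified illustration of the idea, the following sketch counts the support of labeled edges (2-node subgraphs) across a set of flow graphs:

```python
from collections import Counter

# Count how many graphs each labeled edge (2-node subgraph) occurs in and
# report the most frequent ones. Edges are (head, tail, relation) tuples.
def frequent_edges(graphs, top=3):
    counts = Counter()
    for edges in graphs:
        # count each distinct edge pattern once per graph (support count)
        counts.update(set(edges))
    return counts.most_common(top)

g1 = [("stir", "dry", "Next_Operation"), ("ethanol", "stir", "Solvent_Material")]
g2 = [("mix", "dry", "Next_Operation"), ("ethanol", "mix", "Solvent_Material")]
g3 = [("stir", "dry", "Next_Operation")]
print(frequent_edges([g1, g2, g3]))  # ('stir', 'dry', 'Next_Operation') ranks first
```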
The top 10 frequent subgraphs that contain Operation entities are shown in Figure 4 for each number of nodes in the subgraphs. The frequent subgraphs in the figure show examples of general procedures; for example, the first and sixth most-frequent 4-node subgraphs indicate that compounds are heated and dried at 60 °C to remove ethanol, while the fourth most-frequent 5-node subgraph shows that 60 °C is appropriate for drying ethanol. From this observation, the graphs in the figure clearly show general processes, and we can conclude that the extracted procedures include typical processes.

C. ANALYSIS OF ELEMENTS IN TARGET-MATERIAL
We analyzed Target-Material to test the feasibility of the extracted procedures. In our results, every procedure has a Target-Material because those without one were filtered out. It was difficult to analyze them directly because Target-Material is unique in most of the literature. Therefore, we analyzed the elements included in Target-Material.
As a basic step, the included elements were counted: the element names were extracted from left to right, ignoring valence and other numerical values. We filtered out Target-Material determined to be composed of one element or none to avoid the effect of erroneous extraction.
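The left-to-right counting step can be sketched as below. The ELEMENTS set is a small hypothetical subset of the periodic table used for illustration, and the sketch shares the ''SnO2NP''-style ambiguity discussed in the text:

```python
import re

# Extract element symbols from a Target-Material formula, left to right,
# ignoring numeric subscripts and valence.
ELEMENTS = {"H", "C", "N", "O", "P", "S", "Sn", "Ni", "Ti", "Li", "Fe", "Co"}

def parse_elements(formula):
    found = []
    for sym in re.findall(r"[A-Z][a-z]?", formula):
        if sym in ELEMENTS:
            found.append(sym)
        elif sym[:1] in ELEMENTS:  # fall back to the one-letter symbol
            found.append(sym[:1])
    return found

print(parse_elements("LiFePO4"))  # ['Li', 'Fe', 'P', 'O']
print(parse_elements("SnO2NP"))   # ['Sn', 'O', 'N', 'P'] - "NP" misread as elements
```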
The frequency of the elements is summarized in Figure 5. The elements were normalized using the number of Target-Material entities and scaled logarithmically. Elements with a single letter may be overestimated owing to errors in parsing elements contained in Target-Material. For example, tin (IV) oxide (SnO2) nanoparticles mentioned as ''SnO2NP'' are recognized as a combination of SnO2, nitrogen (N), and phosphorus (P), even though in this case, NP represents nanoparticles.
The results show that elements O, C, N, P, and S, which are commonly contained in various compounds, and Ni and Ti, which are conventionally used in electrodes, appear frequently. Moreover, it was confirmed that the less-reactive noble gases (elements in the rightmost column of Figure 5) and difficult-to-obtain high-period elements (third row from the bottom) are not widely used in Target-Material. Considering that the literature set used for extraction included the topic of solar cell electrodes, the group 6 elements W and Mo were often identified, as they are often used for this application. Therefore, the extraction results are in accordance with the domain of the dataset. The extraction performance for Target-Material was not high (Table 7), but the general trend is considered reasonable.

D. CORRELATION BETWEEN TARGET-MATERIAL AND OPERATION
We calculated the TF-IDF values of Operations for each element in Target-Material to analyze the correlations between the Operations and elements. TF-IDF is calculated based on the frequency with which a term appears and the uniqueness of the term to a document. Since the purpose of this analysis was to discover Operations unique to each Target-Material and to find the Operations that may be required to construct that material, TF-IDF is a suitable metric. To identify unique Operations, we ranked the TF-IDF scores for all elements in Target-Material, where the top-ranked Operations were considered important for constructing the Target-Material.
Here, the Operations in a procedure are considered as terms in a document, and each element in the Target-Material of a procedure is considered as a document. A document D_e corresponding to an element e is defined as

D_e = ⋃_{p ∈ P} (O_p ∩ U[e ∈ E_p]),

where P is the set of all procedures, O_p is the set of lemmatized Operations in a procedure p, E_p is the set of elements in the Target-Material of a procedure p, and U[·] is a function that returns a universal set U if the condition in the bracket is satisfied and an empty set φ otherwise; therefore, D_e aggregates the Operations of the procedures that synthesize Target-Material containing element e.
The TF-IDF score TFIDF(d, t) of a term t in a document d is defined as

TFIDF(d, t) = (count(d, t) / Σ_{t'} count(d, t')) × log(|D| / Σ_{d' ∈ D} 1[t ∈ d']),

where count(d, t) is a function that returns the count of a term t in a document d, D is the set of all documents, and 1[·] is a function that returns 1 if the condition in the bracket is satisfied and 0 otherwise.
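A minimal sketch of this computation follows; the per-element documents and the exact TF-IDF variant are illustrative assumptions consistent with the definitions above:

```python
import math
from collections import Counter

# TF-IDF of an Operation within a per-element "document": term frequency in
# that document times the log inverse document frequency over all elements.
def tfidf(docs, elem, term):
    # docs: {element: list of lemmatized Operations aggregated over procedures}
    d = docs[elem]
    tf = Counter(d)[term] / len(d)
    df = sum(1 for ops in docs.values() if term in ops)  # document frequency
    idf = math.log(len(docs) / df)
    return tf * idf

docs = {"O": ["stir", "dry", "anneal"], "C": ["stir", "dry"], "Ni": ["grow", "anneal"]}
print(round(tfidf(docs, "Ni", "grow"), 3))  # 0.549
```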
The top 20 Operations ranked by TF-IDF score are shown in Table 11 for the frequent typical elements (O and C) and the metallic elements (Co, Ni, and Ti) shown in Figure 5. We lemmatized the Operations using scispaCy [27] before calculating the TF-IDF score. We observed general Operations among the top 10 Operations, whereas the 11th to 20th Operations are characteristic of each element. As a reference, we examined the Operations that occur with the O element, which is one of the most common elements in the analysis. In addition, ''grow'', which appears in the columns of Co and Ni, is a characteristic Operation, as it is used when sheets or crystals are created. Likewise, ''anneal'', which is seen for the metallic elements, is a unique Operation for metals. These observations of the Operations are consistent with general knowledge in the field.

E. CASE STUDY OF EXTRACTED PROCEDURES
An example of an extracted procedure is shown at the bottom of Figure 1. In this study, we sampled 10 documents and checked the 64 procedures included in them to test the feasibility of the extracted procedures. The procedures, manually chosen from the samples, provided information and conditions for each operation, as well as how the Target-Material was produced from the input materials. However, several errors were identified: for example, ''0.075 mmol'' in the middle of a paragraph was not extracted as an entity, and ''water'' and ''ethanol'', which were the Solvent_Material of ''washed'', were incorrectly connected to ''placed'' by the relation extractor. Nevertheless, the general framework of the procedure was extracted, and the analysis of a large number of extracted procedures is expected to contribute to the development of materials.
In contrast to these methods, deep-learning-based methods, which have made progress in recent years, can extract named entities with even higher accuracy and flexibility. There are various entity extraction methods, such as span-based models [36], transition-based models [39], [40], and sequence-to-sequence methods [38], but the most basic ones are sequence-labeling-based methods. The Flair library [21] used here also applies sequence labeling with a BiLSTM-CRF model [16], which uses representations from pretrained models [19], [23].
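Sequence labeling casts entity extraction as per-token tagging: the model emits BIO tags that are then decoded into entity spans. A minimal sketch of that decoding step in Python (the tag names are illustrative, not the paper's exact label set):

```python
def decode_bio(tokens, tags):
    """Convert parallel token/BIO-tag lists into (type, text) entity spans."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity begins
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)             # continue the open entity
        else:                                    # "O" tag or inconsistent "I-"
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ["Dissolve", "LiOH", "in", "distilled", "water"]
tags = ["B-Operation", "B-Material", "O", "B-Material", "I-Material"]
# decode_bio(tokens, tags) → [("Operation", "Dissolve"),
#                             ("Material", "LiOH"), ("Material", "distilled water")]
```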

B. RELATION EXTRACTION
After named entity extraction, relation extraction is performed to extract the relations between entities. While most existing research on relation extraction has focused only on relations within a single sentence, recent studies have extended the task to document-level relation extraction, which extracts relations across sentences. Document-level relations are generally more complex than sentence-level ones and have been studied using rule-based [41] and shallow machine-learning-based methods [42], [43]. Deep-learning-based methods [44]–[46] are also often used to extract document-level relations; for example, relation classification is performed based on context-embedded representations [44] or by representing the document as a graph [45], [46]. However, for the extraction target of this study, we used a rule-based extractor because our observations confirmed a clear pattern in how the relations appear.
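As one concrete illustration of such a pattern-based rule, an extractor might attach each Material to the nearest preceding Operation. The sketch below shows the idea only; the rule, entity layout, and relation label are hypothetical, not the paper's actual rule set:

```python
def link_materials_to_operations(entities):
    """entities: list of (position, type, text) tuples.
    Rule sketch: attach each Material to the nearest preceding Operation."""
    relations, last_op = [], None
    for pos, etype, text in sorted(entities):
        if etype == "Operation":
            last_op = text
        elif etype == "Material" and last_op:
            relations.append((last_op, "uses", text))
    return relations

entities = [
    (0, "Operation", "mixed"),
    (1, "Material", "TiO2"),
    (2, "Material", "ethanol"),
    (3, "Operation", "heated"),
    (4, "Material", "the mixture"),
]
relations = link_materials_to_operations(entities)
# → [("mixed", "uses", "TiO2"), ("mixed", "uses", "ethanol"),
#    ("heated", "uses", "the mixture")]
```

A real extractor would add constraints (sentence boundaries, entity classes such as Solvent_Material), but the proximity pattern above is the kind of regularity that makes a rule-based choice attractive.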

C. INFORMATION EXTRACTION FOR MATERIALS SCIENCE
Several studies have attempted to extract material names with their properties [47], [48] and material synthesis procedures [7]–[11], [13] by natural language processing. Swain et al. extracted chemical entities and their properties from scientific documents [47]. Mysore et al. tried to obtain inorganic material synthesis procedures to address the situation in which research on inorganic materials has not progressed as far as research on organic materials [13]. However, large-scale extraction and analysis of the extracted synthesis procedures have not been performed. Kuniyoshi et al. also extracted synthesis procedures for inorganic materials, especially for all-solid-state batteries [7].
Material synthesis procedures are defined in various ways [7], [9], [12], but the main elements to be extracted are common: target materials; ingredients of target materials; operations such as mixing and annealing; and conditions such as temperature and time. These elements are the basis of materials synthesis. Analyses of the synthesis procedures in the literature have provided information about these materials for practical applications. For example, Mahbub et al. analyzed the experimental conditions to gain insight into solid-state battery materials from a database constructed from information in the literature [49]. In addition, Kim et al. predicted precursor materials of perovskite using text embedding [17], and Segler et al. designed synthesis procedures for organic molecules by predicting precursors from target materials [50]. These studies demonstrate the utility of materials synthesis procedures in the literature.
Moreover, studies analyzing the extracted material information have also been conducted. Saal et al. developed a machine-learning-based model to predict new compounds from database information [51]. Raccuglia et al. predicted reaction outcomes for the crystallization of templated vanadium selenites from experimental notebooks [52]. However, analysis of the synthesis procedures has not progressed significantly; in particular, no research has analyzed the procedures as flow graphs. In this context, the current study makes a significant contribution to the field because we analyzed the synthesis procedures as graphs extracted from a large amount of literature.

VI. CONCLUSION
In this study, we constructed a pipeline system for extracting synthesis procedures from large literature datasets and analyzed the obtained synthesis procedures. Our pipeline extractor consists of: preprocessing; an entity extractor (comprising Mat-ELMo and Bi-LSTM-CRF models) with an F-measure of 0.807; a rule-based relation extractor (F-measure of 0.830); and valid procedure selection, giving an overall performance of 0.609.
We analyzed materials synthesis procedures in a graph form to consider synthesis as a linked process, not a set of individual processes. In the analysis, we confirmed that the extracted procedures were reasonable because they contained typical processes as subgraphs, the characteristics of each element were indicated, and the correlations between target materials and operations were reasonable.

VII. FUTURE WORK
For further study, we will attempt to improve the performance of procedure extraction and provide more insights into the synthesis procedures, such as using them as guides for discovering new materials. Since only a limited number of classes can be extracted with high performance, improving the extraction performance of the system is our next challenge for obtaining more accurate procedures. This could be achieved by improving entity extraction through normalization as a preprocessing step, e.g., replacing materials written as chemical formulas with special tokens or resolving coreferences. In addition, the relation extractor could be improved by applying a flexible method, such as deep learning, to identify classes that are difficult to distinguish with rules. Furthermore, we will continue to analyze the procedures extracted by the system and develop materials based on the obtained knowledge.
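The formula-normalization idea could be sketched as a simple regex preprocessing step. The pattern and the special token below are illustrative assumptions, not the system's implementation:

```python
import re

# Rough pattern for simple chemical formulas such as "LiCoO2" or "H2O":
# two or more element symbols, each optionally followed by a count.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")

def normalize_formulas(text, token="[MAT]"):
    """Replace chemical-formula-like strings with a special token."""
    return FORMULA.sub(token, text)

normalize_formulas("LiCoO2 was dissolved in H2O and stirred.")
# → "[MAT] was dissolved in [MAT] and stirred."
```

A production version would need a curated element list and disambiguation against ordinary capitalized words, but even this crude substitution shows how formula variants could be collapsed into one vocabulary item for the entity extractor.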
KOHEI MAKINO received the B.E. and M.E. degrees from the Toyota Technological Institute, Nagoya, Aichi, Japan, where he is currently pursuing the Ph.D. degree.
From 2019 to 2021, he was a Research Assistant with the National Institute of Advanced Industrial Science and Technology. His research interests include deep learning, natural language processing, and information extraction.
FUSATAKA KUNIYOSHI received the M.S. degree in computer science from the Nara Institute of Science and Technology, in 2017. He is currently a Researcher with the National Institute of Advanced Industrial Science and Technology and Panasonic Corporation. His research interests include natural language processing and computer vision.
JUN OZAWA received the Ph.D. degree in system science from the Tokyo Institute of Technology, Yokohama, Japan, in 1998. Since 1990, he has been a Researcher at Panasonic. He is currently the Director of Panasonic-AIST Advanced AI Research Laboratory, National Institute of Advanced Industrial Science and Technology. His research interests include machine learning and its industrial applications.
MAKOTO MIWA received the Ph.D. degree from The University of Tokyo, in 2008. He is currently an Associate Professor with the Toyota Technological Institute and a Visiting Researcher with the National Institute of Advanced Industrial Science and Technology. His research interests include natural language processing, deep learning, and information extraction.