FLAG-PDFe: Features Oriented Metadata Extraction Framework for Scientific Publications

The unprecedented growth of research publications in diversified domains has overwhelmed the research community. Manually analyzing these documents to extract the enormous amount of information they contain is a cumbersome process. To automatically extract the content of a document in a structured way, its metadata and content must be annotated. The scientific community has been focusing on automatic content extraction by forming different heuristics and applying different machine learning techniques. One of the renowned conference organizers, ESWC, organizes a state-of-the-art challenge to extract metadata such as authors, affiliations, countries in affiliations, supplementary material, sections, tables, figures, funding agencies, and EU-funded projects from PDF files of research articles. We propose a feature-centric technique that can be used to extract the logical layout structure of articles from publishers with diversified composition styles. To extract unique metadata from a research article placed in the logical layout structure, we have developed a novel four-stage approach, “FLAG-PDFe”. The approach is built upon distinct and generic features based on the textual and geometric information in the raw content of research documents. In the first stage, the distinct features are used to identify the different physical layout components of an individual article. Since research journals follow their own publishing styles and layout formats, in the second stage we develop generic features to handle these diversified publishing patterns. In the third stage, we employ support vector classification (SVC) to extract the logical layout structure (LLS), i.e., the sections of an article, after performing a comprehensive evaluation of generic features and machine learning models. Finally, we apply heuristics on the LLS to extract the desired metadata of an article. The outcomes of the study are obtained using the gold standard dataset. The results yield a recall of 0.877, a precision of 0.928, and an F-measure of 0.897. Our approach achieves a 16% gain in F-measure over the best approach of the ESWC challenge.


I. INTRODUCTION
The body of research available on the web grows rapidly, with millions of research articles published annually [1]-[3]. These cross-disciplinary publications are linked through online citation indexes so that the research community can establish their relevance to the literature. Scholars often formulate queries based on complex scenarios to retrieve the research documents they need from this colossal scientific resource. Researchers post their queries to find scholarly articles on popular online search engines like Google Scholar 1 or Semantic Scholar, 2 and renowned digital libraries like DBLP 3 or ACM. 4 However, these platforms cannot process such queries intelligently, and they return a surplus of results. This is because these search engines rely on citation indexes and full-text search to retrieve information, while one potentially valuable aspect, structural information, is overlooked. Therefore, human-understandable metadata such as author name, affiliation, country, email, section headings with levels, funding agency, and table and figure captions must be indexed and stored in a machine-comprehensible form to facilitate the processing of metadata-based queries. In this context, metadata extraction tools have gained popularity for extracting and storing the content of research articles in a machine-recognizable form to serve precise semantic queries. The research community currently regards metadata extraction from PDF files as a great challenge. Every year, various efforts are made in the form of well-known conferences like SemPub, 5 CLSA, 6 OKE, 7 QALD, 8 and RecSys, 9 with the objective of improving the quality of linked data [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Qingli Li.
Research document structure analysis and information extraction has been a well-researched area due to the increase of publications in diversified domains. Currently, information extraction methods are built on machine learning and heuristic-based approaches. Machine learning techniques rely on a group of fine-tuned parameters to learn good feature representations for structure extraction. These techniques are sub-categorized into models built using support vector machines (SVM), conditional random fields (CRF), decision trees, and deep learning based algorithms for feature extraction and semantic detection on text documents [5]. However, they require a large tagged dataset for training, cover limited aspects of natural language processing, and offer limited performance guarantees. Initial work shows that heuristic-based approaches perform better because they are built on natural language processing and regular expressions. These approaches are constructed on a pre-defined set of rules and require domain knowledge for diversified data. Therefore, the rules must be updated every time documents from a new publisher are processed.
The document layout and its elements are composed using the geometric location and font properties of the text, which vary across publishers. The text in a research document has different font attributes, which can uniquely identify a group of elements. These distinct features are discussed in detail in sect. 3.3. Generic features also contribute to the development of probabilistic models in other applications; for example, Zare et al. [6] investigated the influence of features on the detection of community structures. We propose a novel four-stage approach, ''FLAG-PDFe'', which uses distinct physical layout properties and generic logical layout features to transform PDF-based research documents into a metadata- and layout-aware format.
The first stage reads and extracts textual information from born-digital PDF files. It reads the PDF file as a raw stream of data and extracts text, along with its font properties, encapsulated in bounding boxes that carry geometric layout coordinates. The output is a set of text chunks in incorrect reading order. We correct the reading order in this stage by first identifying the column layout style of the document and then calculating the line number of each line by measuring the distance from neighboring text chunks. These are the physical layout properties, which are distinct in every research document. We call this the pre-processing stage; it generates text blocks with font properties, geometric location, column style, and correct reading order. The second stage extracts the feature set that the classification algorithm uses in the next stage to extract the logical layout structure (LLS) elements. The system processes the textual and physical layout properties of the extracted text content to generate generic feature sets. We studied the formatting styles of different publishers and proposed a set of features that can be used to extract the LLS from articles with diversified layout and formatting styles. The third stage uses the support vector classification (SVC) algorithm [7], [8] to extract the different sections of the document. For model selection, we performed a systematic study of different machine learning algorithms and feature selection. The final stage extracts metadata from the LLS sections identified in the previous stage. This metadata consists of author name and affiliation, country of affiliation, supplementary material, table and figure captions, funding agency, and funded projects. The extracted metadata is stored in a CSV file for comparison with the state of the art. We have used a diversified and comprehensive dataset to evaluate our proposed methodology. For this purpose, we use the dataset of ESWC-2016 10 (European Semantic Web Conference), which conducted a semantic challenge titled ''Extracting information from the PDF full text of the papers'' and publicly released benchmark datasets along with the gold standards, available at the link. 11 We compared our results with those published by the conference organizers, including the challenge's winner [9]. The results yield a recall of 0.877, a precision of 0.928, and an F-measure of 0.897.
The subsequent section discusses the background and reviews previous work in detail (sect. 2). The architecture and the approaches proposed to extract metadata and section information are comprehensively explained in the methodology section (sect. 3).

II. LITERATURE REVIEW
Metadata and structure extraction from PDF-based documents has been a well-explored research area since the emergence of early online search engines for scholarly articles, such as CiteSeerX [10]-[18]. A PDF file is stored as raw binary data and lacks structured information tags or metadata that identify its different layout components. It requires further processing to correct the reading order and remove intercepting objects. Another prominent obstacle is the diversity of document layout styles and textual features adopted by different scientific publishers. Initially, document structure and content were extracted using template-based techniques, but researchers later proposed supervised machine learning techniques, specifically the linear conditional random field (CRF) [19], to replace rule-based template matching. Bijari et al. [20] introduced a hybrid algorithm based on heuristics and clustering, using BB-BC and k-means to overcome the shortcomings of k-means in text mining. ParsCit [21] adopted CRF to extract layout and bibliographic metadata from a research document, and SectLabel further explored CRF to identify the different contents of a research document. Later on, ParsCit improved its technique by adopting LSTM [22]. CERMINE [23] compared its bibliographic metadata and layout extraction approaches with popular approaches of that era and outperformed PDFExtract [24] in bibliographic information extraction. Recently, the CiteSeerX 12 team introduced PDFMEF [25], which blends artifacts of their existing approaches into a framework.
10 https://2016.eswc-conferences.org/
11 https://github.com/ceurws/lod/wiki/SemPub16_Task2

A. RULE BASED TECHNIQUES
Rule-based approaches require a dataset to build a set of rules constructed upon natural language processing, regular expressions, and domain knowledge. Constantin et al. [26] proposed a two-stage rule-based system (PDFX) that uses text features and characteristics to convert PDF documents into an XML structure. Klink and Kieninger [27] proposed a rule-based approach that combines textual features on OCR-based documents. Similarly, Déjean and Meunier [28] proposed a method for transforming legacy PDF files into structured XML files. Ramakrishna et al. [29] introduced LA-PDFText, a layout-aware system to facilitate text mining in the biomedical domain. Recently, Ahmad et al. [9] constructed a heuristics-based approach that effectively combines tagged and plain-text based information extraction techniques. These approaches rely heavily on regular expressions and text pattern matching. Heuristics-based approaches require a predefined set of rules and text patterns to identify the different elements of a research document. Hence, a huge set of rules has to be maintained for diversified datasets, and overlapping rules become hard to manage. Furthermore, domain-specific knowledge is required to apply them to a diverse dataset.

B. MACHINE-LEARNING TECHNIQUES
Supervised machine learning approaches generally use classification models, which must be trained on data tagged according to unique features. Only a limited number of unsupervised machine learning algorithms are used for metadata extraction, as clustering algorithms are not well suited to such cases. Granitzer et al. [30] investigated the use of SVM and CRF in the real-world systems ParsCit and Mendeley Desktop for automatically extracting bibliographic metadata. Tkaczyk et al. [31] presented an adaptive modular workflow for the extraction of metadata from born-digital scholarly articles. Huy Hoang Nhat Do et al. introduced Enlil [32], which uses CRF to identify authors and author affiliations and SVM to discover the relationships of authors with their respective institutions. Kiss and Strunk [33] proposed an unsupervised approach to detect sentence boundaries in a language-independent way by using abbreviations. Klampfl et al. [34] proposed an unsupervised approach to extract content from presentation-optimized scientific documents that lack structural information. The approach extracts adjacent text blocks from the PDF file by identifying their geometrical relationships, and further classifies them to derive logical structures. Tsai et al. [35] used an unsupervised bootstrapping algorithm to categorize and identify scientific research by transforming citation contexts into coherent concepts.
12 http://csxstatic.ist.psu.edu/
Previous approaches are mostly built on datasets of research articles from a single publisher; hence, they produce optimal test results on such data. Their performance drops when articles from different publishers are tested. The feature sets of most techniques are not well defined. In most cases, the annotated benchmark dataset and the evaluation tool are not available, so a comprehensive analysis cannot be performed. The datasets selected to evaluate our proposed technique have diversified publishing styles and unique metadata requirements. FLAG-PDFe outperformed renowned techniques when evaluated on the selected dataset. We have made the following contributions in this regard:
1) The proposed technique generates a well-defined feature set identified at two levels: the first is the physical layout and textual properties of an individual research article, which are then used to develop the generic set of features. These features can be used to extract the logical layout content of articles from publishers with diversified composition styles.
2) Our technique evaluates all the logical layout content present in an article, unlike other techniques, which are task specific.
3) It does not depend on a single feature set, since in a few scenarios the physical properties are not extracted correctly from a PDF file.
4) We have proposed a scalable multistage framework, so future updates can be handled at any level.
5) The technique extracts unique metadata hidden in the content of the logical layout structure.
6) The technique is evaluated using a gold-standard dataset with an evaluation tool, both of which are publicly available online.

III. METHODOLOGY
We covered hypothetical, theoretical, and experimental aspects in our research methodology. Hypothetically, the content of research documents is presented in different layout and formatting styles, which makes it easier for humans to comprehend the different parts and sections of a document. Most documents share common formatting styles, which makes them easily readable. Theoretically, we analyzed the formatting styles of different publishers and established that these layout and formatting styles can be used to extract metadata from research articles. Since this important information and these layout components require annotation, we categorized the formatting styles into two types of structural components: physical layout components and logical layout structure components. The physical layout is based on an individual article's distinct features, which consist of textual properties, geometric boundaries, paragraphs, column styles, floating objects, headers and footers, etc. The logical layout structures (LLS) are generic formatting features that identify the different parts, contents, and sections of an article required by the publisher. The flow diagram of the system and the proposed methodology is shown in Figure 1.
FLAG-PDFe takes a research article in PDF format as input. The first stage extracts the physical layout of the PDF file and text chunks along with their geometrical location and font properties. These text chunks are processed and organized into text blocks that have the correct reading order and are aware of the document formatting style. In the second stage, the geometric and textual properties of the text blocks are used to create feature sets, which the classification algorithm uses in the third stage to extract the logical layout structure (LLS) components of the research article. Finally, heuristics are applied on the LLS to obtain the desired metadata, which is sorted and stored in CSV form. In the succeeding sections of the paper, every process is explained in chronological order, including the formulation and extraction of the textual information.

A. DATASET
In order to develop a comprehensive model that can be used on diversified publishing styles, we chose the published dataset of ESWC 2016 challenge task 2. Various gold standard datasets from the ESWC challenge are available at the link 13 along with an evaluation tool. This dataset consists of research articles with diversified formats and styles adopted from publishers like ACM, LNCS, and IEEE. The dataset has two parts: the training dataset (TD), which consists of 45 research articles, and the test dataset (ED), which consists of 40 research articles. Initially, we used the training dataset (TD) of ESWC for model construction. The model was evaluated on the test dataset (ED). The output for the ED consists of 320 CSV-formatted files. We evaluated the output of the proposed system at different stages on the basis of a comparison with the gold standard dataset. ESWC 2017 challenge task 2 14 published a test dataset (TD) containing 40 research articles. The conference organizers have not published an evaluation dataset (ED) or an evaluation of the proposed techniques. However, we have also used this TD dataset to further evaluate the performance of our proposed model.

B. PHYSICAL LAYOUT AND PRE-PROCESSING

1) PHYSICAL LAYOUT EXTRACTION
A PDF file is composed of raw binary data without any associated metadata or logical structural information that identifies the different layout categories of the content. Therefore, the first process is to extract the textual information from the PDF file. At this stage, we used iText [36], an open-source Java library that provides a fast and reliable method to extract content from PDF files. Unlike other processing tools that extract text as glyphs or a stream of characters, iText extracts chunks of textual elements, which reduces resource usage and computational cost. Further, iText implements an advanced strategy to extract structural components, namely text chunks, font properties, geometric locations, raster images, page numbers, and vector graphics. The text chunks are retrieved encapsulated in bounding boxes that identify their geometric position as (x, y) coordinates on the page along with height and width. The iText library returns font attributes such as font name, font size, bold, italic, orientation, etc. We used these attributes to generate the font properties feature set.

2) COLUMN STYLE IDENTIFICATION
Research documents are composed in single- or double-column style. This process identifies the column style of the document in order to determine the boundary of the main text body. The column layout style further helps to identify the geometric position and layout properties of the text blocks. The process first calculates the left and right outermost margins of the page. The left outermost margin is calculated as the MODE of the minimum values of the text blocks' geometric start points, and the right outermost margin is calculated as the MODE of the maximum values of the text blocks' geometric end points. Thereafter, the process calculates the number of columns present in the document. It starts from the left outermost margin and calculates the MODE of the maximum values of the text blocks' geometric end points. If this value is equal to the right outermost margin, the process stops; otherwise, it again computes the MODE of the minimum values of the text blocks' geometric start points and repeats until it finally reaches the right outermost margin. This process also helps to identify text blocks present in the form of decorations and footnotes.
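The mode-based margin and column search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FLAG-PDFe implementation: the function names, the rounding of coordinates to whole points, and the tolerance `tol` are hypothetical, and each line is reduced to its horizontal extent `(x_start, x_end)`.

```python
from collections import Counter


def mode(values):
    """Most frequent value after rounding to whole points (assumed granularity)."""
    return Counter(round(v) for v in values).most_common(1)[0][0]


def outer_margins(pages):
    """pages: one list of (x_start, x_end) line extents per page."""
    left = mode(min(x0 for x0, _ in page) for page in pages)
    right = mode(max(x1 for _, x1 in page) for page in pages)
    return left, right


def find_columns(lines, left, right, tol=2):
    """Walk from the left outer margin to the right one, one column at a time."""
    columns, start = [], left
    while True:
        # End points of the lines that begin at the current column start.
        ends = [x1 for x0, x1 in lines if abs(x0 - start) <= tol]
        if not ends:
            break
        end = mode(ends)
        columns.append((start, end))
        if abs(end - right) <= tol:
            break  # reached the right outer margin
        # Start points of the lines lying to the right of the current column.
        starts = [x0 for x0, _ in lines if x0 > end + tol]
        if not starts:
            break
        start = mode(starts)
    return columns


# Hypothetical two-column page with a title that spans both columns.
page = [(72, 300)] * 5 + [(312, 540)] * 5 + [(100, 500)]
left, right = outer_margins([page])
columns = find_columns(page, left, right)
```

Lines that match no detected column, such as the title in this example, are the kind of blocks the text treats separately as decorations, footnotes, or column-zero content.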

3) CORRECT READING ORDER
Earlier systems use the heuristics-based X-Y cut algorithm and the KNN-based Docstrum algorithm to correct the reading order of the document. Although the output of the iText library is mostly in the correct rendering order, it contains irregularities in the text reading order. These irregularities are due to in-text citations, algorithms, table content, the content of vector-graphics-based figures, special characters, and floating text objects. We corrected the reading order of text chunks using the geometric distance between neighboring text and the rendering order. The process derives words from the scattered chunks of text on the basis of their geometrical location and the physical distance among them. The words are grouped together to form lines while retaining the text features of the individual text chunks. This process produces plain text with no relationship between words, lines, and paragraphs. Line numbers are assigned by computing the reading order and rendering order of the content of the text blocks; text with the same geometrical position and column receives the same line number.
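A simplified sketch of the line-number assignment is given below. It assumes each chunk already carries its column number and a vertical position measured from the top of the page; the `Chunk` type, the field names, and the tolerance `y_tol` are hypothetical and only illustrate the idea of giving chunks with the same vertical position and column the same line number.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x: float      # left edge of the chunk
    y: float      # distance of the chunk from the top of the page (assumed)
    column: int   # column number from column style identification


def assign_line_numbers(chunks, y_tol=2.0):
    """Group chunks into lines and number the lines in reading order."""
    ordered = sorted(chunks, key=lambda c: (c.column, c.y, c.x))
    lines = []
    for chunk in ordered:
        current = lines[-1] if lines else None
        same_line = (
            current is not None
            and current[0].column == chunk.column
            and abs(current[0].y - chunk.y) <= y_tol
        )
        if same_line:
            current.append(chunk)
        else:
            lines.append([chunk])
    return [
        (number, " ".join(c.text for c in sorted(line, key=lambda c: c.x)))
        for number, line in enumerate(lines, start=1)
    ]


chunks = [
    Chunk("world", 120, 100.4, 1),
    Chunk("Hello", 72, 100.0, 1),
    Chunk("Right", 312, 100.0, 2),
    Chunk("Next", 72, 112.0, 1),
]
numbered = assign_line_numbers(chunks)
```

Reading column one completely before column two is the design choice that restores the order for double-column articles; floating objects would need additional handling that this sketch omits.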

C. FEATURE EXTRACTION
The logical layout consists of font and formatting styles. We have developed an approach that can recognize the different components of a document based on font, textual, and geometric properties. We analyzed all the layout variants present in the training dataset and built the feature set based on those properties. The feature set is used by the machine learning model in the next stage.

1) PHYSICAL AND LOGICAL LAYOUT EVALUATION
The logical layout determines the document's layout components, comprising the title, authors and affiliations, figure and table captions, headings and their levels, paragraphs, bibliography, etc. A PDF file most often lacks a metadata tag associated with each logical layout category to support automatic retrieval or identification of the required content. We have developed a framework to address this core issue by extracting the logical structure categories of PDF-based research articles to generate layout-aware output.

a: FONT PROPERTIES
The iText library extracts the font properties of the text characters or chunks from a PDF file. However, the font name carries all the font information as a concatenation of the iText font code, the font family name, and the bold or italic information, so it requires further processing to extract the individual font properties. Most often, individual categories of text blocks have different font features; for example, section headings are composed in bold or italic to make them prominent. The identification of the bold and italic properties varies with the font family. Font families such as ''Times'', ''Arial'', ''SegoeUI'', and ''Nimbus'' contain the keywords ''bold'' or ''italic'', whereas Computer Modern (''CM'') fonts use ''BX'' for bold and ''TI'' for italic, as represented in Figure 2.
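The font-name processing can be illustrated with the short sketch below. It assumes font names of the form `ABCDEF+Family-Style`, where the six-letter prefix is the subset tag that PDF producers prepend to embedded subset fonts; the function name and the returned dictionary are hypothetical, and only the keyword rules stated in the text are applied.

```python
import re

# Subset tag: six uppercase letters followed by a plus sign.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def parse_font_name(raw_name):
    """Split a raw font name into family, bold flag, and italic flag."""
    name = SUBSET_PREFIX.sub("", raw_name)
    if name.upper().startswith("CM"):
        # Computer Modern: style code follows the "CM" prefix (BX bold, TI italic).
        style = name[2:].upper()
        bold = style.startswith("BX")
        italic = style.startswith("TI")
    else:
        lowered = name.lower()
        bold = "bold" in lowered
        italic = "italic" in lowered
    family = re.split(r"[-,]", name)[0]
    return {"family": family, "bold": bold, "italic": italic}


times_bold = parse_font_name("ABCDEF+Times-Bold")
cm_bold = parse_font_name("CMBX10")
cm_italic = parse_font_name("XYZABC+CMTI10")
```

Other font families may encode style differently, so a production system would likely need a per-family table rather than these two rules.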

b: NEIGHBOR DISTANCE
The reading order helps to assign a line number to each individual text line. The line number enables the system to identify the sorted order of the main body content, section headings, and bibliography. However, the sorting of table numbers and figure numbers cannot be guaranteed. The feature measuring the distance of a line from the top and bottom of the page gives the sorting order of the content; this feature also enables the system to identify text blocks that are composed close to the far boundaries of the page. The distance from adjacent lines helps to identify continuity among text blocks to form paragraphs. This parameter is decisive when the font properties are the same between distinct text blocks. For example, section headings and figure and table captions can have the same text size and font properties as the body text, as illustrated in Figure 3: the text blocks share common text properties, but the distance of the table caption from the paragraph is larger than the distance among paragraph lines.
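As a rough illustration of the neighbor distance feature, the sketch below computes the gap to the previous and next line and flags gaps that are clearly larger than the typical line spacing. The use of the median as the typical spacing and the factor 1.5 are assumptions made for this example only; the feature names are hypothetical.

```python
from statistics import median


def neighbor_distances(line_tops, factor=1.5):
    """line_tops: top coordinates of consecutive lines of one column, in reading order."""
    gaps = [b - a for a, b in zip(line_tops, line_tops[1:])]
    typical = median(gaps)
    features = []
    for i in range(len(line_tops)):
        dist_prev = gaps[i - 1] if i > 0 else None
        dist_next = gaps[i] if i < len(gaps) else None
        features.append({
            "dist_prev": dist_prev,
            "dist_next": dist_next,
            # A large gap before a line hints at a new block (heading, caption).
            "gap_before": dist_prev is not None and dist_prev > factor * typical,
        })
    return features


# Four paragraph lines, then a caption set off by a larger gap.
features = neighbor_distances([100, 112, 124, 136, 160, 172])
```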

c: TEXT LOCATION
The column style helps to determine the text location features of text blocks based on the presence of a text block or line in a column. The single- or double-column style was identified during the identification of the external boundaries. In this stage, the location of each text block within a column is defined. Documents with a single-column style have all text blocks in column number one. With a double-column style, a text block can exist in column number one or two, and a text block that does not reside in any column, such as the title, is assigned column number zero. The In Column feature holds the column number of a text block. The align feature identifies the left, right, or center alignment of a text block within a column. Figure 4 shows the identification of the alignment of text blocks, where the main section heading is center aligned within a column, the starting line of each paragraph is right aligned, and the remaining lines are left aligned. The distance of the starting point of a text line from the column start is stored in the start indent feature, and the distance of its ending point from the column end is stored in the end indent feature.
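The location features named above (In Column, align, start indent, end indent) could be derived roughly as in the following sketch. The dictionary keys, the tolerance `tol`, and the extra `"indented"` label for lines that touch neither column border are assumptions for illustration and are not taken from the paper.

```python
def location_features(x_start, x_end, columns, tol=2.0):
    """columns: list of (col_start, col_end) boundaries, ordered left to right."""
    for number, (col_start, col_end) in enumerate(columns, start=1):
        if x_start >= col_start - tol and x_end <= col_end + tol:
            start_indent = x_start - col_start
            end_indent = col_end - x_end
            if start_indent > tol and abs(start_indent - end_indent) <= tol:
                align = "center"
            elif start_indent <= tol:
                align = "left"
            elif end_indent <= tol:
                align = "right"
            else:
                align = "indented"
            return {"in_column": number, "align": align,
                    "start_indent": start_indent, "end_indent": end_indent}
    # The block does not fit into any column (for example, the title).
    return {"in_column": 0, "align": None,
            "start_indent": None, "end_indent": None}


columns = [(72, 300), (312, 540)]
heading = location_features(150, 222, columns)
body = location_features(312, 500, columns)
title = location_features(100, 500, columns)
```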

d: FONT TYPOGRAPHY
The font typographical features facilitate the identification of the title, section headings, and heading levels. Research articles have section headings in different typographic styles, where the heading text is in capital case or title case. The identification of initial-capital words requires some pre-processing, as headings may contain prepositions in lowercase letters. Therefore, we excluded the prepositions and then checked for initial-capital phrases. In Figure 5, these text case features are found in the section heading or title of a research article. Another important typographic feature is the initial numeric value that defines the heading number or heading level. The heading numbers are defined either by an Arabic numeral or a Roman numeral, as shown in Figure 5. Subheadings in such scenarios have outline numbering styles; the system counts the number of dots and ignores a dot if it is present at the end of the number hierarchy.
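The two typographic checks can be sketched as follows: a case test that skips short function words, and a heading-number parser that counts dots after dropping a trailing one. The word list `SMALL_WORDS`, the regular expression, and the function names are hypothetical choices for this example; letter-numbered headings such as ''A.'' and Roman-numeral look-alikes are deliberately not handled.

```python
import re

# Assumed list of lowercase function words allowed inside title-case headings.
SMALL_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on",
               "for", "to", "with", "by", "at", "from"}
HEADING_NUMBER = re.compile(r"^((?:\d+(?:\.\d+)*|[IVXLCDM]+)\.?)\s+(.*)$")


def text_case(text):
    """Return 'upper', 'title', or 'other' after excluding small words."""
    words = [w for w in text.split()
             if w.lower() not in SMALL_WORDS and w[0].isalpha()]
    if not words:
        return "other"
    if all(w.isupper() for w in words):
        return "upper"
    if all(w[0].isupper() for w in words):
        return "title"
    return "other"


def heading_level(text):
    """Return (level, remaining text); level is None without a leading number."""
    match = HEADING_NUMBER.match(text)
    if not match:
        return None, text
    number = match.group(1).rstrip(".")  # ignore a dot at the end
    return number.count(".") + 1, match.group(2)


level, title = heading_level("3.2.1. Font Typography")
```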

e: LEXICAL PROPERTIES
Research documents have meaningful content that enables the system to identify the logical layout components. It has been observed that searching for keywords like ''Abstract'', ''Reference'', ''Bibliography'', ''General Terms'', ''Keywords'', and ''Acknowledgments'' can be an effective method to identify the relevant sections. The content before ''Abstract'' most often contains the title section of the document. Similarly, the content after the ''Reference'' heading holds the bibliographic information. The Acknowledgment section names the funding sponsors of the project, and we use it in a later part to extract the funding agency. A figure or table caption always starts with a keyword such as ''Figure'' or ''Table''. Such keywords may also occur at the start of a paragraph, but combining them with other textual features helps to identify captions. Email addresses always contain the ''@'' character, and a regular expression applied to text lines containing this special character can detect email addresses correctly.
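A minimal sketch of the lexical features is shown below. The email pattern is a deliberately simplified regular expression, not the one used by the authors, and the keyword prefixes, the `max_words` cut-off, and the function names are assumptions introduced for this example.

```python
import re

# Lowercase prefixes of the section keywords listed in the text.
SECTION_KEYWORDS = ("abstract", "reference", "bibliography",
                    "general terms", "keywords", "acknowledg")
# Simplified email pattern (assumption); real addresses can be more varied.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def section_keyword(line, max_words=3):
    """Return the matching keyword prefix for short heading-like lines."""
    normalized = line.strip().lower()
    if len(normalized.split()) > max_words:
        return None  # long lines are treated as body text
    for keyword in SECTION_KEYWORDS:
        if normalized.startswith(keyword):
            return keyword
    return None


def find_emails(line):
    """Return all email-like substrings of a text line."""
    return EMAIL.findall(line)


emails = find_emails("Contact: alice@example.org, bob.smith@uni.example.edu")
```

As the text notes, such keyword hits are only reliable in combination with the font and geometric features, since the same words can open an ordinary sentence.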

D. LOGICAL LAYOUT STRUCTURE EXTRACTION
The logical layout structure (LLS) defines the layout of an article's content, and all research publishers provide guidelines for authors to follow their layout and formatting style. The LLS components mostly include the title and authors section, section headings (TOC), headers and footers, table and figure captions, and the reference section. Different publishers adopt diverse formatting styles to mark these LLS components. Therefore, we have proposed a generic set of features that can identify these variations, so that machine learning algorithms can effectively extract the LLS for different publishers. In this section, the LLS components extracted by our proposed approach are presented.
The authors, affiliations, and countries of affiliation are present in the authors section. After manual evaluation of the research documents, we identified that the authors section is located after the article title and before the abstract section on the first page of the document. Therefore, the first part of structure extraction is the identification and labeling of the title, authors, authors' affiliations, emails, and countries of affiliation. This section has salient font, geometric, and lexical characteristics, which are helpful in identifying its content. To develop the model, the features were assigned to these text blocks and the sections were labelled.
The table of contents (TOC) and textual paragraphs are the major components of an article. The system needs to identify the heading of each section. The font and geometric features of headings differ from those of the body text; therefore, these features facilitate the identification of heading text. They comprise capital, bold, and italic font, geometric distance from the previous and current section body, geometric distance from the column's border, and numeric or Roman initials.
In research articles, captions are mostly present above or below the main body of tables and figures. The captions explain the content of their associated tables and figures. Captions usually differ from the main body text. However, in some styles the caption text has properties similar to the body text and only the caption number has dissimilar font properties. This posed a challenge for annotation because of overlapping features; therefore, we only tagged the initial value of the figure or table caption before the sequence number. We additionally used a keyword phrase feature to discover table and figure captions. A keyword phrase is marked by matching the initial word of a text chunk against keywords like Table, Tab, Figure, Fig, and Viz. We used a combination of these approaches for efficient extraction of the table and figure captions.
The acknowledgments section mostly has the heading ''Acknowledgment'' in the heading textual style. This section recognizes the funding agencies and the individuals who helped materialize the research work. We used both textual style and keywords to identify this section's heading, and the complete acknowledgment body is then marked for the next stage to extract the funding agency and EU project information.
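The keyword phrase feature for table and figure captions can be sketched as below. The keywords are the ones listed in the text; the regular expression, the requirement of a following Arabic or Roman number, and the mapping of ''Viz'' to the figure class are assumptions made for illustration.

```python
import re

# Leading caption keyword followed by an Arabic or Roman sequence number.
CAPTION = re.compile(
    r"^(Table|TABLE|Tab|Figure|FIGURE|Fig|FIG|Viz)\.?\s*(\d+|[IVXLC]+)\b"
)


def caption_label(line):
    """Return (kind, number) for caption-like lines, otherwise None."""
    match = CAPTION.match(line.strip())
    if not match:
        return None
    kind = "table" if match.group(1).lower().startswith("tab") else "figure"
    return kind, match.group(2)


figure_caption = caption_label("Figure 3. Neighbor distance of text blocks")
table_caption = caption_label("TABLE II: Evaluation results")
```

A body sentence that begins with ''Fig. 4 shows'' would match as well, which is why the text combines this keyword feature with font and geometric features before labeling a caption.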

1) DISCUSSION
Before setting up a machine learning algorithm to extract the logical layout structural components from diversified layout styles, a few theoretical points are considered regarding the properties of the features and the nature of the problem. As described earlier, the features are of different data types: numerical, nominal, and boolean. The problem and the feature properties show a non-linear relationship between the features. It is a classification problem with multiple class labels. The number of features is smaller than the number of training data instances ($n \gg p$). Based on these facts, we selected for evaluation only those machine learning algorithms that best fit non-linear, distinct features and multiple class labels on a large dataset. In the succeeding subsection, we present a brief overview of the machine learning algorithms that we evaluated for our proposed methodology; comprehensive details and computational complexities are available in [37]-[41].

2) THEORETICAL EVALUATION
The evaluation is based on $n$, the number of training samples, and $p$, the number of features. For tree-based classification algorithms, $n_{trees}$ represents the number of trees. Similarly, the number of support vectors is denoted by $n_{sv}$, and $n_{l_i}$ is the number of neurons at layer $i$ of a neural network.
The Naïve Bayes algorithm relies on conditional probability as given by Bayes' theorem and generates a probability-based structure known as a Bayesian network. Its features are assumed to be independent of each other. The time complexity is linear for both training, O(n * p), and prediction, O(p). The posterior probability P(A|B) is calculated as

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
The k-nearest-neighbor classifier is defined in terms of the distance between instances, each of which corresponds to a point in n-dimensional space. For an unknown tuple, it searches the pattern space for the k training tuples closest to it and classifies the tuple by a majority vote of these neighbors. Distance metrics such as the Euclidean, Manhattan and Minkowski distances are used to define "closeness". With a parallel implementation, the time complexity can be reduced to a constant, O(1), independent of the training dataset size |D|.
The decision tree classifier constructs the tree based on entropy and information gain using the ID3 algorithm, rather than the standard deviation reduction method. The nodes represent the attributes to be tested, while the branches represent their allowed values. A completely homogeneous sample has an entropy of zero, whereas a two-class sample divided into equal parts has an entropy of one. The time complexity to train the classifier is O(p * n log n), and that of prediction is O(p).
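The entropy and information gain criteria that ID3 uses to choose a splitting attribute can be made concrete with a short sketch. The function names and the dictionary-based row format below are illustrative assumptions, not part of the evaluated system.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # log2(n / c) is non-negative, so a homogeneous sample yields exactly 0.0.
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute`.

    rows: list of dicts mapping attribute names to values (assumed format).
    """
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

For example, a hypothetical boolean feature that perfectly separates headings from body text removes all uncertainty, so its information gain equals the entropy of the unsplit sample.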
The ensemble classifiers use a combination of models to increase accuracy [42]. Different methods can be applied, in which an improved model $M^*$ is created by combining a series of k learned models $\{M_1, M_2, M_3, \ldots, M_k\}$, trained on k sets $\{D_1, D_2, D_3, \ldots, D_k\}$ drawn from the data D. The bagging method takes a majority vote of the models to improve accuracy; the term bagging originates from "bootstrap aggregation". AdaBoost (adaptive boosting) assigns a weight to each training tuple and to each classifier's vote in order to boost the accuracy of the learned method. The weights are computed from the misclassification errors, so that each subsequent model focuses on the misclassified tuples. The weight of the vote of classifier $M_i$ is calculated as $\log\frac{1 - error(M_i)}{error(M_i)}$. Stacking is a heterogeneous ensemble that consists of different models. The idea is to combine the predictions of the base learners (level-0) not by a simple vote but by providing them as input to a meta-learner (level-1) model.
The random forest is an ensemble of decision tree classifiers, so that the collection of classifiers is a "forest". Each tree depends on independently sampled values, and all trees in the forest have the same distribution. Each tree casts a vote, and the most popular class is returned.
The support vector machine (SVM) classifies both linear and nonlinear data. It transforms the training data into a higher dimension using a nonlinear mapping and searches for a linear separating hyperplane in this new dimension. SVM uses support vectors to find this hyperplane. The tuples of different classes are separated by a "decision boundary" bounded by margins, and the hyperplane with the maximum distance between the margins and the classes is sought. Finding the maximum marginal hyperplane (MMH) and the support vectors is a quadratic optimization problem. For linear data, a linear SVM is employed, and for nonlinear data SVM provides a set of kernel functions $K(x, x')$ (the kernel trick).
In the kernel functions, the polynomial degree d is specified by the keyword degree, the independent term r by coef0, and the kernel coefficient γ by the keyword gamma, which must be greater than 0. The overall training time complexity for the kernel method is $O(n^2 p)$, and the prediction time complexity is $O(n_{sv}\, p)$ [43].
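The roles of d, r and γ can be illustrated with the four kernel functions named in the experimental setup (linear, polynomial, Gaussian RBF and sigmoid). The sketch below assumes the conventional definitions that the keywords degree, coef0 and gamma correspond to in scikit-learn's SVC; the function names are illustrative, and the code operates on plain Python lists rather than on the feature vectors used in this work.

```python
from math import exp, tanh


def dot(x, y):
    """Inner product <x, y> of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(x, x') = <x, x'>
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    # K(x, x') = (gamma * <x, x'> + r)^d, with r = coef0 and d = degree
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2), gamma > 0
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # K(x, x') = tanh(gamma * <x, x'> + r), with r = coef0
    return tanh(gamma * dot(x, y) + coef0)
```

A larger γ makes the RBF kernel decay faster with distance, which is one reason γ and C are usually tuned together.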
Neural networks [44] are non-deterministic algorithms that generalize well but have a minimal mathematical foundation. They are trained in an incremental fashion, and nontrivial multilayer perceptrons are used to model complex functions. Supervised, unsupervised and reinforcement learning are the three main learning paradigms of artificial neural networks. The time complexity is $O\left(n \cdot e \cdot \sum_{i=1}^{h-1} n_{l_i}\, n_{l_{i+1}}\right)$, where e is the number of epochs and h is the total number of layers in the neural network.

E. METADATA EXTRACTION
This is the final stage, which identifies the metadata and structural information of the research document. The desired metadata is extracted from the different logical layout structure (LLS) components. This stage applies heuristics to the content of the LLS to extract the metadata and stores it in machine-comprehensible form so that task-specific queries can be performed.

1) AUTHOR AND AUTHOR AFFILIATION
In the previous section, we used a machine learning approach to identify the different elements of the authors section. We observed that this information is available in three style formats:
1) A sequence of author names separated by commas or tab spacing, followed by a sequence of affiliations.
2) A sequence of author names marked with numbers or symbols and separated by commas or tab spacing, followed by a sequence of the related affiliations marked with the same numbers or symbols.
3) Groups consisting of an author's name, the author's affiliation, and email address.
The iText library provided an edge here, as its output text is rendered in the sequence of the above-mentioned format styles. We applied a parser based on regular expressions to separate the authors and assigned each of them a reference ID. This ID is based on the rendering sequence, numbers and symbols. Thereafter, the affiliations are assigned to the author IDs based on the same rendering sequence, numbers and symbols. The process generates a bipartite graph of authors and affiliations.
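As an illustration of the second style format, the following sketch links marked author names to marked affiliations and returns the edges of the resulting bipartite graph. It is a simplified example under stated assumptions: the patterns, the set of marker symbols, and the function names `parse_authors` and `link_authors` are hypothetical, and the other two formats are not handled.

```python
import re

# A name followed by optional numeric or symbol markers, e.g. "Jane Doe1,2".
_AUTHOR_RE = re.compile(r"^(?P<name>.*?)(?P<marks>[\d*†‡§,\s]*)$")
_MARK_RE = re.compile(r"\d+|[*†‡§]")
# An affiliation line that starts with its marker, e.g. "1 University of A".
_AFFIL_RE = re.compile(r"^\s*(?P<mark>\d+|[*†‡§])\s*(?P<text>.+)$")
# Split on tabs, on " and ", or on commas that are followed by a letter
# (a comma followed by a digit belongs to a marker list such as "1,2").
_SPLIT_RE = re.compile(r"\t+|,\s*(?=[^\W\d])|\s+and\s+")


def parse_authors(author_line):
    """Split a rendered author line into names with IDs and markers."""
    tokens = [t.strip() for t in _SPLIT_RE.split(author_line.strip()) if t.strip()]
    authors = []
    for idx, token in enumerate(tokens, start=1):
        match = _AUTHOR_RE.match(token)
        authors.append({
            "id": idx,  # reference ID based on rendering sequence
            "name": match.group("name").strip(),
            "marks": _MARK_RE.findall(match.group("marks")),
        })
    return authors


def link_authors(author_line, affiliation_lines):
    """Return (author name, affiliation) edges of the bipartite graph."""
    marked = {}
    for line in affiliation_lines:
        match = _AFFIL_RE.match(line)
        if match:
            marked[match.group("mark")] = match.group("text").strip()
    edges = []
    for author in parse_authors(author_line):
        for mark in author["marks"]:
            if mark in marked:
                edges.append((author["name"], marked[mark]))
    return edges
```

An author with several markers yields one edge per affiliation, which is what makes the structure a bipartite graph rather than a one-to-one mapping.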

2) COUNTRY OF AFFILIATION
A knowledge-based library is employed that contains country names, city names and country domain names such as de and uk. After an author's affiliation is retrieved, the country and city names are extracted by comparison with this library. If the affiliation lacks country information, we parse the domain name of the email address and compare it with the country domain names to extract the country of affiliation. Finally, a distinct list of countries is stored.
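The lookup order described above (country name, then city name, then email domain) can be sketched as follows. The three small dictionaries stand in for the knowledge-based library and are purely illustrative; the function name `country_of` is likewise an assumption.

```python
import re

# Tiny illustrative stand-ins for the knowledge-based library (assumed content).
COUNTRIES = {"germany": "Germany", "italy": "Italy", "united kingdom": "United Kingdom"}
CITIES = {"berlin": "Germany", "bologna": "Italy", "london": "United Kingdom"}
COUNTRY_DOMAINS = {"de": "Germany", "it": "Italy", "uk": "United Kingdom"}


def country_of(affiliation, email=None):
    """Return the country of an affiliation, or None if it cannot be resolved."""
    text = affiliation.lower()
    # 1) country names, 2) city names, matched as whole words
    for table in (COUNTRIES, CITIES):
        for key, country in table.items():
            if re.search(r"\b" + re.escape(key) + r"\b", text):
                return country
    # 3) fall back to the country-code domain of the email address
    if email:
        domain = email.rsplit(".", 1)[-1].lower()
        return COUNTRY_DOMAINS.get(domain)
    return None


def distinct_countries(records):
    """records: iterable of (affiliation, email) pairs; returns a sorted distinct list."""
    found = {country_of(aff, mail) for aff, mail in records}
    return sorted(c for c in found if c is not None)
```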

3) HEADING LEVEL 1
Identifying heading levels is a challenging task. Different complex models proposed in the literature efficiently identify the table of contents (TOC). To extract the table of contents of a book, heuristics based on TOC identification methods, which use the information present in the TOC section, are employed. However, such approaches are not suitable for research articles. In the previous section, level 1 headings were annotated along with level 2 and level 3 headings, and the output was based on the classification model. However, the ESWC challenge task is only to identify level 1 headings; therefore, no further processing is done on the output of the previous stage, and the extracted headings are stored in ascending order.

4) TABLE AND FIGURE CAPTION
In the previous section, the classification model identifies the start point of a table or figure caption, before the sequence number. At this stage, the remaining text chunk is analyzed. The process starts from the sequence number of the table or figure and stops when the next text line has a different line spacing, so that multiple lines and differing text properties do not break up the complete caption sentence. The system then stores the table and figure captions in ascending order of sequence number.
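The line-spacing rule for completing a caption can be sketched as below. This is a simplified illustration: it assumes that each text line is available as a (text, vertical position) pair in reading order and that the expected caption line spacing `leading` is known, neither of which is specified in this form in the text; the function name and tolerance are also hypothetical.

```python
def assemble_caption(lines, start, leading, tolerance=0.5):
    """Join caption lines until the line spacing changes.

    lines:     list of (text, y) pairs in reading order (assumed representation)
    start:     index of the line where the classifier marked the caption start
    leading:   expected distance between consecutive caption lines (assumed known)
    tolerance: allowed deviation from `leading` before the caption is closed
    """
    parts = [lines[start][0]]
    for (_, prev_y), (text, y) in zip(lines[start:], lines[start + 1:]):
        if abs((y - prev_y) - leading) > tolerance:
            break  # different line spacing: the caption has ended
        parts.append(text)
    return " ".join(parts)
```

Only the spacing is tested, so a change of font inside the caption (for instance a bold caption number) does not cut the caption short.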

5) SUPPLEMENTARY MATERIAL
The identification of supplementary material is part of the ESWC challenge, and this information is present in the footnotes. The textual properties of footnotes differ from those of the main body text, and each footnote starts with a numeric or symbolic footnote identifier. The supplementary material is given in the form of URLs. We converted all footnote text into a single text block and then applied a URL parser based on regular expressions (see footnote 15), which extracts the complete URL from the descriptive part.
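A minimal sketch of the footnote URL extraction, using Python's `re` module, might look as follows. The URL pattern and the function name `extract_urls` are illustrative assumptions; in particular, the pattern does not repair URLs that are hyphenated or wrapped across lines.

```python
import re

# Illustrative URL pattern: a scheme or "www." followed by non-space characters.
_URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def extract_urls(footnote_lines):
    """Merge footnote lines into one block and return the distinct URLs in order."""
    block = " ".join(line.strip() for line in footnote_lines)
    urls = []
    for candidate in _URL_RE.findall(block):
        url = candidate.rstrip(".,;:)]}")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls
```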

6) FUNDING AGENCY AND FUNDED PROJECTS
The acknowledgments section contains the funding agency and funded project information. We used a task-specific knowledge-based approach to identify the funding agency name and the funded project name. The training dataset (TD) is analyzed, and a regular expression is developed to extract the funding agency by locating a keyword that starts with 'contri', 'support', 'fund' or 'grant', followed by 'from' or 'by'; the expression ends at brackets, quotation marks or punctuation marks. The parser recognizes the funding agency name along with its acronym and finally removes the prepositions and punctuation around the funding agency information.
The final metadata element extracted by our system is the list of funded projects. After manual analysis of the content, we observed that this information is also available in the acknowledgments section and, if present, is placed after the funding agency name. The funded project name is placed between or after the keywords "the" and "project", as in "by the EU FP7-ICT-2011-8 project". The regular expression finally removes the keywords and parses the surrounding content.
15 https://docs.python.org/2/library/re.html
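The two heuristics for the acknowledgments section can be approximated with regular expressions as in the sketch below. The patterns are a simplified reading of the rules stated above, not the expressions developed from the training dataset, and the function name `extract_funding` is hypothetical.

```python
import re

# Keyword stem, then "from"/"by", then the agency name up to a bracket,
# quotation mark or punctuation mark; an acronym in parentheses is optional.
_AGENCY_RE = re.compile(
    r"\b(?:contri|support|fund|grant)\w*\b[^.;]*?\b(?:from|by)\s+(?:the\s+)?"
    r"(?P<agency>[^.,;()\[\]\"']+)"
    r"(?:\s*\((?P<acronym>[A-Za-z]{2,10})\))?",
    re.IGNORECASE,
)
# Project name placed between the keywords "the" and "project".
_PROJECT_RE = re.compile(
    r"\bthe\s+(?P<project>[^.,;()]+?)\s+project\b", re.IGNORECASE
)


def extract_funding(acknowledgment):
    """Return funding agencies (name, acronym) and funded project names."""
    agencies = [
        (m.group("agency").strip(), m.group("acronym"))
        for m in _AGENCY_RE.finditer(acknowledgment)
    ]
    projects = [
        m.group("project").strip() for m in _PROJECT_RE.finditer(acknowledgment)
    ]
    return {"agencies": agencies, "projects": projects}
```

Real acknowledgments vary far more than this pattern allows (several agencies in one clause, grant numbers, nested brackets), which is why the expressions in practice have to be derived from the training data.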

IV. RESULTS
To evaluate the results, standard evaluation measures such as recall, precision, and F-measure are commonly employed. These measures are based on the classification outcomes known as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Recall (sensitivity) is the fraction of relevant results that the model retrieves. Precision is the fraction of retrieved results that are relevant. F-measure is the harmonic mean of recall and precision.
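These measures follow directly from the confusion matrix counts. The short sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # TP / (TP + FP)
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # TP / (TP + FN)
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note that an F-measure averaged over several metadata elements generally differs from the harmonic mean of the averaged precision and recall.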
We have evaluated different classification-based machine learning algorithms on the given training dataset (TD). We have performed comprehensive evaluation of each machine learning approach using confusion matrix parameters.

1) EXPERIMENTAL SETUP
The models are trained using k-fold cross-validation, where k = 10 produced the optimum result. To improve the performance and efficiency of the model, we performed feature reduction by first converting categorical values to numeric values, excluding non-convertible values, and then using the chi-squared (χ²) test to select the K best features. We trained and tested the models described in the theoretical evaluation subsection (III-D2). For k-NN, the Euclidean distance performed best, and k = 5 produced the optimum results. We further evaluated the ensemble classifiers bagging, AdaBoost and stacking, using the classification models of the current experiments as input. We followed the guidelines of [45] for the construction of the SVM model and tried different kernel functions: linear, polynomial, Gaussian RBF and sigmoid. To avoid overfitting, we chose C = 1 and γ = 10; with the Gaussian RBF kernel, this configuration produced the best results among all the classification algorithms that we evaluated for our approach.
Table 1 compares the average recall, precision and F-measure of all the classification models on the TD. The results show that support vector classification (SVC) correctly classified more of the relevant structural components. Table 2 details the features that were finally selected for the extraction of LLS components by our selected classification model. The output of this stage is used to evaluate the content present in the related sections and to generate the final metadata. The results also reveal that our generic feature set extraction approach played a pivotal role in correctly identifying the logical layout structure "on the fly".

B. METADATA INFORMATION EXTRACTION
In this section, the results for the metadata extracted from the documents are presented. We evaluated authors and authors' affiliations, country of affiliation, sections (heading level 1), table and figure captions, supplementary material, funding agency and funded project. Recall, precision and F-measure are computed for each element, and the mean value of each measure is calculated for each metadata element. The final model results presented in Table 3 reveal an average recall of 0.877, precision of 0.928 and F-measure of 0.897. Finally, we compared our results with the state of the art on the gold standard [46]. The previous approaches and FLAG-PDFe used the same dataset and were evaluated on the same evaluation parameters. In Figure 6, our approach is compared with the state of the art; the results indicate a significant improvement over previous approaches, with FLAG-PDFe achieving a 16% performance gain over the SemPub2016 winner. Additionally, we evaluated the performance of our proposed framework on the TD consisting of 40 research papers from the SemPub2017 challenge. On the basis of our TD from SemPub2016, we evaluated the results on the 7 parameters that our technique extracts. The results presented in Table 4 reveal consistent performance of the model, with an average recall of 0.833, precision of 0.949 and F-measure of 0.860.

V. CONCLUSION
In this paper, we have proposed a comprehensive framework, "FLAG-PDFe", for the extraction of metadata from PDF-based research documents. The system converts a PDF file into metadata-annotated files using a classification model and heuristics. It extracts text blocks, typography, and geometric information from a raw PDF file and reshapes these features to identify and extract the logical layout structure and metadata of an article. The proposed approach consists of a novel four-stage process. In the first stage, the distinct features present in an individual document are identified to extract the physical layout of the article, such as main text boundaries, column style, and reading order, and further pre-processing is done to segregate paragraphs and floating objects. The second stage develops the generic features using physical layout, typographic and geometric information, which can be mapped onto diversified publishing styles. In the third stage, we evaluated different machine learning methods and generic features to extract the logical layout structure (LLS); the experiments reveal that the support vector classification (SVC) algorithm performed best with the proposed generic set of features. Finally, the logical layout structure is further analyzed to extract the desired metadata based on a knowledge base and heuristics. The system outperformed previous approaches when evaluated on the gold standard (CEUR dataset).
Our study established that each research article has its own distinct physical layout properties, even though it follows the formatting guidelines of the publishing conference or journal. Publishers use layout and formatting styles to differentiate the logical layout structure (LLS) components; therefore, a generic set of features can be developed to identify logical layout components or sections for diversified publishers. The proposed approach develops both distinct and generic features that are used by the classification algorithm on the fly to recognize varying publishing styles. In the future, we intend to extend stage four by extracting metadata such as subsections, bibliography, and publishing information, by employing novel natural language processing (NLP) algorithms, and by evaluating on additional editors.