A Hybrid Approach and Unified Framework for Bibliographic Reference Extraction

Publications are an integral part of a scientific community. Bibliographic reference extraction from scientific publications is a challenging task due to the diversity of referencing styles and document layouts. Existing methods perform well on one dataset; however, applying them to a different dataset proves challenging. A generic solution is therefore needed that overcomes the limitations of previous approaches. The contribution of this paper is three-fold. First, it presents a novel approach called DeepBiRD, which is inspired by human visual perception and exploits layout features to identify individual references in a scientific publication. Second, we release a large dataset for image-based reference detection with 2401 scans containing 38863 references, all manually annotated at the level of individual references. Third, we present a unified and highly configurable end-to-end automatic bibliographic reference extraction framework called BRExSys, which employs DeepBiRD along with state-of-the-art text-based models to detect and visualize references from a bibliographic document. The proposed approach first pre-processes the input image to obtain a hybrid representation by applying different computer vision techniques. It then performs layout-driven reference detection using Mask R-CNN on the given scientific publication. DeepBiRD was evaluated on two different datasets to demonstrate the generalization of the approach. The proposed system achieved an AP50 of 98.56% on our dataset. DeepBiRD also significantly outperformed the current state-of-the-art approach on its own dataset. These results suggest that DeepBiRD is superior in performance, generalizes well, and is independent of any domain or referencing style.


I. INTRODUCTION
There has been a rapid increase in research activity in every field since the start of the 21st century, which has in turn increased the volume of scientific literature exponentially [1], [2]. Each scientific publication consists of several components, e.g., header, abstract, sections, and references. Bibliographic references play a vital role in every publication, as they serve as a citation network and form the foundation of the information provided in the publication.
Bibliographic references are of particular interest to library communities [3]. They play a key role in compiling library catalogs. These catalogs contain information regarding all bibliographic items like books, journals, conference proceedings, magazines, and other media present in a library. For such a purpose, it is not feasible to manually find and index such a huge volume of references.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
Resource Discovery Systems appear to be a viable solution for libraries to further expand their horizon by providing indexed data available from external resources [3]. Some of these resources, e.g., Web of Science and Scopus, are commercial and charge for access to their collected data. According to a scientometric study [4], both Scopus and Web of Science mostly cover English journal articles from the Biomedical and Social Science domains and thus have low overall coverage of journals and articles from other domains and languages. This makes Resource Discovery Systems a sub-optimal solution for bibliographic cataloging.
The majority of the related work on bibliographic reference detection consists of text-based solutions, which make use of textual features in a document, such as author names and publication titles, to detect references. Text-based approaches use a set of carefully crafted heuristics and regular expressions [5], [6] based on the position of the constituents of a reference string, e.g., author names, affiliations, publisher, journal/book/conference name, and year of publishing. These heuristics and regular expressions are very sensitive to the textual features: if they do not anticipate a number or a special character such as a bracket at the start of a reference string, such references are missed altogether. Text-based approaches are also very sensitive to the layout of the references. For example, if all lines of the reference strings start from the same point, text-based approaches find it very hard to identify the starting and ending boundaries of a reference string, and thus either detect multiple references as one or split one reference into multiple references. Once such cases appear, carefully crafted heuristics become obsolete, which makes text-based approaches less robust and ultimately not generalizable. For instance, the most common referencing styles like MLA and APA have author names and publication titles as the starting features of a reference string. On the other hand, there exist some rare bibliography styles in the social sciences, such as Alpha or hybrid Chicago, in which reference strings start with a reference identifier or the publication year, respectively. These styles are problematic for text-based approaches because such references do not comply with the common traditional referencing styles and are rarely used. A comparison of results from text-based and layout-based approaches on different referencing styles is shown in Fig 1.
It can be observed that the text-based approach was unable to detect an unusual reference string, as it relies entirely on textual features while overlooking other important facets such as layout features.
This paper introduces an automatic, effective, and generalized approach for reference detection from document images. It works equally well for scanned and born-digital PDF documents. Our approach is inspired by the way human beings perceive and identify objects. To understand this phenomenon, consider an illegible, blurred document containing some text. Although the document is unreadable, we can still identify paragraphs, bullet points, and, similarly, references. The underlying idea is that layout information is the key to identifying different textual structures in a document, even without using textual features. For this task, we employed Convolutional Neural Networks (CNN) for representation learning based on the layout of a document. This removes the dependency on the textual heuristics used in the majority of existing systems. Our approach is generic and is thus applicable to any bibliographic publication, independent of its domain or referencing style. We also release a benchmark dataset for bibliographic reference detection from document images. We performed benchmark tasks to evaluate the performance of DeepBiRD with respect to different aspects, e.g., generalization and robustness.
In this paper, we also present a comprehensive framework called BRExSys, which encapsulates all state-of-the-art bibliographic reference detection methods under a single umbrella, allowing users to use any of the existing or proposed methods in one place. BRExSys supports scientific publications in a number of file formats, e.g., born-digital PDF, scanned PDF/images, HTML, and XML. BRExSys provides a user-friendly interface to facilitate the smooth processing of the input file and the visualization of the processed output. BRExSys is highly customizable, as it can be tailored to the user's requirements.
The contributions of this publication are as follows:
• We present a novel layout-driven approach for automatic reference detection from scientific publications, which effectively exploits visual cues to identify bibliographic references in a given scientific publication.
• We release a new & larger dataset for image-based reference detection which will be publicly available for the community.
• We demonstrate the superiority of the proposed approach by carrying out a series of comparative performance evaluations against the previous state-of-the-art approach.
• We present an automatic bibliographic reference detection framework called BRExSys, which integrates DeepBiRD and other state-of-the-art text-based reference detection models to take advantage of different modalities for the task at hand.
217232 VOLUME 8, 2020
The rest of the paper is structured as follows: Section II discusses the relevant work done so far on the problem of reference detection. Section III discusses the details of the datasets used in this work. Section IV presents the architecture and pipeline of our proposed approach. Section V discusses the different experimental setups used for this publication, along with analyses of the results obtained from the experiments, to demonstrate the effectiveness of our approach. Section VI discusses the details of our proposed bibliographic reference extraction system called BRExSys. Lastly, Section VII concludes the paper.

II. RELATED WORK
A lot of work has been done in the field of reference detection. Bibliographic reference detection is generally performed by one of two kinds of methods: text-based and layout-based. Most approaches are based on the analysis of textual content to identify references, and each text-based approach employs its own techniques. Here we discuss such techniques used for bibliographic reference detection, starting from the simplest and moving towards more sophisticated ones.

A. TEXT-BASED APPROACHES
The simplest text-based reference detection techniques employ regular expressions and carefully crafted heuristics [5] for this task. Such approaches are mostly not considered an optimal solution because of their limited coverage. For example, MLA and APA are the most common referencing styles, in which a reference string starts with author names. To detect such references, the adopted heuristics look for comma-separated author names at the start of the reference string. The drawback of such an approach is that it is unable to detect a reference that does not comply with the defined heuristics, e.g., a reference string in Alpha style, which starts with a custom ID. Every domain has its own referencing style, and sometimes there are multiple referencing styles within one domain. Such challenges make simple approaches unsuitable for this complex task.
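The limitation of author-first heuristics can be illustrated with a minimal sketch. The regular expression, the function name, and the two sample strings below are assumptions made purely for illustration; they are not taken from any of the cited tools.

```python
import re

# Hypothetical heuristic: a reference starts with "Surname, I." style author names.
AUTHOR_FIRST = re.compile(r"^[A-Z][A-Za-z'\-]+,\s+(?:[A-Z]\.\s*)+")


def looks_like_reference_start(line: str) -> bool:
    """Return True if the line begins with comma-separated author names."""
    return AUTHOR_FIRST.match(line) is not None


# Invented sample strings in an author-first style and an ID-first (Alpha-like) style.
apa_style = "Smith, J. A., & Doe, R. (2010). A study of citations. Journal of Examples, 4(2), 1-10."
alpha_style = "[SD10] J. A. Smith and R. Doe. A study of citations. Journal of Examples, 2010."

print(looks_like_reference_start(apa_style))    # the author-first string is matched
print(looks_like_reference_start(alpha_style))  # the ID-first string is missed
```

Extending the pattern to cover the second string is possible, but every additional style requires another hand-written rule, which is the coverage problem described above.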
Citation-Parser [5] is a typical example of a heuristics-based tool. To identify the components of a bibliographic reference string, e.g., authors, title, and conference/journal, it employs a set of carefully designed heuristics. Sautter and Böhm [6] proposed a tool named RefParse, which exploits similarities between individual reference strings to identify different referencing styles for parsing a reference string. Perl also provides an extension named Biblio [7] for parsing and extracting reference string metadata. Chen et al. [8] proposed BibPro, an approach that identifies the citation style by matching it against the referencing styles available in its database and then uses a gene sequence alignment technique to identify the components of reference strings. AnyStyle-Parser [9] is another example of a tool that identifies bibliographic references using heuristics. PDFSSA4MET [10] takes a slightly different approach to identify references in a born-digital PDF. In this approach, the textual PDF is first converted into an XML file. Then, by employing pattern matching mechanisms, a syntactic and structural analysis of the XML is performed to identify the reference section.
Ahmed and Afzal [11] used a diverse range of features, i.e., font type, neighbor distance, text location, font typography, and lexical properties, to identify the components of a scientific publication and later extract metadata like author names, affiliations, emails, headings, etc. Boukhers et al. [12] proposed an approach in which all text lines are individually classified using a pre-trained random forest model that assigns each line a probability of being a reference line; format, lexical, semantic, and shape features are then used to identify and segment reference strings.
Lafferty et al. [13] proposed an advanced approach known as Conditional Random Fields (CRF). CRF is a probabilistic approach for labeling sequence data like reference strings. This labeling includes identifying different parts of a reference string i.e. authors, publication title, year, conference/journal name, etc. Such labeling assists in recognizing a reference string based on its labeled components.
Tkaczyk et al. [14] proposed "Content ExtRactor and MINEr (CERMINE)", a CRF-based system for extracting and mining bibliographic metadata from references in born-digital PDF scientific articles. Free-cite [15] computes features from a tokenized citation string and then classifies the token sequence using a trained CRF. Science Parse [16] is a CRF-based tool to identify and extract the metadata of references from a document. Matsuoka et al. [17] demonstrated that using lexical features with a CRF increases accuracy. Councill et al. [18] presented a CRF-based package called "ParsCit" for the reference metadata tagging problem, in which the reference strings are identified from plain text based on fine-grained heuristics. The authors of [18] claim ParsCit to be one of the best known and most widely used open-source systems based on heuristics and CRF for reference detection, string parsing, and metadata tagging. Tkaczyk et al. [19] also proposed a reference metadata recommender system which provides the 10 most popular open-source citation parser tools in one system. The selected tools were a mixture of simple heuristics-based and machine learning-based solutions.
Nowadays, artificial neural networks are the most popular choice as a solution to most scientific problems. Similarly, some literature also explored the potential of neural networks for the task of bibliographic reference detection and parsing.
Zou et al. [20] proposed a two-step approach to locate and parse bibliographic references in HTML medical articles. In the first step, individual references are located using machine learning approaches, whereas in the second step, metadata is extracted from each reference by employing CRFs.
Contrary to the traditional approaches for reference tagging, Prasad et al. [21] proposed a bibliographic reference string parser named "Neural-ParsCit" based on deep neural networks. The authors tried to capture long-range dependencies in reference strings using a Long Short-Term Memory (LSTM) [22] based architecture. Lopez [23] proposed a tool named "Grobid", based on conditional random fields, for the detection and extraction of publication headers, bibliographic references, and their respective metadata. The Grobid model was trained on multi-domain, manually annotated data containing 6835 instances. Recently, Grennan and Beel [24] performed experiments to train a CRF-based solution [23] on human-annotated citation parsing data and on synthetic data, and suggested that the model trained on the synthetic dataset performed very similarly to the model trained on the original data.
Text-based approaches are not directly applicable to document images. To identify references from scanned documents, the text must be extracted from a given document by performing Optical Character Recognition (OCR) and then applying the selected approach to the extracted text. The disadvantage of this approach is the potential introduction of OCR error which will eventually contribute to detection error thus making the task unnecessarily complicated.

B. LAYOUT-BASED APPROACHES
The literature discussed so far relies only on textual features to identify references. Text-based approaches do not take advantage of layout features thus abandoning an important aspect. There are very few approaches that explored the potential of exploiting layout information for detecting bibliographic references.
Bhardwaj et al. [26] used layout information to detect references from a scanned document. For that purpose, a Fully Convolutional Neural Network (FCN) [27] was used to segment the references, and its output was later post-processed to identify individual references. To the best of our knowledge, it is currently the state of the art for the image-based reference detection task. The authors also released a small dataset [25] for image-based reference detection. In this paper, this dataset is referred to as the BibX dataset. Lauscher et al. [3] used this layout-based reference detection in their system [28] to build an open database of citations for a library indexing use case. Recently, Rizvi et al. [29] gauged the performance of four state-of-the-art object detection models using layout information to detect bibliographic references in a scientific publication.

III. DATASETS

A. BibX DATASET
This section provides insights about the dataset used for training and baseline performance comparison of DeepBIBX [26] and our proposed approach DeepBiRD for the task of layout-based reference detection. To the best of the authors' knowledge, it is the only image-based dataset that contains annotations of references. The BibX dataset consists of 455 document images from several social sciences books and journals, containing 429 and 25 document image samples from single and double column layouts respectively. The dataset is divided into train, validation, and test sets with 287, 25, and 143 samples respectively. Distribution details of the BibX dataset are given in Table 1. Furthermore, considering the limited size of this dataset, we propose a new dataset called the BibLy dataset. Details of this new dataset are discussed in the following section.

B. BibLy DATASET
In this paper, we are releasing a dataset named BibLy [30] for image-based reference detection. This dataset has been curated from the reference sections of various journals, monographs, articles, and books from the social sciences domain. The resolution of the images varies from 1500 to 4500 pixels for the larger side of the image. Image quality is maintained at a minimum of 300 dpi. All images were manually annotated by drawing a box around every single reference.
There are 2,401 scanned document images in the BibLy dataset, containing 38,863 references in total. Document scans were initially divided into three groups based on the number of columns, i.e., single, double, and triple columns. Table 2 shows the distribution of samples in the layout groups. These groups were further distributed into train, validation, and test sets with balanced representation from each group. The corresponding numbers of references are also shown in Table 2. The dataset is available at the following link: https://madata.bib.uni-mannheim.de/283/.

IV. DeepBiRD: PROPOSED APPROACH
In our proposed approach, we exploit layout information to detect references from a given document image. Firstly, we pre-process the input document image and incorporate more layout information to facilitate bibliographic reference detection. Later, references are detected from each pre-processed image. Fig 3 depicts the complete pipeline of our proposed system. Details of our pipeline are discussed as follows.

A. PRE-PROCESSING
The first stage in our pipeline is pre-processing followed by reference detection. In order to highlight layout features, we obtained a hybrid representation, which highlights the important content and helps the automatic representation learning approach to extract discriminative features. This hybrid representation is achieved by applying different transformations to the input image. The pre-processing stage involves a series of steps, which are elaborated as follows:

1) DISTANCE TRANSFORM
The distance transform provides the distance between each pixel and the nearest input foreground pixel. This way, we can highlight the separation between characters, words, and lines, which later helps to identify and separate individual references.
To apply the distance transform, we first read the image as a grayscale image and invert it, thereby switching bright pixels with dark pixels and vice versa. Then we binarize the image using Otsu thresholding, followed by an inversion of all pixel values. The distance transform is then applied to the resulting binarized, inverted image. We tried several distance types in different experiments and selected the Euclidean distance with a 3 × 3 mask as the most suitable distance measure. An example of the Euclidean distance transform is shown in Fig 4b.
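The following minimal sketch illustrates the idea in plain Python on a toy image. It is not the implementation used in our pipeline. It assumes that the net effect of the steps above is that text pixels receive a distance of zero and every other pixel receives its distance to the nearest text pixel. The two-pass chamfer scheme with weights 1 and √2 is one common way to approximate the Euclidean distance with a 3 × 3 mask; these weights and all function names are assumptions, and library implementations may use differently tuned weights.

```python
import math


def otsu_threshold(gray):
    """Return the Otsu threshold of a 2-D list of 8-bit gray values."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue
        w_bright = total - w_dark
        if w_bright == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        # Between-class variance (up to a constant factor).
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def chamfer_distance(is_text, a=1.0, b=math.sqrt(2.0)):
    """Two-pass 3x3 chamfer approximation of the distance to the nearest text pixel."""
    h, w = len(is_text), len(is_text[0])
    d = [[0.0 if is_text[y][x] else float("inf") for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from the top-left neighbours.
    for y in range(h):
        for x in range(w):
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + a)
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + a)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y - 1][x - 1] + b)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y - 1][x + 1] + b)
    # Backward pass: propagate distances from the bottom-right neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + a)
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + a)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y + 1][x + 1] + b)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y + 1][x - 1] + b)
    return d


# Toy 5 x 5 "page": white background with a single dark text pixel in the centre.
gray = [[255] * 5 for _ in range(5)]
gray[2][2] = 0

threshold = otsu_threshold(gray)
is_text = [[v <= threshold for v in row] for row in gray]
dist = chamfer_distance(is_text)
for row in dist:
    print(" ".join(f"{v:4.2f}" for v in row))
```

In the printed map, values grow with the distance from the text pixel, which is the property that makes gaps between words and lines visible to the network.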

2) DILATION
We performed dilation on the input image to highlight text regions along with their surroundings, which helps the neural network to identify lines and their respective extent more precisely. To perform dilation, we first binarize the input image using Otsu thresholding, followed by inversion. Then we dilate the image using a 1 × 5 kernel; this horizontal kernel merges nearby characters on the same line. The motivation for using a 1 × 5 kernel is to preserve line separation while merging the words on the same line, thereby highlighting each line. A sample image is shown in Fig 4a.
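The effect of a horizontal kernel can be sketched on a small binary image in plain Python. This is an illustrative sketch only: the function name and the toy image are assumptions, and the input is assumed to be already binarized and inverted so that text pixels are 1.

```python
def dilate_horizontal(binary, width=5):
    """Dilate a 0/1 image with a 1 x `width` kernel anchored at its centre."""
    half = width // 2
    out = []
    for row in binary:
        n = len(row)
        out.append([
            # A pixel is set if any pixel inside the horizontal window is set.
            1 if any(row[max(0, x - half):min(n, x + half + 1)]) else 0
            for x in range(n)
        ])
    return out


page = [
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # two "characters" separated by a 3-pixel gap
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # blank row between two text lines
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],  # a single "character" on the next line
]
dilated = dilate_horizontal(page)
for row in dilated:
    print(row)
```

Because the kernel is only one pixel high, the gap inside the first row is closed while the blank row between the two lines stays empty, so line separation is preserved.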

3) HYBRID REPRESENTATION
This is the final stage of the pre-processing phase, in which we merge the representations obtained from dilation and distance transform with the input image. For that purpose, we place the distance transform image, the binarized image, and the dilated image in channels one, two, and three of an image, respectively. The resulting image retains the information of the original image along with additional highlighted text lines and proximity information, all encoded into one image. In the hybrid representation, the blue color represents the proximity of the text and the separation between words and lines, while the red color represents the text lines. This image is later used to identify bibliographic references in a given document image. An example of the final image is shown in Fig 4c.
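The channel merging step can be sketched as a pixel-wise stacking of three maps of equal size. The linear rescaling of each map to the range 0..255 and both function names are assumptions made for this illustration; the text above only fixes the channel order.

```python
def to_uint8(channel):
    """Linearly rescale a 2-D list of non-negative numbers to the range 0..255."""
    peak = max(max(row) for row in channel)
    if peak == 0:
        return [[0 for _ in row] for row in channel]
    return [[round(255 * v / peak) for v in row] for row in channel]


def hybrid_representation(distance, binarized, dilated):
    """Stack the maps pixel-wise: channel 1 = distance, 2 = binarized, 3 = dilated."""
    channels = [to_uint8(distance), to_uint8(binarized), to_uint8(dilated)]
    h, w = len(distance), len(distance[0])
    return [[tuple(c[y][x] for c in channels) for x in range(w)] for y in range(h)]


# Toy 1 x 3 inputs: one text pixel on the left and background to the right.
distance = [[0.0, 1.0, 5.0]]
binarized = [[1, 0, 0]]
dilated = [[1, 1, 0]]

hybrid = hybrid_representation(distance, binarized, dilated)
print(hybrid[0])  # one (channel 1, channel 2, channel 3) triple per pixel
```

Each pixel of the result carries proximity, text, and line information at once, so a standard three-channel network input can be used without architectural changes.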

B. REFERENCE DETECTION MODEL
This section provides insights into the architecture design of DeepBiRD.

1) ARCHITECTURE
For the reference detection task, the close proximity of references calls for a network that can separate references from each other in addition to detecting them. For that purpose, we employed a deep neural network-based architecture known as Mask R-CNN [31]. It is one of the most popular networks for object detection and instance segmentation. Reference detection using layout features is challenging, as references are only a few pixels apart, so the detection task requires pixel-level precision. In contrast to Faster R-CNN [32], Mask R-CNN [31] is equipped with ROIAlign, which is a non-quantized operation and therefore preserves the data. This results in detections that are more accurate at the pixel level, and it was one of the main reasons for employing Mask R-CNN [31] for the task at hand.
In our experiments, we used the standard ResNet-50 [33] backbone for feature extraction. Table 3 shows the details of ResNet-50 [33]. Following the implementation in [34], we used the original parameters of [31] to train the network with batch normalization enabled. We used the pre-trained ResNet-50 [33] model to initialize the network. It was then fine-tuned, using transfer learning, on the train set of the BibX dataset with 287 images containing 5741 references. For fine-tuning, we froze the first two blocks, conv1 and conv2_x, while leaving all the remaining blocks trainable. Figure 5 shows samples of feature maps from different feature extraction layers in the architecture.

2) PARAMETERS
The network was trained for 50 epochs with a base learning rate of 0.001. The learning rate was decreased in steps by a factor of 0.0001 at 12, 25, and 37 epochs respectively. In all experiments, the number of images per batch was set to 1.

3) INFERENCE
Performing inference on the input image yields the coordinates of each detected reference's box along with a confidence score. The confidence score ranges between 0 and 1, where 0 is the lowest and 1 the highest. It represents the extent to which the network is sure about that specific detection. Each detection in the results represents a reference. Once the detection results are ready, OCR is performed on each detected reference, thus extracting all references from an input image.
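How detections might be prepared for the OCR step can be sketched as follows. The `Detection` type, the `select_references` helper, the score cut-off of 0.5, and the top-to-bottom ordering for a single-column page are all hypothetical choices made for illustration; no specific cut-off is stated in the text above.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    score: float  # confidence in [0, 1]


def select_references(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep confident detections and order them top to bottom, then left to right."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: (d.y1, d.x1))


raw = [
    Detection(50, 400, 950, 470, 0.98),
    Detection(50, 100, 950, 180, 0.99),
    Detection(50, 250, 950, 330, 0.31),  # low-confidence detection, dropped
]
selected = select_references(raw)
for det in selected:
    print(det)  # each kept box would then be cropped and passed to an OCR engine
```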

V. EXPERIMENTS AND RESULTS
To evaluate our system, we performed various experiments using the DeepBIBX [26] model and multiple settings of DeepBiRD on two publicly available datasets. These datasets are our own BibLy [30] dataset and the one proposed by DeepBIBX [26], here referred to as the BibX dataset [25]. Due to the limited number of samples in BibX, its authors augmented the whole dataset and then resized every image in the train, validation, and test sets. In this section, we elaborate on the results of the experiments performed for evaluation.

A. EVALUATION
In this section, we discuss the evaluation results of all the experiments performed to compare the DeepBiRD model with DeepBIBX in different settings. The purpose of the first experiment was to validate the effectiveness of our approach on the BibX dataset and compare its performance with DeepBIBX [26]. In this experiment, we trained DeepBiRD on the BibX dataset with the aforementioned parameters. We also trained a Fully Convolutional Network (FCN) [27] on the non-augmented BibX dataset with exactly the same settings as mentioned in the original DeepBIBX paper [26]. The only difference is that we used the non-augmented dataset so that the results are directly comparable with our approach. Additionally, we resized the training, validation, and test set images and blurred their lines as described in [26]. It is worth mentioning, however, that using the non-augmented dataset leads to evaluation results that differ from those reported for DeepBIBX [26]. Once training finished, both models were evaluated on the non-augmented test set of the BibX dataset. This enabled us to directly compare the performance of our approach with the DeepBIBX [26] approach on the BibX dataset.
Each detection was validated using its Intersection over Union (IoU) with the ground truth annotations. Both models were evaluated at different IoU thresholds ranging from 0.50 to 0.95, which is standard for object detection problems. A detection is considered correct if the IoU of its bounding box with the ground truth is greater than the IoU threshold. Table 4 shows a comparison of DeepBiRD results with DeepBIBX [26] for experiment 1. The results show that DeepBiRD achieved an average precision and average recall of 76.52% and 80.40%, respectively. DeepBIBX, on the other hand, only achieved an average precision and average recall of 32.51% and 23.28%, respectively. Even at the lowest IoU threshold of 0.50, DeepBiRD performed significantly better, by more than a factor of 2.
The reason behind the strong performance of DeepBiRD is that it is based on Mask R-CNN [31], which performs semantic segmentation on shortlisted ROIs, whereas the FCN performs semantic segmentation on the complete image.
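The IoU criterion used for validation can be made concrete with a short sketch. It computes the IoU of axis-aligned boxes given as (x1, y1, x2, y2) and a simplified precision and recall at a single threshold using greedy one-to-one matching. This is an assumption-laden simplification for illustration, not the evaluation code behind Table 4: the reported average precision additionally aggregates over confidence levels, and the boxes below are invented.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, threshold):
    """Greedy one-to-one matching; detections are assumed sorted by confidence."""
    unmatched = list(ground_truth)
    true_positives = 0
    for det in detections:
        best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
        # A detection counts as correct if its IoU exceeds the threshold.
        if best is not None and iou(det, best) > threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Thresholds 0.50, 0.55, ..., 0.95 as used in the evaluation protocol.
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

gt_boxes = [(0, 0, 100, 20), (0, 30, 100, 50)]
det_boxes = [(0, 0, 100, 20), (0, 32, 100, 52)]  # second box is shifted by 2 pixels

for t in (0.50, 0.90):
    print(t, precision_recall(det_boxes, gt_boxes, t))
```

The shifted box has an IoU of about 0.82 with its ground truth, so it counts as correct at a threshold of 0.50 but not at 0.90. This shows why closely spaced references demand precise box boundaries at the stricter thresholds.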

2) EXPERIMENT 2: ROBUSTNESS
The purpose of this experiment was to assess the robustness of both DeepBiRD and DeepBIBX [26]. To do so, we evaluated both systems on further unseen data, i.e., the test set of another dataset. The DeepBiRD and DeepBIBX models trained in Experiment 1 were reused in this experiment. Both models were thus trained on the non-augmented BibX dataset and evaluated on the test set of the BibLy dataset. The results of this experiment show how effective DeepBIBX [26] and DeepBiRD are on unseen data. Both models were evaluated at IoU thresholds ranging from 0.50 to 0.95. Table 4 shows the evaluation results of the DeepBIBX [26] model on the BibLy dataset. The results show that the performance of both models decreased slightly, as expected, when they were applied to unseen data. DeepBiRD achieved an average precision and average recall of 64.53% and 70.50%, respectively. DeepBIBX, on the other hand, achieved an average precision and average recall of 29.03% and 21.44%, respectively. DeepBiRD therefore outperformed DeepBIBX by a significant margin, similar to the results of experiment 1.

3) EXPERIMENT 3: GENERALIZATION
The purpose of this experiment was to verify DeepBiRD for generalization by employing transfer learning to adapt the network to the BibLy dataset. In this experiment, the pre-trained DeepBiRD model on BibX dataset was used as a baseline and was then fine-tuned on the train set of BibLy dataset to learn more reference examples. Once the training was finished, the final model was evaluated against the baseline model on the test set of BibLy dataset.
The results of this experiment are shown in Table 4. The fine-tuned model achieved an average precision and average recall of 83.40% and 86.60%, respectively, which is 18.87 and 16.10 percentage points higher than the results before fine-tuning. In particular, precision at IoU thresholds of 0.50 and 0.75 increased by 9.16% and 20.19%, respectively. This indicates that the model improved significantly after fine-tuning and was able to detect bibliographic references with a higher overlap. From these results, we can infer that DeepBiRD generalizes well, as it adapts very well to new data.

B. ABLATION STUDY FOR INPUT REPRESENTATION
This section discusses the results of the ablation study to show the impact of hybrid representation used in DeepBiRD. The purpose of this analysis was to determine the effectiveness of individual components in the pre-processing phase. For this analysis, we designed several experiments with different pre-processing configurations. In the first experiment, we employed the aforementioned pre-processing steps i.e. distance transform, dilation, and merging them with the original input image. In the second experiment, we employed dilation as a sole step in the pre-processing phase. Lastly in the third experiment, we excluded the pre-processing phase and used the original input image without any pre-processing. These experiments will highlight the contribution of individual components in the pre-processing phase towards the final output.
We used the BibX dataset to perform this ablation study. Table 5 shows the results of the different representation types employed in the various experiments. The experiment that employed both dilation and distance transform along with channel merging sets the baseline, with an average precision and average recall of 76.52% and 80.40%, respectively. In the second experiment, pre-processing consisted only of dilation of the input image. This resulted in a decrease of 0.6% in both average precision and average recall, suggesting that the distance transform helped the proposed system to detect bibliographic references in a given document image. In the third experiment, the pre-processing phase was removed altogether and the original input image was fed to the system. This resulted in a further decrease in average precision and average recall of 0.63% and 0.40%, respectively, suggesting that dilation also contributed to improving system performance.
To verify this trend, we performed the same ablation study on a second dataset, the BibLy dataset. Table 5 shows the results of this analysis. The first experiment, with dilation and distance transform as part of pre-processing, sets the baseline with an average precision and average recall of 83.40% and 86.60%, respectively. In the second experiment, with pre-processing involving only dilation of the input image, the average precision and average recall decreased by 0.16% and 0.10%, respectively. In the third experiment, with no pre-processing, the average precision and average recall decreased further by 0.08% and 0.20%, respectively. The trend is thus consistent across both datasets: both dilation and distance transform contribute to improving the performance of the system.

C. OVERALL DISCUSSION
In this paper, we presented a new layout-driven reference detection approach called ''DeepBiRD'' which exploits human intuition and visual cues to effectively detect references without taking textual features into account. Through meticulous experimentation, we pushed the boundaries of automatic reference detection and set a new state-of-the-art. The evaluation results clearly show that the proposed approach DeepBiRD is effective, robust, and generalized, outperforming DeepBIBX [26] by significant margins. For the sake of completeness, we compared the performance of DeepBiRD with some other object detection models presented in the existing literature. The authors of [29] performed a benchmark of four object detection models on the BibX dataset. Table 6 shows the performance comparison between [29] and DeepBiRD. It is worth mentioning that we use the results of DeepBiRD without pre-processing to compare the results fairly, as no pre-processing was used in [29]. Results show that our proposed approach DeepBiRD performed on average 13.38% better than the presented object detection models on the BibX dataset. Faster R-CNN [32] was the best performing model in the benchmark selection. However, it was 12.35% worse than our proposed approach DeepBiRD. One of the possible reasons for the superiority of DeepBiRD is its ROIAlign operation, which not only preserves data but also results in precise detection up to pixel level. Fig 16, 17 & 18 show visual examples of the best, average and worst results from our system. Results from DeepBIBX [26] and the text-based model ParsCit for each example are also shown for comparison. All these results demonstrate the dominance of DeepBiRD over the other text-based and layout-based approaches. The trained model from the above-mentioned experiments is available at https://github.com/rtahseen/DeepBiRD.

VI. BRExSys: A BIBLIOGRAPHIC REFERENCE EXTRACTION SYSTEM
This section discusses the details of our proposed framework. Our framework unifies all state-of-the-art bibliographic reference detection methods in one place to detect and extract references from scanned, markup, and textual documents. To take advantage of multiple models to the full extent, we provide various possibilities to use these models individually or in a fusion. The overview of our complete system is shown in Fig 6 and details of the proposed system are discussed as follows:

A. REFERENCE EXTRACTION FROM SCANNED DOCUMENTS
In this section, we will discuss pipelines specific for bibliographic reference extraction from scanned documents. The overview of these pipelines is shown in Fig 7. For scanned documents, we provide two pipelines i.e. Layout-based pipeline and Text-based pipeline. The layout-based pipeline is represented in blue color while the text-based pipeline is represented in green color. A scanned document can also be processed through both pipelines simultaneously and for such cases, results from both pipelines are included in the final output XML. VOLUME 8, 2020

1) LAYOUT-BASED PIPELINE
In the layout-based pipeline, we employed DeepBiRD, a state-of-the-art layout-driven reference extraction model. Given a scanned document, DeepBiRD performs bibliographic reference detection on each individual document image. Lastly, we employed ParsCit [18] to carry out Named Entity Recognition (NER) on each detected reference to identify reference string metadata such as author names, publication title, publication year, etc. All the results are eventually returned in a predefined standard XML file format.

2) TEXT-BASED PIPELINE
In the text-based pipeline, we extract all the text from the given scanned document and use it for text-based reference extraction. For this purpose we employed ParsCit [18], a state-of-the-art text-based reference extraction model. Additionally, ParsCit [18] extracts reference string metadata by performing NER on the extracted bibliographic reference strings.

B. REFERENCE EXTRACTION FROM TEXTUAL DOCUMENTS
This section discusses reference extraction pipelines from text documents like born-digital PDFs and plain text files. The pipeline overview for bibliographic reference extraction from textual documents is shown in Fig 8. We provide three pipelines for extracting references and their respective metadata from a given textual document. Each of these pipelines is discussed as follows:

1) TEXT-BASED PIPELINE (GROBID)
In this pipeline, we employed Grobid [23]. It takes a born-digital PDF as input and extracts bibliographic references along with their metadata from the given PDF document. The extracted data is returned in a predefined XML format. Grobid does not depend on other tools for text extraction from a given PDF document, and therefore avoids introducing potential text extraction errors.

2) TEXT-BASED PIPELINE (ParsCit)
In this pipeline, we first extract the text from a given textual document, which is then processed for bibliographic reference and metadata extraction. For this workflow we employed ParsCit [18]. It takes raw text as input and extracts references along with their metadata from the given text.

3) LAYOUT-BASED PIPELINE
The third workflow serves as an alternative solution, in which a born-digital PDF is processed as a scanned document using layout-driven reference extraction. In this workflow, we employed the state-of-the-art layout-based reference detection and extraction approach DeepBiRD. Details of this pipeline are already discussed in the section on reference extraction from scanned documents.

C. REFERENCE EXTRACTION FROM MARKUP DOCUMENTS
In this section, we will discuss bibliographic reference extraction from markup documents like HTML and XML. Markup documents usually consist of a peculiar hierarchy. Depending on a known topology of a given markup document we provide multiple workflows for extracting bibliographic references and their metadata from markup documents. The overview of the pipelines is shown in Fig 9, where each pipeline handling a specific case is represented in a different color.

1) DIRECT MAPPING PIPELINE
The direct mapping pipeline deals with the case where we are fully aware of the tag hierarchy in the markup document, e.g. an XML or HTML document from Zotero [38]. In such cases we perform tag-based reference extraction by targeting relevant tags like author name, title, publisher, etc., thus extracting all references from the markup document along with their metadata.
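Tag-based extraction of this kind can be sketched with Python's standard `xml.etree.ElementTree` module. The element names (`references`, `reference`, `author`, `title`, `publisher`) and the sample document are hypothetical and do not reproduce the actual Zotero export format or the BRExSys implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with a fully known tag hierarchy.
SAMPLE = """<references>
  <reference>
    <author>A. Author</author>
    <author>B. Author</author>
    <title>An Example Title</title>
    <publisher>Example Press</publisher>
  </reference>
</references>"""


def extract_references(xml_text):
    """Map known tags directly to reference metadata fields."""
    root = ET.fromstring(xml_text)
    references = []
    for ref in root.iter("reference"):
        references.append({
            # A reference may list several authors, so collect all of them.
            "authors": [a.text.strip() for a in ref.findall("author") if a.text],
            # findtext returns None for a missing tag; fall back to "".
            "title": (ref.findtext("title") or "").strip(),
            "publisher": (ref.findtext("publisher") or "").strip(),
        })
    return references


if __name__ == "__main__":
    for reference in extract_references(SAMPLE):
        print(reference)
```

Because the hierarchy is known in advance, no text-based or layout-based model is needed in this case.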

2) TEXT-BASED PIPELINE
The text-based pipeline deals with the case where we have partial knowledge about the tag hierarchy of a given markup document, e.g. an XML or HTML document generated from older versions of Zotero [38]. In such cases, we first extract all the text from the markup document using all known tags and then perform text-based reference extraction on the extracted text. We employ ParsCit [18] for extracting references and their respective metadata from the extracted text.

3) LAYOUT-BASED PIPELINE
The layout-based pipeline deals with the case where a given markup document is in HTML format and has an unknown tag hierarchy. In this case, we convert the HTML document into a PDF document and process it as a scanned PDF, where it is simultaneously processed using the text and layout-based bibliographic reference extraction pipelines.

D. INTERFACES AND OUTPUT
In this section, we will discuss the different interfaces and sample outputs of our proposed system. Our system provides a user-friendly web-based interface, where one interface is for uploading and configuring files for processing while another is responsible for displaying the results from layout-based detection. Additionally, a third interface lists all submitted processing tasks along with their output.

1) INPUT INTERFACE
The input interface of our system is shown in Fig 10. The user can upload files with the extensions PDF, JPG, PNG, TIF, TXT, HTML and XML. In the first step, the user selects the desired file type for processing. Once the file type is selected, all relevant available pipelines are revealed. After selecting the desired processing pipeline, the user is asked to upload the desired file. Additionally, the user can choose whether to add dummy text before the extracted text by enabling the ''Append Dummy Text'' flag. During the evaluation of ParsCit [18] we found that adding dummy text to the start of the references text yields better results. Once all settings are done, the user can trigger the processing phase by pressing the Process File button.
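The selection flow described above can be summarized in a small configuration sketch. The pipeline names follow the preceding sections, but the dictionary layout, the function names, and the placeholder dummy text are illustrative assumptions. The actual interface may restrict pipelines further per file extension, and the dummy text used by BRExSys is not specified in the text.

```python
# Extensions accepted by the input interface, as listed in the text.
ALLOWED_EXTENSIONS = {"pdf", "jpg", "png", "tif", "txt", "html", "xml"}

# Pipelines per document type, following Sections VI-A to VI-C.
PIPELINES = {
    "scanned": ["layout-based (DeepBiRD)", "text-based (ParsCit)"],
    "textual": ["text-based (Grobid)", "text-based (ParsCit)",
                "layout-based (DeepBiRD)"],
    "markup": ["direct mapping", "text-based (ParsCit)", "layout-based"],
}

# Hypothetical placeholder; the real dummy text is not given in the paper.
DUMMY_TEXT = "Placeholder body text.\nReferences"


def is_supported(filename):
    """Check the file extension against the accepted upload types."""
    if "." not in filename:
        return False
    return filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


def pipelines_for(document_type):
    """Return the pipelines revealed once a document type is selected."""
    if document_type not in PIPELINES:
        raise ValueError(f"unknown document type: {document_type!r}")
    return list(PIPELINES[document_type])


def prepare_parscit_input(reference_text, append_dummy_text=False):
    """Optionally put dummy text before the references text for ParsCit."""
    if append_dummy_text:
        return f"{DUMMY_TEXT}\n{reference_text}"
    return reference_text
```

For example, `pipelines_for("scanned")` lists the two pipelines available for scanned documents, and `prepare_parscit_input(text, append_dummy_text=True)` mirrors enabling the ''Append Dummy Text'' flag.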

2) OUTPUT VISUALIZING INTERFACE
The output visualizing interface is another important interface of our system, where we can visualize the results of all documents processed as scanned documents. Fig 11 shows the output visualizing interface of our system containing all the images/scanned PDF documents already processed. The detected references from both layout and text-based

3) TASKS STATUS INTERFACE
This interface provides a list of all submitted processing tasks along with their history. Additionally, it shows the current status of each task, i.e. whether it is still in the queue or has already been processed. Fig 12 shows a screenshot of the interface. A link to the XML output of each processed file is also available next to each filename. Note that user access is protected with logins and sessions, so users are only able to view their own processing tasks.

4) XML OUTPUT
Our system combines the output from all pipelines for the respective file type and returns it in a single XML file, where results from each model are differentiated using two custom XML attributes. Fig 13 and 14 show output samples from the layout and text-based models respectively. The custom attributes are added to identify the source of the output. The first attribute is detector, which refers to the approach used to detect references, i.e. image-based or ParsCit. The second attribute is namer, which refers to the approach used to extract reference metadata, i.e. author names, title, publisher, etc., from the raw reference string. The possible values for namer are either ParsCit or Grobid.
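A minimal sketch of how results from several pipelines could be combined into one XML file with the two custom attributes is given below. Only the attribute names detector and namer and their possible values come from the text. The element names (`references`, `reference`, `raw`) and the input dictionary layout are hypothetical and do not reproduce the actual BRExSys schema shown in Fig 13 and 14.

```python
import xml.etree.ElementTree as ET


def build_output(results):
    """Serialize references from all pipelines into a single XML string."""
    root = ET.Element("references")  # hypothetical root element name
    for result in results:
        ref = ET.SubElement(
            root,
            "reference",
            detector=result["detector"],  # "image-based" or "ParsCit"
            namer=result["namer"],        # "ParsCit" or "Grobid"
        )
        # Store the raw reference string in a hypothetical child element.
        ET.SubElement(ref, "raw").text = result["raw"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [
        {"detector": "image-based", "namer": "ParsCit",
         "raw": "[1] A. Author, An Example Title, 2020."},
        {"detector": "ParsCit", "namer": "ParsCit",
         "raw": "[1] A. Author, An Example Title, 2020."},
    ]
    print(build_output(sample))
```

Keeping the source of each reference in attributes lets a consumer filter the combined file by detection approach without changing the element structure.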

E. BRExSys OVERVIEW
There are different pipelines in the BRExSys framework, of which the pipeline with an ensemble of text and layout-based methods is the largest. The ensemble pipeline consists of several phases. The pre-processing phase takes ≈ 4.35 seconds, reference detection from the image takes ≈ 2.79 seconds, and the extraction of detections takes ≈ 1.95 seconds. The OCR phase takes ≈ 3.63 seconds, followed by the most expensive phase, string segmentation by image-based approaches, which takes ≈ 8.65 seconds. Lastly, compiling both layout and text-based results into an XML file and drawing the results on the input image take ≈ 3.88 and ≈ 1.59 seconds respectively. It is worth mentioning that, due to limited resources, all these services were mostly running on a single core, which contributed to the longer execution time. BRExSys was tested on a system with the following hardware specifications:
Fig. 15 shows the output of all artificially created hypothetical cases after processing them through the ensemble pipeline of layout and text-based models. The output for the original image in Fig. 15a serves as the baseline, where all references are perfectly detected. The detections of the ensemble, layout-based, and text-based models are represented in green, yellow, and blue respectively. Fig. 15b simulates the output of BRExSys for an old document. The level of noise in the old document affected the output of the system: the layout-based approach successfully detected most of the references, missing only the top three. Similarly, the text-based model detected most of the references while missing the top three. However, in the case of the text-based model, some of the references are merged and detected as one reference. Fig. 15c and Fig. 15d simulate dim and tinted document images respectively, with different noise levels. In both cases BRExSys successfully detected all references, suggesting that only very high levels of noise may affect the output of the system.
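The per-phase timings reported for the ensemble pipeline earlier in this section can be totalled with a short calculation. The sketch assumes the phases run strictly one after another, which is consistent with the single-core remark but is not stated explicitly, and the phase labels are shortened for illustration.

```python
from decimal import Decimal

# Approximate per-phase durations in seconds, as reported in the text.
PHASE_SECONDS = {
    "pre-processing": "4.35",
    "reference detection": "2.79",
    "extraction of detections": "1.95",
    "OCR": "3.63",
    "string segmentation": "8.65",
    "XML compilation": "3.88",
    "drawing results": "1.59",
}


def total_seconds(phases):
    """Sum all phase durations, assuming sequential execution."""
    return sum(Decimal(value) for value in phases.values())


def slowest_phase(phases):
    """Return the name of the phase with the longest duration."""
    return max(phases, key=lambda name: Decimal(phases[name]))


if __name__ == "__main__":
    total = total_seconds(PHASE_SECONDS)
    print(f"total: {total} s")  # 26.84 s under the sequential assumption
    for name, value in PHASE_SECONDS.items():
        share = Decimal(value) / total * 100
        print(f"{name:26s} {value:>5s} s  ({share:.1f}%)")
    print("slowest:", slowest_phase(PHASE_SECONDS))
```

Under this assumption, one document takes roughly 26.84 seconds end to end, with string segmentation as the single largest contributor.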

VII. CONCLUSION
In this paper, we presented a novel layout-driven reference detection approach called ''DeepBiRD'' which exploits human intuition and visual cues to effectively detect references without taking textual features into account. We also presented a dataset for image-based bibliographic reference detection which is publicly available. Through meticulous evaluation, we pushed the boundaries of automatic reference detection and set a new state-of-the-art by a significant margin. The evaluation results suggest that DeepBiRD is an effective, generic, and robust approach to the problem of automatic reference detection from document images. Lastly, we presented a highly customizable framework called BRExSys for automatic reference detection, which employs models from different modalities, i.e. image and text-based. In future work, we aim to improve this system to make it more robust for very noisy document scans such as the one in Fig. 15b. Additionally, we intend to add functionality to automatically detect the references page of a scientific publication before the extraction of bibliographic references.
ANDREAS DENGEL received the Diploma degree in computer science from the University of Kaiserslautern and the Ph.D. degree from the University of Stuttgart. In 1993, he became a Professor with the Department of Computer Science, University of Kaiserslautern, where he currently holds the Chair of knowledge-based systems. Since 2009, he has been a Professor (Kyakuin) with the Department of Computer Science and Information Systems, Osaka Prefecture University. He was with IBM, Siemens, and Xerox PARC. He is currently the Scientific Director of the German Research Center for Artificial Intelligence, DFKI GmbH, Kaiserslautern. He has authored more than 300 peer-reviewed scientific publications and supervised more than 170 Ph.D. and master's theses. His main scientific emphases are in the areas of pattern recognition, document understanding, information retrieval, multimedia mining, semantic technologies, and social media. He is a member of several international advisory boards, chaired major international conferences, and founded several successful start-up companies. He has co-edited international computer science journals and written or edited 12 books. He is an IAPR Fellow. He received prominent international awards.
SHERAZ AHMED received the master's and Ph.D. degrees in computer science from the Technische Universität Kaiserslautern, Germany, under the supervision of Dr. H. C. Andreas Dengel and Dr. Habil Marcus Liwicki. His Ph.D. thesis was on generic methods for information segmentation in document images. He has primarily worked on the development of various systems for information segmentation in document images. He is currently a Senior Researcher with the German Research Center for Artificial Intelligence, DFKI GmbH, Kaiserslautern, where he is leading the area of XAI in time series and genome analysis. He has more than 50 publications on the said and related topics, including several journal articles and two book chapters. His research interests include document understanding, explainable AI, pattern recognition, anomaly detection, genome analysis, and natural language processing. He is a frequent Reviewer of various journals and conferences, including Pattern Recognition Letters, Neural Computing and Applications, IJDAR, ICDAR, ICFHR, and DAS.