Incremental structural model for extracting relevant tokens of entity | IEEE Conference Publication | IEEE Xplore

Incremental structural model for extracting relevant tokens of entity


Abstract:

This paper describes a method for extracting the relevant tokens of an entity from semi-structured administrative documents. The method is used to correct mislabeling by exploiting entity tokens that are physically close to one another in a document. First, the entities are labeled. Second, each entity is modeled by a token structure graph in which the nodes represent the tokens and the arcs represent the distances between them. A clustering algorithm is then applied to incrementally concatenate the relevant tokens of each entity and ignore the noisy parts. Results obtained on a dataset of real invoices are reported in the experimental section.
Date of Conference: 09-12 October 2016
Date Added to IEEE Xplore: 09 February 2017
Conference Location: Budapest, Hungary
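
The pipeline outlined in the abstract (a token graph whose arcs carry distances, followed by incremental concatenation of nearby tokens) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Token` fields, the Euclidean distance, the `max_dist` threshold, and the nearest-neighbour growing rule are hypothetical stand-ins, since this excerpt does not specify the paper's actual clustering algorithm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A labeled token with the centre of its bounding box (hypothetical fields)."""
    text: str
    x: float
    y: float


def build_token_graph(tokens):
    """Build a complete graph: nodes are token indices, arc weights are distances.

    Euclidean distance between token centres is an assumption; the paper may
    define distance differently.
    """
    graph = {i: {} for i in range(len(tokens))}
    for i, a in enumerate(tokens):
        for j, b in enumerate(tokens):
            if i != j:
                graph[i][j] = math.hypot(a.x - b.x, a.y - b.y)
    return graph


def grow_entity(tokens, graph, seed, max_dist):
    """Incrementally absorb the closest outside token until none is near enough.

    This is a simple nearest-neighbour growing rule (single-linkage style),
    used here as a stand-in for the paper's clustering step. Tokens farther
    than max_dist from every cluster member are treated as noise and ignored.
    """
    cluster = {seed}
    while True:
        best, best_d = None, max_dist
        for i in cluster:
            for j, d in graph[i].items():
                if j not in cluster and d <= best_d:
                    best, best_d = j, d
        if best is None:
            break
        cluster.add(best)
    # Concatenate in reading order: top to bottom, then left to right.
    ordered = sorted(cluster, key=lambda i: (tokens[i].y, tokens[i].x))
    return " ".join(tokens[i].text for i in ordered)


if __name__ == "__main__":
    # Made-up coordinates, not data from the paper.
    tokens = [
        Token("Invoice", 10, 10),
        Token("No:", 60, 10),
        Token("INV-0042", 110, 10),
        Token("Page", 400, 300),
        Token("1", 440, 300),
    ]
    graph = build_token_graph(tokens)
    print(grow_entity(tokens, graph, seed=0, max_dist=60.0))
```

With these made-up coordinates, growing from the first token collects the three tokens on the same line and leaves the distant page marker out, which mirrors the idea of keeping relevant tokens and ignoring noisy parts.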

I. Introduction

During the past two decades, technologies for automating tasks related to Document Image Analysis (DIA) have evolved significantly. DIA is a basic technology for document understanding. According to [1], it “refers to the field that is concerned with logical and semantic analysis of documents to extract human understandable information and codify it into machine-readable form”. Applied to administrative documents, automatic data capture means extracting the relevant entities from semi-structured document images of various types (purchase orders, credit items, forms, invoices, etc.). In the specific case of processing incoming invoices, the aim is, given an incoming image from a known supplier, to extract all the relevant entities. According to [1], the relevant entities are generally classified into two broad categories: i) header data, which represent the data on the invoicing part, such as invoice number, invoice date, due date, amount due, etc.; ii) table data, or position data, which describe the details of the invoice line items. In this paper, we focus on extracting header data from invoices. Invoices have the same structure across the various samples, but the diversity of their content makes the data difficult to handle. Segmenting the data imposes particular requirements because of incoherence, the disposition of the data, and heterogeneous layouts.
