Incremental structural model for extracting relevant tokens of entity

Abstract:

This paper describes a method for extracting the relevant tokens of an entity from semi-structured administrative documents. The method is used to correct mislabeling by exploiting entity tokens that are physically close in a document. First, the entities are labeled. Second, each entity is modeled by a token structure graph in which the nodes represent the tokens and the arcs represent the distances between them. A clustering algorithm is then applied to incrementally concatenate the relevant tokens of each entity and ignore the noisy parts. Results obtained on a dataset of real invoices are reported in the experimental section.

Published in: 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

Date of Conference: 09-12 October 2016

Date Added to IEEE Xplore: 09 February 2017

DOI: 10.1109/SMC.2016.7844767

Conference Location: Budapest, Hungary

I. Introduction

During the past two decades, technologies for automating tasks related to Document Image Analysis (DIA) have evolved significantly. DIA is a basic technology for document understanding. According to [1], it “refers to the field that is concerned with logical and semantic analysis of documents to extract human understandable information and codify it into machine-readable form”. Applied to administrative documents, automatic data capture means extracting the relevant entities from semi-structured document images of various types (purchase orders, credit items, forms, invoices, etc.). In the specific case of processing incoming invoices, the task is, given an incoming image from a known supplier, to extract all the relevant entities. According to [1], the relevant entities are generally classified into two broad categories: i) header data, which represent the invoicing data such as invoice number, invoice date, due date, amount due, etc.; ii) table data (or position data), which describe the details of invoice line items. In this paper, we focus on extracting header data from invoices. The latter have the same structure across the various samples, but the diversity of their content makes the data challenging to process. Its segmentation imposes particular requirements because of the incoherence and disposition of the data and the heterogeneous layouts.
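To make the idea summarized in the abstract more concrete, the following is a minimal sketch of a token structure graph (nodes are tokens, arcs carry distances) and an incremental grouping step over it. It is only an illustration under stated assumptions, not the method of the paper: the `Token` fields, the Euclidean distance between token centres, the `max_dist` threshold, and the nearest-neighbour growth rule are all hypothetical choices, since this excerpt specifies neither the distance measure nor the clustering algorithm.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # assumption: horizontal centre of the token's bounding box
    y: float  # assumption: vertical centre of the token's bounding box


def build_graph(tokens):
    """Complete graph: nodes are token indices, arc weights are distances."""
    graph = {i: {} for i in range(len(tokens))}
    for i, j in combinations(range(len(tokens)), 2):
        # Assumption: Euclidean distance between token centres.
        d = hypot(tokens[i].x - tokens[j].x, tokens[i].y - tokens[j].y)
        graph[i][j] = d
        graph[j][i] = d
    return graph


def grow_entity(tokens, graph, seed, max_dist):
    """Attach the nearest outside token, one at a time, while it is close enough."""
    cluster = [seed]
    remaining = set(graph) - {seed}
    while remaining:
        # Nearest outside token, measured to its closest cluster member.
        dist, nxt = min(
            (min(graph[c][r] for c in cluster), r) for r in remaining
        )
        if dist > max_dist:
            break  # everything still outside is treated as noise
        cluster.append(nxt)
        remaining.remove(nxt)
    # Concatenate in reading order (top to bottom, then left to right).
    ordered = sorted(cluster, key=lambda i: (tokens[i].y, tokens[i].x))
    return " ".join(tokens[i].text for i in ordered)


# Hypothetical header fragment: three nearby tokens and one distant noisy token.
tokens = [
    Token("Invoice", 10, 10),
    Token("No.", 60, 10),
    Token("12345", 100, 10),
    Token("Page", 400, 300),
]
graph = build_graph(tokens)
entity_text = grow_entity(tokens, graph, seed=0, max_dist=60.0)
```

In this sketch the seed stands for a token already labeled as part of an entity; growth stops as soon as the nearest remaining token lies farther than `max_dist` from the cluster, which is one simple way to ignore the noisy parts.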
