I. Introduction
Over the past two decades, technologies for automating tasks related to Document Image Analysis (DIA) have evolved significantly. DIA is a basic technology for document understanding. According to [1], it “refers to the field that is concerned with logical and semantic analysis of documents to extract human understandable information and codify it into machine-readable form”. Applied to administrative documents, automatic data capture is the extraction of relevant entities from images of semi-structured documents of different types (purchase orders, credit items, forms, invoices, etc.). In the specific case of processing incoming invoices, the task is, given an incoming image from a known supplier, to extract all the relevant entities. According to [1], the relevant entities are generally classified into two broad categories: i) header data, which represent the data about the invoicing party, such as invoice number, invoice date, due date, amount due, etc.; ii) table data, or position data, which describe the details of the invoice line items. In this paper, we focus on extracting header data from invoices. These invoices have the same structure across samples, but the diversity of their content makes the data challenging to handle. Segmenting the data imposes particular requirements because of inconsistencies, the placement of the data, and heterogeneous layouts.