An Overview of Data Extraction From Invoices

This paper provides a comprehensive overview of the process for information retrieval from invoices. Invoices serve as proof of purchase and contain important information, including the date, description, quantity, and the price of goods or services, as well as the terms of payment. Companies must process invoices quickly and accurately to maintain proper financial records. To automate this workflow, commercial systems have been developed. Despite the complexity involved, realizing automated processing of invoices necessitates the harmonious integration of a wide range of techniques and methods. While several surveys have shed light on different aspects of this workflow, our objective in this paper is to present a synthetic view of the process and emphasize the most pertinent challenges. We discuss the digitalization of invoices and the use of natural language processing techniques to extract relevant information. We also review machine learning and deep learning techniques that are widely used to handle the variability of layouts, minimize end-user tasks, and train and adapt to new contexts. The purpose of this overview is not to evaluate various systems and algorithms, but rather to propose a survey that reviews a wide scope of techniques for different data extraction tasks, addressing both information extraction and structure recognition for invoice processing. Specifically, we focus on table processing, paying particular attention to graph-based approaches.


I. INTRODUCTION
Invoices are crucial documents for companies as they serve as proof of purchase and are necessary for accounting and tax purposes. They are created by the seller and sent to the buyer to request payment for goods or services. Invoices typically contain essential information such as the purchase date, the description of goods or services, the quantity and price, and the payment terms. Companies need to process invoices promptly and accurately to maintain proper financial records and avoid potential payment delays. Digitizing invoices can help streamline the process and reduce the risk of errors. Paper invoices can be converted into a digital format, and automated systems can extract critical information like invoice numbers, amounts, and dates. This approach can speed up processing time and improve accuracy. Furthermore, digital invoices can be easily stored and accessed through document management systems, making it simpler to keep track of them and retrieve them when needed.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Olague.
Automated invoice processing requires handling several document characteristics, such as varying formats and layouts of invoices, differences in language and terminology, and errors or inaccuracies in the data [31]. This can present challenges, but with advanced techniques such as machine learning and deep learning, the process can be automated and made more accurate to accomplish the following objectives:
• effectively handle the variability of layouts: due to the lack of a global standard, invoices often exhibit significantly different formatting. Naturally, the required legal information varies from country to country, and furthermore, it can be arranged in various ways within the document. Hence, it is crucial to have labeling and typing techniques in place to isolate the key elements of an invoice,
• train and rapidly adapt to new contexts: in a practical scenario, companies often lack a substantial corpus of invoices that are properly labeled for learning or testing purposes. However, for small companies, the invoices they handle are typically specific since they originate from a relatively limited number of customers and suppliers. Consequently, it should be feasible to customize a system effortlessly for a particular situation,
• minimize the end-user task: while some systems rely on predefined invoice presentation styles, modifying these layouts typically requires extensive user interaction. Although it is important to engage the user in formulating their needs and specifying the desired information and management rules, it is essential to minimize the laborious manual tasks involved in system tuning,
• efficiently detect and extract tables from the invoices: tables play a crucial role in invoices, primarily used to present accounting information. However, their formatting can vary significantly, and in some cases, they may only be suggested without explicit graphic delimiters. Consequently, the detection of tables within invoices represents a significant challenge for automated systems, which must leverage the distinctive characteristics of invoices compared to more generalized documents that rely on headings or predefined elements.
Automated processing of documents requires dedicated approaches based on the targeted domain. For instance, legal texts require specific techniques [17], [42]. The analysis of administrative documents, including invoices, has been an active area of research for many years [13]. The task is complex because invoices can come in various formats and contain a wide range of information such as invoice numbers, amounts, dates, and payment terms [31]. The lack of structure in documents poses a real challenge for companies [12]. To address this complexity, various techniques have been developed, such as Optical Character Recognition (OCR) [126] for digitizing paper invoices and natural language processing (NLP) techniques for extracting relevant information from the text. Neural networks are also frequently used for document classification tasks [137].
Commercial systems have been developed by companies like ITESOFT and ABBYY [122] to automate the processing of invoices. These systems use a combination of OCR, NLP, and machine learning techniques to extract information from invoices and process them automatically. By integrating with the company's existing systems, such as accounting and enterprise resource planning (ERP) systems, they streamline the invoice processing workflow into a global electronic document management system (EDMS) [63]. Recent advances have led to the development of other end-to-end solutions for invoices [6].
Processing invoices requires complex administrative procedures and involves different departments such as accounting, logistics, and supply chain. To ensure efficiency and accuracy, specific workflows are often used [56]. These workflows typically involve multiple steps, such as document digitization, information extraction, and data validation, as well as security considerations [97]. Since invoices can take on various forms, statistical learning methods have been used to detect their possible classes [128].
The step of digitizing documents involves utilizing OCR technology to convert paper invoices into a digital format, allowing them to be processed and stored electronically with ease. Next comes the information extraction phase, which entails identifying the various identifiers such as types, amounts, dates, and other crucial details from the invoices. To achieve this, natural language processing (NLP) techniques, such as named entity recognition (NER), are typically employed, which aids in recognizing and extracting specific information from the text [50], [52].
Even if outside the scope of this overview, it is worth noting that classification techniques have been proposed for managing sets of invoices and categorizing financial transactions based on their economic nature [9], [131]. Machine learning can also be used to forecast financial data [55] related to invoicing, and time series tools such as [141], [142], and [140] are particularly useful for this purpose.
There have been many proposed solutions for managing information contained in scanned invoices, and most of these solutions are based on machine learning techniques, which have seen recent advances [50], [102]. In general, probabilistic and statistical approaches seem to be a natural way of understanding documents [88]. The first challenge in this field was identifying invoices from a set of documents [71], and models have been proposed to streamline this process [22].
Once invoices have been correctly scanned and identified, the next challenge is to extract relevant information from them. Labeling techniques can be applied using rules [33], but recent research has focused on using neural networks (NN) for named entity recognition (NER) tasks [73], [75]. This is because invoices often contain text sequences that are vastly different from natural language, and specific information extraction methods have been proposed to consider the specific structures in these documents. For example, [31] uses a star graph to consider the neighborhood of a text token, allowing for the context of a token to be taken into account when extracting information. This is a powerful method as it allows for meaningful information to be extracted from the document.
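To make the idea of a token neighborhood concrete, the following minimal sketch builds a star around each token by linking it to its k nearest tokens on the page. It is an illustration only and not the implementation of [31]: the `Token` type, the `star_graph` helper, the use of Euclidean distance between box centres, and the sample coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # horizontal centre of the token's bounding box (assumed layout unit)
    y: float  # vertical centre of the token's bounding box


def star_graph(tokens, k=2):
    """Map each token index (the hub) to the indices of its k nearest tokens (the leaves).

    Distance is the Euclidean distance between box centres, which is an
    assumption of this sketch rather than a choice documented in the survey.
    """
    graph = {}
    for i, hub in enumerate(tokens):
        others = [j for j in range(len(tokens)) if j != i]
        others.sort(key=lambda j: hypot(tokens[j].x - hub.x, tokens[j].y - hub.y))
        graph[i] = others[:k]
    return graph


# Hypothetical tokens, as an OCR engine might return them for an invoice header.
tokens = [
    Token("Invoice", 10, 10),
    Token("No.", 30, 10),
    Token("F-2024-001", 55, 10),
    Token("Total", 10, 90),
    Token("120.00", 55, 90),
]
graph = star_graph(tokens, k=2)
neighbours_of_number = [tokens[j].text for j in graph[2]]
print(neighbours_of_number)
```

In such a representation, the leaves of a hub (here the words closest to the invoice number) could serve as context features for a classifier that decides which label the hub should receive.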
Several surveys provide an overview of general processing techniques for image documents, such as OCR techniques [59], text detection techniques [16], [146], NER approaches [75], [95], [144], and table processing [30], [38], [64]. However, few papers provide general considerations for invoice processing. One such paper is [52], which does not cover table extraction. In [6], an end-to-end system is proposed for processing invoices, including the different above-mentioned steps. Choices are made to select relevant techniques, and the resulting system focuses on key field extraction. From these considerations, our motivation is to offer a more comprehensive overview of available methods that practitioners can use to design end-to-end solutions for invoice processing. We also pay particular attention to recent approaches based on graph representations. Our study is rooted in practical experience, underpinned by the effective implementation of an electronic document management system in partnership with a company.
This overview aims to examine data extraction in the context of automated invoice processing. In Section II, we provide a comprehensive description of an invoice to highlight the critical data and structures that require attention. In our main section, Section III, we discuss the different components necessary for extracting this data. These include the digitization of the invoice using OCR (Section III-A), the development of a data extraction process (Section III-B), which involves recognizing specific entities (Section III-C) and identifying tables (Section III-D). Section III-E explores how geographical information can be utilized, with a particular focus on the use of graph-based representations.
Since such a survey involves numerous references, we provide an appendix with bibliographic tables that help the reader quickly identify the cited references according to the above-mentioned organization of the sections.

II. INVOICE MODELING
Defining a suitable representation of an invoice is an important step for clearly understanding its specifications.
Processing an invoice can include inputting data such as the invoice number, date, and amounts, as well as assigning it to a specific customer or project. In [22], a semantic network was used to describe the invoice domain at different levels of abstraction. Before going through invoice processing techniques, we propose here a model that focuses on the relevant extraction tasks that an invoice processing application is expected to handle.
We chose to initially limit the scope of invoice extraction. Figure 1 illustrates a basic sample of an invoice, emphasizing key information sought by automated document processing tools. The extraction of specific fields, such as the invoice date (highlighted in the purple box), supplier address (in the orange box), and organizational providers (within the cyan box), is crucial. This survey places particular emphasis on table extraction, as indicated by data enclosed in blue and red boxes. Additionally, it is worth noting that the invoice contains other pertinent information that may be valuable for Named Entity Recognition (NER) processes, including the identification of both the sender and receiver. Figure 2 provides a comprehensive view of the typical content of an invoice by means of a UML class diagram.
Different types of information must be highlighted such as addresses, tables, dates, and actors (organizations or individuals identified on the invoice). This selected information seems coherent with the analysis of multiple invoice models and the usual requirements of the companies. One may identify 6 groups of data:
• Actors: individuals or companies involved in the invoice, such as a customer or a supplier.
• Independent fields: fields whose value is not linked to one of the other following fields and that often represent essential data for the invoice.
• Information on the document: information specific to the management of the document, such as its name or identifier in the file system, the dates of creation and processing of the document, that is, all the data that are not extracted from the document but that come from its processing.
• Addresses: addresses contained in the document, with if possible precision on their types (billing address, delivery, or sender, for example).
• Tables: data tables are essential in invoices. They often include several lines of invoiced items, prices, quantities, etc.
• Date: the set of dates, specific to the invoice processes such as the date of the edition of the invoice, the date of payment or of delivery.
Among these data, tables are considered complex to extract in this model because they often contain a large amount of structured data that needs to be parsed and understood. Companies need to perform verification operations on the table data, such as verifying VAT amounts and rates, or ensuring that the sum of the table lines matches the invoice amount. Efficient methods for extracting and analyzing table data are crucial due to the time-consuming and error-prone nature of the process.
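As a rough illustration of the data groups listed above and of the verification operations on table data, the sketch below models a simplified invoice and checks that the VAT and the total recomputed from the table lines match the stated amounts. The class names, field names, the per-line VAT computation, and the tolerance of 0.01 are assumptions chosen for this example; they do not reproduce the UML model of Figure 2.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class TableLine:
    description: str
    quantity: Decimal
    unit_price: Decimal  # price excluding VAT (assumption)
    vat_rate: Decimal    # e.g. Decimal("0.20") for a 20 % rate

    @property
    def net(self):
        return self.quantity * self.unit_price

    @property
    def vat(self):
        return self.net * self.vat_rate


@dataclass
class Invoice:
    number: str            # independent field
    issue_date: str        # date
    supplier: str          # actor
    customer: str          # actor
    billing_address: str   # address
    source_file: str       # information on the document
    lines: list = field(default_factory=list)  # table
    total_vat: Decimal = Decimal("0")
    total_gross: Decimal = Decimal("0")


def check_totals(invoice, tolerance=Decimal("0.01")):
    """Return a list of inconsistencies between table lines and stated totals."""
    problems = []
    vat = sum((line.vat for line in invoice.lines), Decimal("0"))
    gross = sum((line.net for line in invoice.lines), Decimal("0")) + vat
    if abs(vat - invoice.total_vat) > tolerance:
        problems.append(f"VAT mismatch: lines give {vat}, invoice states {invoice.total_vat}")
    if abs(gross - invoice.total_gross) > tolerance:
        problems.append(f"Total mismatch: lines give {gross}, invoice states {invoice.total_gross}")
    return problems


# Hypothetical invoice: 2 x 50.00 at 20 % VAT and 1 x 30.00 at 10 % VAT.
lines = [
    TableLine("Paper", Decimal("2"), Decimal("50.00"), Decimal("0.20")),
    TableLine("Ink", Decimal("1"), Decimal("30.00"), Decimal("0.10")),
]
consistent = Invoice("F-2024-001", "2024-03-15", "Supplier SA", "Customer Ltd",
                     "1 Example Street", "scan_0001.pdf", lines,
                     total_vat=Decimal("23.00"), total_gross=Decimal("153.00"))
inconsistent = Invoice("F-2024-002", "2024-03-16", "Supplier SA", "Customer Ltd",
                       "1 Example Street", "scan_0002.pdf", lines,
                       total_vat=Decimal("23.00"), total_gross=Decimal("150.00"))
print(check_totals(consistent))
print(check_totals(inconsistent))
```

`Decimal` is used instead of floating-point numbers so that monetary comparisons are not affected by binary rounding; real invoices may also round VAT per line or per rate, which this sketch does not model.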

III. INVOICE PROCESSING
As mentioned in the introduction, automated invoice processing requires a complete chain of software tools to automate the tasks involved in processing invoices. Hence, we could consider the following key features:
1) Optical Character Recognition (OCR): OCR is used to extract data from scanned or PDF invoices, making them searchable and easily readable by the system.
2) Machine Learning (ML): ML algorithms are widely used to classify and extract data from invoices, such as vendor information, invoice numbers, and amounts. They are also intended to be able to extract structured information such as tables.
3) Workflow Automation: the system automatically routes invoices for approval, flagging any discrepancies or errors for manual review.
4) Integration with ERP: automated invoice processing systems can integrate with enterprise resource planning (ERP) systems, allowing for seamless data transfer and real-time visibility into the invoice process.
5) Real-time Analytics: automated invoice processing systems can provide real-time analytics and reporting on invoice data, allowing businesses to track and analyze their spending. This is strongly related to business intelligence modules.
6) Compliance and Security: one may want to check compliance with tax regulations, and protect sensitive data through security measures such as encryption and secure data storage.
In this overview we restrict our scope to information extraction, considering raw scanned documents. Hence we restrict ourselves to the first two points of the above-mentioned features.

A. OPTICAL CHARACTER RECOGNITION
OCR systems have a long history, starting with early mechanical devices that were developed in the 1950s, such as GISMO (built by Sheppard in 1951). During the 1960s and 1970s, not much research was done on OCR due to the errors and slow recognition speed of the early systems [72]. However, during the past 40 years, there has been substantial research on OCR which has led to the development of document image analysis, multilingual, handwritten, and omni-font OCRs. Nevertheless, OCR technology is still far from matching human reading abilities and current research focuses on improving accuracy and speed for diverse document styles and languages, including complex languages.
Let us mention several state-of-the-art reviews [37], [93] that were already synthesizing the work in the early 90s. The seminal roots of OCR can be explored by reading the state-of-the-art review of Mantas [83]. On the other hand, a good practical starting point for OCR can be accessed through the work of Breuel [18], which presents an open-source OCR solution. A more recent state-of-the-art review of OCR was published in 2017 by Islam et al. [59].
Hence, OCR is a crucial discipline in image interpretation with highly important potential applications. A major problem was handwritten character recognition [89], including the need for a database. Note that important conferences have focused on OCR since the 90s, e.g., ICDAR [1], with dedicated workshops [49]. Neural networks were then considered to overcome the previous limitations. In [28], the use of projection profile features coupled with a back-propagation neural network classifier has proven highly effective. Nowadays, neural networks are widely used in OCR technologies. Let us quote some recent works: in [96], the authors consider a significantly extensive Urdu corpus ideally suited for applications involving deep learning techniques; [62] introduced end-to-end learning methods for recognizing arithmetic expressions, combining a deep convolutional neural network and a convolutional recurrent neural network; in [66], the authors propose an exploration of character recognition, encompassing both monolingual and multilingual contexts, utilizing both deep and shallow architectural approaches.
Among the impressive number of works related to OCR, let us mention the work of Mithe et al. [92], which uses OCR to extract text and then sends it to a voice synthesizer. The main objective behind this solution is to transform an image into speech based on the text contained in the picture. This article shows that the processing of an image makes it possible to obtain fully structured information.
Of course, it is also very important to clearly assess the performance of OCR using suitable measures and available benchmark sets [98]. Let us note here that image processing techniques can be used to get better initial documents, even before applying OCR. Morphological operations, such as dilation, erosion, and opening, are commonly used in image processing to remove noise, blur, and skewness from document images. These techniques have been applied to prepare images for OCR and to locate text-containing parts in an image [146], for instance using OpenCV [44].
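The effect of a morphological opening on a noisy binarized page can be sketched as follows. This is a plain-Python illustration on a tiny 0/1 image with a fixed 3x3 structuring element, where pixels outside the image are treated as background; these choices and the helper names are assumptions of the example. In practice, OpenCV exposes comparable operations (for example `cv2.erode`, `cv2.dilate`, and `cv2.morphologyEx`).

```python
def erode(img):
    """Keep a foreground pixel only if its whole 3x3 neighbourhood is foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(all(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            ))
    return out


def dilate(img):
    """Set a pixel to foreground if any pixel of its 3x3 neighbourhood is foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(any(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            ))
    return out


def opening(img):
    """Erosion followed by dilation: removes specks smaller than the 3x3 element."""
    return dilate(erode(img))


# A 3x3 blob of "ink" plus one isolated noise pixel at row 2, column 5.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
expected = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(noisy)
print(cleaned == expected)
```

The isolated pixel disappears during erosion and is not restored by dilation, whereas the larger blob is shrunk and then grown back to its original shape.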
Back to our structured information extraction concern, a dedicated challenge has been recently proposed by Huang et al. [58] at the ICDAR2019 conference. The prize for the best paper was awarded to Zhong et al. [156] which offers a solution based on neural networks for the recognition of certain entities related to the formatting of documents.
In recent times, there has been ongoing research in the field of OCR. Let us mention a first work [10] that specifically focuses on the application of OCR for the recognition of written texts within a medical context. A promising development in OCR techniques aligns with the progress in deep learning, as exemplified by the work of Li et al. [77]. In this work, the authors have adapted the transformer architecture to address OCR challenges and have presented a comprehensive benchmark featuring many contemporary techniques. This reflects the dynamic evolution of OCR methodologies, where advancements in deep learning play a pivotal role.
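Regarding the assessment of OCR performance mentioned in this section, one commonly used measure is the character error rate (CER), defined as the edit distance between the recognized text and the reference transcription divided by the length of the reference. The survey does not state which measures the cited benchmarks use, so the following sketch is only an assumed example; the function names and sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical OCR output confusing the letter O with the digit 0, twice.
reference = "INVOICE 2024"
hypothesis = "INV0ICE 2O24"
cer = character_error_rate(reference, hypothesis)
print(round(cer, 3))
```

Because the CER is normalized by the reference length, it can exceed 1 when the recognized text contains many spurious insertions.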

B. DATA EXTRACTION
Once the OCR has been applied, we are generally left with a set of PDF documents that are expected to be searchable and exploitable. Let us first begin with a general consideration of possible data extraction at this stage. At first glance, we may consider the visual aspect of the document and the relative positions of the information that it contains.
The work of Taylor et al. [132] presents an overview of the problem of document extraction from scanned documents. This article highlights the problems of alignment of the text. It also highlights that only part of the information is relevant to extract.
The global layout of the document has to be taken into account [7]. Ahmad and Man [2] use the concept of unstructured, semi-structured, or structured documents. The work of Yao et al. [145] on the relationships between entities, which is also unlabeled, also seems very relevant. Sun et al. [130] present a solution for orienting documents according to a specific entity (QR Code in the article). These methods address two common challenges in data extraction: document orientation and scale. The invoices, which are in the form of images, are first preprocessed to remove any unnecessary background and to correct the angle of the invoice. Then, the region containing the desired information on the invoice is identified using template matching. Another system (BINYAS) [16] performs document layout analysis for document image processing. This system uses connected components and pixel analysis for classifying elements such as paragraphs, graphics, images, and tables in the document. In [11], the authors propose a dataset for unstructured invoice documents that covers a wide range of layouts, which is designed to generalize key field extraction tasks for unstructured documents. The dataset is evaluated using various feature extraction techniques as well as Artificial Intelligence methods.
As already mentioned, tabular content extraction from PDF documents is of great importance, in particular for benefiting from available open-source document repositories [30]. The extraction and processing of data from PDF files have indeed long been studied [81]. Data in tables is often displayed in a tabular format. Although tables may appear simple, extracting and processing them from PDFs can be difficult and require complex computational methods [48]. The purpose is often to produce new formats from initial PDFs such as XML files [112]. Note that PDFs do not typically record the structure of their graphical objects in their description, although it could be done.
Of course, visual separators are important for identifying tables in documents as they reveal the table structures [41]. Actually, when tables include visible lines that can be extracted from the document, considering the maximum independent set of rectangles (MISR) problem seems relevant [24]. MISR consists of finding, in a set of rectangles, the largest subset of rectangles that do not intersect each other. Unfortunately, many tables lack lines separating some columns or rows, and some techniques do not apply in these cases. Yildiz et al. [148] present approaches based on line intervals and columns to identify the entities corresponding to table cells. Note that table extraction will be detailed in Section III-D.
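To illustrate the MISR problem mentioned above, the following sketch searches exhaustively for a largest subset of pairwise non-overlapping axis-aligned rectangles. It is not the method of [24]: MISR is NP-hard in general, so this brute-force search is only suitable for a handful of rectangles, and the coordinate convention `(x1, y1, x2, y2)` as well as the choice that rectangles sharing only an edge do not overlap are assumptions of the example.

```python
from itertools import combinations


def overlap(a, b):
    """True if two axis-aligned rectangles (x1, y1, x2, y2) share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def max_independent_rectangles(rects):
    """Return a largest subset of pairwise non-overlapping rectangles.

    Subsets are tried from the largest size downwards, so the first
    independent subset found is optimal (exponential time in the worst case).
    """
    for size in range(len(rects), 0, -1):
        for subset in combinations(rects, size):
            if all(not overlap(a, b) for a, b in combinations(subset, 2)):
                return list(subset)
    return []


# Hypothetical candidate cells detected from ruling lines.
A = (0, 0, 4, 2)
B = (3, 0, 7, 2)   # overlaps A
C = (7, 0, 10, 2)  # only touches B along an edge
D = (0, 2, 10, 4)  # row below, touching A, B, and C along an edge
best = max_independent_rectangles([A, B, C, D])
print(best)
```

Here A and B cannot both be kept, so the largest independent set has three rectangles.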
Deep learning techniques are now widely used to identify and extract tables in PDF documents [46], [151]. This aspect will be detailed later. Note that some work uses APIs such as PDFminer to transform PDF into XML and perform supervised learning on XML [103].

C. ADDRESSING SPECIFIC INFORMATION EXTRACTION: NAMED ENTITY RECOGNITION
In the scope of this study, we are not concerned with general document processing but with invoices that are restricted to a specific domain, whose terms and concepts are known. Hence we are concerned with the semantics of the documents. The analysis of invoices is hence related to Natural Language Processing (NLP) and more specifically to Named Entity Recognition (NER) (see [95], [144] for dedicated surveys).
The problem of named entity recognition (NER) was presented by Marsh and Perzanowski at the MUC conference [85]. NER involves labeling a text by associating each character string with a specific category, such as a person, location, organization, temporality, amount, or percentage. This problem is also referred to as entity labeling or entity extraction. Research on this topic then intensified. During CoNLL-2003 [134], the focus was put on language-independent named entity recognition. The challenge concentrated on four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups. During the same period, the goal of the ACE program [35] was to advance technology for automatically extracting information from human language data. This includes identifying mentioned entities, determining the relationships between these entities as expressed in the text, and recognizing the events in which these entities are involved. The program encompasses various data sources.
19876 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
At that time, NER was restricted to the names of people, locations, and organizations, and sometimes to some other proper names, which does not cover all the possible types expected in an invoice.
The specific set of labels used in NER depends on the data and the task at hand. NER is, of course, strongly dependent on the application domain (e.g., [106], [153]). Some researchers limit themselves to the 6 initial categories (person, location, organization, temporality, amount, and percentage) and believe that these labels are sufficient for all NER tasks. However, other researchers argue that specific labels may be necessary to effectively solve specific NER tasks [20]. The number of labels used can vary depending on the complexity of the data and the specific requirements of the task. Therefore, the choice of labels is often a trade-off between the need for specific information and the complexity of the model. Let us particularly mention the works of Alfonseca et al. [3] and R. Evans [40] that use the notion of "open domain". Recently, data sets have been made available for NER related to invoices [11]. From a practical point of view, Mikolov et al. [90] demonstrate the benefit of using vector representations of words and also show that it is possible to train a neural network model on a large training set, including a large number of sentences with approximately one billion words and a vocabulary of more than one million different words. A month later, Mikolov et al. [91] considered a distributed representation of words and showed that, by adding certain word vectors, the learning process makes it possible to capture the meaning of words. The linguist Scharolta Katharina Siencnik [125] then attempted to demonstrate the possible application of these algorithms to named entity recognition.
While state-of-the-art named entity recognition systems previously relied heavily on hand-crafted features and domain-specific knowledge, new neural architectures for NER were proposed [27], [73]. These architectures aim to improve performance by leveraging the strengths of neural networks, such as their ability to learn useful features from data, while still addressing some of the limitations of previous methods. Convolutional neural networks (CNN) [79] were then applied to NER problems [138], [150], as were bidirectional networks [4]. Let us mention work on the identification of depression according to the answers of the patient in an interview [114] as well as the work of He et al. [54] to establish distant dependencies between the entity terms via the processing of CNN.
ELMo is a language model that was developed by Matthew E. Peters and his team [104]. Unlike traditional word embeddings that represent words as fixed vectors, ELMo utilizes the context in which words appear to generate more dynamic and informative word embeddings. The model is semi-bidirectional, which means it takes into account both the preceding and succeeding words in a sentence to better understand the meaning of the word it is trying to represent.
ELMo's innovative approach to word embeddings quickly gained attention from researchers in the natural language processing (NLP) community. Dogan et al. [36] applied ELMo's neural network architecture to tackle Named Entity Recognition (NER) problems, which involve identifying and classifying entities in text such as names, dates, and locations. While ELMo showed promising results, one limitation was that it could not be effectively fine-tuned with other models using a "masked language model".
To address this shortcoming, Devlin et al. proposed BERT [34], a bidirectional model that, like ELMo, produces contextual representations but relies on the Transformer architecture, and which has since become one of the most widely used pre-trained models for NLP tasks. BERT uses a "masked language model" objective that allows it to refine its representations using an unsupervised pre-training method. This makes it possible for BERT to generate high-quality word embeddings that can be fine-tuned with other NLP models to achieve state-of-the-art performance on a wide range of language tasks.
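The "masked language model" objective can be pictured with a small data-preparation sketch: some tokens are hidden and the model is trained to recover them from the words on both sides. The code below only builds such a training example and is a deliberate simplification. The 15 % masking rate follows the BERT paper, but the original procedure also sometimes keeps the selected token or replaces it with a random one, which is omitted here; the function name, the fixed seed, and the sample sentence are assumptions.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets), where targets maps position -> original token.

    Simplified sketch: every selected position is replaced by mask_token.
    At least one token is masked so that short inputs still yield a target.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


tokens = "the invoice total is due within thirty days".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During pre-training, the model would receive `masked` as input and be penalized when its prediction at each position in `targets` differs from the stored original token.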
Safaya et al. [115] demonstrate the possible association between CNN and BERT and study its efficiency. This work focuses on BERT associated with Arabic, Turkish, and Greek languages, which present a more structured construction than some languages. This study achieves better efficiency in the recognition of hateful content for these languages.
GPT models have become essential in natural language processing (NLP) due to their ability to be fine-tuned for specific NLP tasks. Radford's work on language model transformers, particularly the GPT model [110], has revolutionized the field of NLP. Unlike bi-directional models like BERT and ELMo, GPT is a unidirectional model where word embeddings are only enhanced in one direction, typically from left to right. This unidirectional architecture makes GPT particularly useful in language prediction tasks, where the model predicts the next word in a sentence based on the preceding words.
Getting back closer to our main concern, Francis et al. [43] present a solution for extracting data from financial or medical documents using a neural network trained for named entity recognition, and evaluate the efficiency of a character-based model compared with a word-based one. One also has to consider general language processing. For instance, the work of Suárez et al. [129] on the state of the art of named entity recognition for the French language can be useful for dealing with French invoices. Hamdi et al. [52] present tools to improve the learning of invoice-specific labeling by reducing the cost of time and human intervention.
To ensure better explainability, rule-based approaches are useful alternative techniques for achieving NER [26]. Shreeshiv et al. [102] address the extraction of key parameters of the invoice (KPIE) by proposing a rule-based approach and an approach based on neural networks to recognize these parameters of the invoice. Declarative approaches based on constraint solving should also be considered as a promising research direction [5].
Practical solutions are available for NER, such as those of Nanonets.com and ABBYY. A well-documented example explains the use of BERT for NER [34].
In summary, there are two main approaches. The historical rule-based approach tends to be inspired by the rules of traditional grammar for labeling words in the context of the text. This approach is very efficient on specific domains because the rules are often written with the target domain in mind to avoid ambiguity. Nevertheless, this specialization leads to processing difficulties for new contexts that were not defined during implementation. It is also necessary to rework the model to extend its capacity. This step often requires the intervention of an expert.
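A rule-based labeler of the kind summarized above can be sketched with a few regular expressions. The rules, label names, and sample sentence below are hypothetical and deliberately narrow; they illustrate why such rules work well on the layouts they were written for and need reworking for new contexts. They do not reproduce the rules of [26], [33], or [102].

```python
import re

# Hypothetical hand-written rules; each captures the entity value in group 1.
RULES = {
    "INVOICE_NUMBER": re.compile(
        r"\bInvoice\s+(?:No\.?|Number)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "DATE": re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b"),
    "AMOUNT": re.compile(r"\b(\d+(?:[.,]\d{2}))\s?(?:EUR|USD|€|\$)"),
}


def extract_entities(text):
    """Return (label, value, start_offset) triples sorted by position in the text."""
    entities = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            entities.append((label, match.group(1), match.start(1)))
    return sorted(entities, key=lambda entity: entity[2])


text = "Invoice No: F-2024-001 dated 2024-03-15. Total due: 153.00 EUR"
entities = extract_entities(text)
labelled = [(label, value) for label, value, _ in entities]
print(labelled)
```

An invoice that writes "Facture n°" instead of "Invoice No", or places the currency before the amount, would not be matched until an expert adds new rules, which is the extension cost discussed above.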
The neural network approach to labeling the entities of a document is attractive because it avoids spending too much time defining labeling rules. This method handles new domains better, and retraining for newly encountered concepts can be automated more easily. Nevertheless, neural networks require huge computational resources and training corpora to be efficient.
In Figure 3, we propose an empirical evaluation of NER systems based on the state of the art, the statements of various specialists in this field, and the needs encountered in companies. This evaluation is therefore subjective.

D. FOCUS ON TABLE EXTRACTION
A closer examination of invoices shows that most of them include tables as a main structural feature. Hence, table detection within invoices is an important processing task [121]. Table processing is indeed an old challenge (the 2004 survey [149] already proposes an overview of the field), but it remains an active one [45].
Understanding the information embedded in tables involves three steps, as noted in [61]: detecting the table boundaries, identifying the structure of the table, including rows, columns, and cell positions, and recognizing the contents of the table (tokens of information that are expected to be presented in a more readable format). The layout is an important aspect [69]. Techniques used for detection include object detection models [23] such as Faster R-CNN (Region-based Convolutional Neural Networks) and Mask R-CNN [107], as well as NLP-based methods that incorporate both textual and visual features [57].
Note that TableBank [76] provides an image-based table detection and recognition dataset. Models trained on PubLayNet [156], a dataset built from over one million PDF articles, can accurately recognize the layout of scientific articles. LayoutLMv3 [57] is pre-trained with a word-patch alignment objective to improve cross-modal alignment. This objective trains the model to predict whether the image patch corresponding to a text word has been masked.
Deep learning techniques are now widely used for table structure recognition. Kavasidis et al. [67] introduced a fully convolutional neural network that uses saliency-based techniques for multi-scale reasoning on visual cues. They also incorporate a fully connected conditional random field to precisely locate tables and charts within digital or digitized documents. A common approach consists of using a bidirectional RNN with Gated Recurrent Units (GRUs) to process image data [68]. A pre-processing step first formats the image data so that it can be fed into the network. The bidirectional RNN with GRUs then analyzes the image data and extracts features. Finally, a fully connected layer with a softmax activation function classifies the image based on the features extracted by the RNN. Gilani et al. [47] introduced a deep learning approach to detect tables. Their method begins by pre-processing document images, which are then fed into a Region Proposal Network (RPN), followed by a fully connected neural network to identify tables. The method demonstrates remarkable precision when applied to document images with diverse layouts, including documents, research papers, and magazines. Vine et al. [136] introduce a two-step approach that combines a generative adversarial network (GAN) with a genetic algorithm to optimize a distance measure between candidate table structures.
Another two-step process, which uses cell detection and interaction modules to recognize the structure of a table, is proposed in [111]. The cell detection module locates and identifies individual cells in the table image. The interaction module then predicts the associations between the detected cells, such as their row and column associations. This approach can be useful for determining the overall structure of a table, including the number of rows and columns, as well as the relationships between cells within the table. Convolutional networks have, of course, been explored [67], [124], along with Split and Merge models [133]. In [99], the authors also consider explainability as an issue for neural networks. Global end-to-end solutions are now available, such as TableNet [100], DeepDeSRT [119], PubTabNet [155], and GTE [154]. Dedicated benchmark repositories have been proposed to evaluate these methods: TableBank [76] (417K high-quality labeled tables) and a novel dataset derived from the RVL-CDIP invoice data [113].
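The idea of associating detected cells with rows and columns can be sketched with a purely geometric heuristic. The example below is not the learned interaction module of [111]: it simply groups cell bounding boxes whose vertical (respectively horizontal) extents overlap, which is only adequate for regular tables without spanning cells. The names `group_indices` and `assign_rows_and_columns` and the sample boxes are hypothetical.

```python
from typing import List, Tuple

# A cell bounding box (x0, y0, x1, y1), with the origin at the top-left corner.
Box = Tuple[float, float, float, float]


def group_indices(boxes: List[Box], lo: int, hi: int) -> List[List[int]]:
    """Group box indices whose [box[lo], box[hi]] intervals overlap.

    Use (lo, hi) = (1, 3) to group by vertical extent (rows) and
    (lo, hi) = (0, 2) to group by horizontal extent (columns).
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][lo])
    groups: List[List[int]] = []
    current_end = None
    for i in order:
        start, end = boxes[i][lo], boxes[i][hi]
        if current_end is None or start >= current_end:
            groups.append([i])  # no overlap with the current group: open a new one
            current_end = end
        else:
            groups[-1].append(i)
            current_end = max(current_end, end)
    return groups


def assign_rows_and_columns(boxes: List[Box]) -> List[Tuple[int, int]]:
    """Return the (row, column) index of every cell, in input order."""
    rows = group_indices(boxes, 1, 3)
    cols = group_indices(boxes, 0, 2)
    row_of = {i: r for r, group in enumerate(rows) for i in group}
    col_of = {i: c for c, group in enumerate(cols) for i in group}
    return [(row_of[i], col_of[i]) for i in range(len(boxes))]


# Assumed output of a cell detector: a 2 x 3 table with slight jitter.
cells = [
    (10, 10, 90, 30), (100, 12, 180, 31), (190, 11, 260, 29),
    (12, 40, 88, 60), (102, 41, 178, 61), (191, 39, 262, 59),
]
grid = assign_rows_and_columns(cells)
```

The number of rows and columns follows directly from the largest indices in `grid`. Cells that span several rows or columns would merge neighboring groups under this heuristic, which is one reason learned interaction models are preferred in practice.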
Table detection may also rely on more specific knowledge. In [139], the authors propose a system for automatically generating ground truth data for training table detection algorithms. The literature also contains important work on layouts. For example, the authors of [105] use Conditional Random Fields (CRF) to model the different layouts of a table, which can sometimes overlap and may be misinterpreted by other modeling languages. Tools such as TableSeer [80] search for forms that may correspond to tables in order to extract them and run queries on their contents.
The specific structure of invoices leads us to consider the geographical organization of the document, and graph-based models are thus relevant [121]. Recent work [65] proposes an approach to detect the general frame of a table and extract its content. For more specific tables, characteristics such as headers can also help with these tasks [120]. Rule-based systems, which were the seminal table extraction techniques, may also be relevant [123].
Graph-based approaches also seem to be a natural way to handle tables. In [116], the authors use graph mining to extract tables using key fields. Hence, Graph Neural Networks (GNNs) [118] appear to be a natural way to handle graph-based knowledge [147]. GNNs can indeed capture the local repeating structural information in invoice document tables [113]. In [78], the authors propose a GNN-based method that mixes position and text. Their algorithm also uses visual recognition to predict the right number of columns and rows. In [108], an architecture is introduced that combines the benefits of convolutional neural networks for visual feature extraction with those of graph networks for dealing with the problem structure. Cell detection and cell logic are used to predict the location of the cells in [143]. A unified framework is presented in [152] that combines vision, semantics, and relations for analyzing document layouts, supporting both natural language processing and computer vision-based methods. Slightly different, LGPMA [109] employs a soft pyramid mask learning approach to recover the table structure by analyzing both local and global feature maps. It also considers the location of empty cells during this process.
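Before a GNN can be applied, the document must be turned into a graph. The following minimal sketch builds such a graph from OCR words: each node carries a position feature and a crude textual feature, and each word is linked to its nearest neighbors. It is not the construction used in [78] or [113]; the function `build_document_graph`, the choice of features, and the k-nearest-neighbor rule are assumptions for illustration.

```python
import math


def centre(box):
    """Centre point of a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def build_document_graph(words, k=2):
    """Build node features and undirected edges from (text, box) pairs.

    Node feature: (centre_x, centre_y, is_numeric), mixing position and text.
    Edges: each word is linked to its k nearest words (Euclidean distance
    between centres).
    """
    centres = [centre(box) for _, box in words]
    features = []
    for (text, _), (cx, cy) in zip(words, centres):
        stripped = text.replace(".", "", 1).replace(",", "")
        features.append((cx, cy, float(stripped.isdigit())))
    edges = set()
    for i, (xi, yi) in enumerate(centres):
        others = [j for j in range(len(words)) if j != i]
        others.sort(key=lambda j: math.hypot(xi - centres[j][0], yi - centres[j][1]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return features, sorted(edges)


# Assumed OCR output for a two-line table fragment.
words = [
    ("Item", (10, 10, 50, 20)), ("Qty", (100, 10, 130, 20)),
    ("Widget", (10, 30, 60, 40)), ("2", (100, 30, 110, 40)),
]
features, edges = build_document_graph(words, k=1)
```

With `k=1`, each header word in this example is linked to the cell directly below it, which is the kind of local repeating structure that a GNN can then exploit through message passing.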

E. HANDLING GEOGRAPHIC INFORMATION IN THE INVOICES: POSSIBLE PERSPECTIVES FOR GRAPH-BASED REPRESENTATIONS
Since the layout of invoices is particularly relevant, as described above, let us explore the modeling and processing of geometric or geographic information in order to discover links that cannot be handled by a purely semantic analysis of the document. For example, an invoice may contain a keyword with its expected associated value close to it. Let us review some methods for representing and exploring this structured data. For instance, Esser et al. [39] try to extract templates from scanned documents. This section is devoted to methods that do not rely on image processing or on neural networks trained to handle the global layout of the document. We are rather interested in techniques based on representation models and associated solving techniques that process geometric data in a more frugal (without the need for huge and costly training) and more declarative way.
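The keyword and value example mentioned above can be sketched without any training. The function below looks for the token closest to a keyword, either to its right on the same line or directly below it. The function name `find_value`, the tolerance `line_tol`, and the sample tokens are hypothetical, and real invoices would need more robust rules (multi-word keywords, units, several candidates).

```python
def find_value(tokens, keyword, line_tol=5.0):
    """Return the text of the token nearest to `keyword`.

    tokens: list of (text, (x0, y0, x1, y1)) with the origin at the top-left.
    A candidate must be either right of the keyword on the same line
    (top edges within `line_tol`) or below it with horizontal overlap.
    """
    key = next((t for t in tokens if t[0].lower() == keyword.lower()), None)
    if key is None:
        return None
    kx0, ky0, kx1, ky1 = key[1]
    best_text, best_dist = None, float("inf")
    for token in tokens:
        if token is key:
            continue
        text, (x0, y0, x1, y1) = token
        same_line = abs(y0 - ky0) <= line_tol and x0 >= kx1
        below = y0 >= ky1 and x0 < kx1 and x1 > kx0
        if same_line:
            dist = x0 - kx1  # horizontal gap to the keyword
        elif below:
            dist = y0 - ky1  # vertical gap to the keyword
        else:
            continue
        if dist < best_dist:
            best_text, best_dist = text, dist
    return best_text


# Assumed OCR tokens with their bounding boxes.
tokens = [
    ("ACME", (20, 10, 70, 22)),
    ("Date", (20, 40, 50, 52)),
    ("2024-03-15", (20, 56, 95, 68)),
    ("Total", (300, 500, 340, 512)),
    ("1250.50", (350, 500, 410, 512)),
]
total = find_value(tokens, "Total")
date = find_value(tokens, "Date")
```

Here the value of "Total" is found to its right and the value of "Date" is found below it, two spatial relations that a purely semantic analysis of the text stream would not expose.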
Early on, Cesarini et al. [21] were already interested in the structural analysis of a document by trying to label areas. They consider that an invoice is a set of regions that can be identified using their relative geometrical positions.
As mentioned in Section III-D, graph-based representations have been explored for handling the structure of tables in documents. Therefore, we focus here on such representations and on how they can be exploited to efficiently retrieve table structures and their content. Since the structure of a table may contain different levels, we argue that several levels of abstraction are needed to represent the geometrical structure of a table. Models with geometric constraints and their declarative handling have been explored in [19], where an abstract model is linked to a graphic model and a refinement process is proposed. Geometric constraints [94] require dedicated constraint solvers according to the targeted domains. In [117], we propose an approach based on hypergraphs to handle table extraction. Hypergraphs [15] are classic extensions of graphs and enable more powerful models. Hence, after suitable modeling, one may consider table extraction in a document as an isomorphism problem in hypergraphs [14]. The subgraph isomorphism problem is NP-complete [29], and its complexity has been refined according to parameters [86]. Solvers such as the Glasgow solver [87] are available for this problem, as are efficient algorithms [127], including recent quantum search algorithms [84]. In recent work [74], tables are represented as planar graphs with cell regions as their faces. Junction confidence maps and line fields are generated using heatmap regression networks. This approach mixes deep neural networks and constrained optimization.
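To illustrate how table extraction can be cast as a matching problem, the following minimal sketch searches a labeled document graph for a small labeled pattern by backtracking. It is a simplification made for this example: it works on ordinary graphs rather than the hypergraphs of [117], it looks for a non-induced sub-isomorphism, and it has none of the pruning of dedicated solvers such as [87]. The function `find_subgraph` and the sample graphs are hypothetical.

```python
def find_subgraph(pattern_adj, pattern_labels, target_adj, target_labels):
    """Return one injective mapping pattern node -> target node, or None.

    Every pattern edge must be mapped onto a target edge, and node labels
    must agree. Graphs are given as {node: set_of_neighbours}.
    """
    order = list(pattern_adj)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        p = order[len(mapping)]
        for t in target_adj:
            if t in mapping.values() or target_labels[t] != pattern_labels[p]:
                continue
            # Each already-mapped neighbour of p must be adjacent to t.
            if all(mapping[q] in target_adj[t] for q in pattern_adj[p] if q in mapping):
                mapping[p] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[p]  # backtrack
        return None

    return extend({})


# Assumed pattern: a table row made of a description followed by two numbers.
pattern_adj = {"desc": {"qty"}, "qty": {"desc", "price"}, "price": {"qty"}}
pattern_labels = {"desc": "text", "qty": "number", "price": "number"}

# Assumed document graph: tokens ACME, Widget, 2, 9.99, Thanks linked in a chain.
target_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
target_labels = {0: "text", 1: "text", 2: "number", 3: "number", 4: "text"}

match = find_subgraph(pattern_adj, pattern_labels, target_adj, target_labels)
```

The search first tries to map `desc` to node 0, fails to extend that choice, and backtracks to node 1. The worst-case cost of this exhaustive search grows exponentially with the pattern size, which is consistent with the NP-completeness result cited above.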

F. TURNING TO EFFICIENT SOLUTIONS FOR INDUSTRY
As a starting point, it is worthwhile to examine Extraction, Transformation, and Loading (ETL) processes [135], which form the backbone of operations within a data warehouse architecture and aim to acquire data from diverse document sources, each potentially multimodal. A critical dimension in this context is that the data originates from a variety of document sources, and the multifaceted nature of these documents underscores the complexity of the task. Furthermore, automated document processing systems must be able to update data at regular intervals, which emphasizes the need for real-time adaptability. Following these lines, Figure 4 encapsulates the multifunctional essence of information extraction from invoices. It provides a visual representation of the intricate multitasking inherent in the information extraction workflow.
Some industry solutions cover only part of the total ETL process. They are based on plugins designed for each information retrieval task. For instance, the Azure solution (https://azure.microsoft.com/en-us) developed by Microsoft offers numerous APIs for processing documents, including OCR and NER. The ABBYY solution is split into different programs: FlexiCapture for OCR and FlexiLayout for extracting data from a document using templates.
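The three ETL stages discussed at the beginning of this section can be sketched as follows. This is a minimal illustration, not a description of any of the commercial products mentioned: the `extract` function returns hard-coded records that stand in for OCR and NER output, and the table schema, field names, and accepted date formats are assumptions.

```python
import sqlite3
from datetime import datetime


def extract():
    """Stand-in for OCR/NER output coming from heterogeneous sources."""
    return [
        {"number": "INV-001", "date": "15/03/2024", "total": "1 250,50"},
        {"number": "INV-002", "date": "2024-03-16", "total": "99.90"},
    ]


def transform(record):
    """Normalize one record to (number, ISO date, float total)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            date = datetime.strptime(record["date"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        raise ValueError("unrecognized date: " + record["date"])
    total = float(record["total"].replace(" ", "").replace(",", "."))
    return (record["number"], date, total)


def load(rows, conn):
    """Insert rows, replacing earlier versions so that reloads act as updates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices "
        "(number TEXT PRIMARY KEY, date TEXT, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO invoices VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load([transform(r) for r in extract()], conn)
loaded = conn.execute(
    "SELECT number, date, total FROM invoices ORDER BY number"
).fetchall()
```

Using `INSERT OR REPLACE` keyed on the invoice number is one simple way to support the regular updates mentioned above: running the pipeline again on a corrected invoice overwrites the previous row instead of duplicating it.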
Transformers may now be used to provide end-to-end solutions and to address the various modalities involved in document processing tasks such as classification, question answering, or NER [32], [70]. The diverse nature of documents necessitates multimodal reasoning that encompasses various types of inputs [8]. These inputs, including visual, textual, and layout elements, are found in a variety of document sources. These aspects may be considered when developing efficient invoice processing tools.

IV. CONCLUSION
In conclusion, invoices are crucial documents for companies as they serve as proof of purchase and are necessary for accounting and tax purposes. The processing of invoices can be time-consuming and prone to errors, but recent advances in technology have led to the development of systems that automate the process. These systems use a combination of OCR, NLP, and machine learning techniques to digitize paper invoices and extract relevant information. The processing of invoices involves different steps such as document digitization, information extraction, and data validation, and specific workflows are often used to ensure efficiency and accuracy. The challenge of processing invoices lies in handling the variability of layouts, language, and terminology, and the presence of errors or inaccuracies in the data.
In this survey, we have reviewed the essential components that must be taken into account when developing an automated invoice processing system. Our goal is to provide valuable insights to researchers and engineers striving to create end-to-end solutions. In this pursuit, several critical factors demand careful consideration:
• Document Quality: The quality of the input documents plays a crucial role. Standard digitized invoices can often be handled with relatively basic OCR systems. However, when dealing with documents exhibiting issues or containing handwritten sections, a more sophisticated image processing pipeline and highly efficient text recognition are imperative. Real-world financial documents, for instance, may feature handwritten notes from employees seeking reimbursements, which makes document quality critical.
• Invoice Content: The nature of the invoice content is another crucial consideration. In cases where invoices consist of limited and concise information, without extensive descriptions or intricate commercial terms, employing simple Named Entity Recognition (NER) techniques based on a compact model, as exemplified in Figure 2, suffices. Conversely, for more complex scenarios, the integration of Natural Language Processing (NLP) techniques becomes essential to delve into the semantic nuances of scanned texts.
• Layout Diversity: The diversity of invoice layouts should not be underestimated. When documents are associated with a finite number of suppliers or clients, rule-based techniques designed to match predefined layouts can be harnessed. Moreover, these techniques may offer flexibility, allowing end-users to fine-tune the system to visually locate and extract key information from invoices.
• Annotated Data Sets: Machine learning techniques, while powerful, rely heavily on sizable and representative training datasets for optimal performance. As mentioned in this survey, rule-based approaches can often be generic enough to process invoices effectively without necessitating extensive supervised learning processes.
• Table Diversity and Quality: Tables within invoices represent a pivotal aspect of the processing pipeline. While basic tables can be detected using image processing and neural network-based algorithms, more complex scenarios emerge when tables are incomplete and exhibit considerable diversity, often due to variations in invoice layouts. In such cases, recent graph-based algorithms present a compelling and efficient alternative.
By taking these facets into account, engineers can embark on the development of robust, efficient, and adaptable automated invoice processing systems that cater to a wide spectrum of real-world invoice scenarios. In this context, hybrid methods combining both rule-based and neural network approaches can be considered.
In recent times, there has been a notable emergence of large language models (LLMs). These models present promising prospects for document processing by integrating structural and semantic recognition to achieve effective extraction of information from both structured and semi-structured documents.

FIGURE 2. A UML model for invoices.

FIGURE 3. Advantages and disadvantages of NER methods according to the state of the art.

FIGURE 4. Different steps in an ETL process.

TABLE 1. Summary of cited surveys.

TABLE 2. Summary of available datasets for document analysis and recognition.

TABLE 3. Summary of available datasets for table analysis.

TABLE 4. Summary of main cited works on OCR.
VOLUME 12, 2024. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE 5. Summary of main cited works on data extraction.

TABLE 6. Summary of main cited works on NER.

TABLE 7. Summary of main cited works on table extraction.