NBID Dataset: Towards Robust Information Extraction in Official Documents


Abstract:

The Visual Document Understanding (VDU) task is of great interest to a variety of organizations, including banks, governments, and schools, all of which would benefit from reliable automatic information extraction from pictures of documents. However, due to the sensitive nature of the data, creating new datasets for official documents, such as identity cards and passports, proves to be very challenging, as the data must first be safely anonymized and synthesized. Such a process requires the source images to be modified, which may impact the performance of VDU models. In this paper, we propose a new dataset and the synthesizer used for its generation, both made publicly available. We also selected three state-of-the-art VDU models (PICK, StrucTexT, and DocFormer) for evaluation on the dataset, in order to study the impact of the synthetic data on performance. We trained the models using both synthetic-only and synthetic-plus-real data protocols and present the results for both datasets. Our synthesizing process is shown to benefit training when used as an addition to the real data.
Date of Conference: 06-09 November 2023
Date Added to IEEE Xplore: 18 December 2023
Conference Location: Rio Grande, Brazil

I. Introduction

Document Understanding is a broad area comprising many tasks that extract information from documents. Recent research has focused on building a robust representation in an unsupervised pre-training phase and then fine-tuning the model with supervision for downstream tasks [1], [2]. The pre-training is usually done with very large datasets, such as [3], which has millions of documents, a practice borrowed from NLP [4], [5].
