NBID Dataset: Towards Robust Information Extraction in Official Documents

Abstract:

The Visual Document Understanding (VDU) task is of great interest to a variety of organizations, including banks, governments, and schools, all of which would benefit from reliable automatic information extraction from pictures of documents. However, due to the sensitive nature of the data, creating new datasets of official documents, such as identity cards and passports, is very challenging, as the data must first be safely anonymized and synthesized. This process requires modifying the source images, which may affect the performance of VDU models. In this paper, we propose a new dataset and the synthesizer used to generate it, both made publicly available. We also evaluate three state-of-the-art VDU models (PICK, StrucTexT, and DocFormer) on the dataset to study the impact of the synthetic data on performance. We trained the models under both synthetic-only and synthetic-plus-real data protocols and present the results for both. Our synthesis process is shown to benefit training when used in addition to the real data.

Published in: 2023 36th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)

Date of Conference: 06-09 November 2023

Date Added to IEEE Xplore: 18 December 2023

DOI: 10.1109/SIBGRAPI59091.2023.10347174

Conference Location: Rio Grande, Brazil

I. Introduction

Document Understanding is a broad area comprising many tasks that extract information from documents. Recent research has focused on building a robust representation in an unsupervised pre-training phase and then fine-tuning the model with supervision for downstream tasks [1], [2]. Pre-training is usually done on very large datasets, such as [3], which has millions of documents, a practice borrowed from NLP [4], [5].
