Abstract:
The PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the “uncertainty” that is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure on groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation that runs in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains.
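To make the idea of a fuzzy condition over spatial and content predicates more concrete, the following is a minimal illustrative sketch. It is not the formalism or the system described in the paper: the `Token` class, the two predicates, the tolerance value, and the use of the minimum as fuzzy conjunction are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class Token:
    """Hypothetical token: a text fragment with a position on the page."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points


def same_line(a: Token, b: Token, tolerance: float = 4.0) -> float:
    """Spatial predicate: degree in [0, 1] to which two tokens share a baseline."""
    return max(0.0, 1.0 - abs(a.y - b.y) / tolerance)


def looks_like_amount(t: Token) -> float:
    """Content predicate: 1.0 for a fully numeric token, 0.5 if it only contains digits."""
    if re.fullmatch(r"\d+(\.\d{2})?", t.text):
        return 1.0
    return 0.5 if any(ch.isdigit() for ch in t.text) else 0.0


def fuzzy_and(*degrees: float) -> float:
    """Minimum t-norm, one common choice of fuzzy conjunction (an assumption here)."""
    return min(degrees)


# Degree to which a label and a value can be grouped as a "total amount" pair.
label = Token("Total:", x=72.0, y=500.0)
value = Token("129.90", x=140.0, y=501.0)
degree = fuzzy_and(same_line(label, value), looks_like_amount(value))
print(degree)  # 0.75
```

In this sketch a grouping is not simply accepted or rejected: the one-point baseline offset lowers the degree to 0.75, and a wrapper could keep the candidate grouping with the highest degree.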
Published in: IEEE Transactions on Knowledge and Data Engineering (Volume: 23, Issue: 12, December 2011)

DEIS Department, University of Calabria, Rende, Cosenza, Italy
Sergio Flesca received the PhD degree in computer science from the University of Calabria. He was a research fellow at the Database and Artificial Intelligence (DBAI) group, Institute of Information Systems, Department of Computer Science, Vienna University of Technology, Austria. Currently, he is an associate professor at the University of Calabria. His research interests include databases, web and semistructured data management, information extraction, inconsistent data management, and approximate query answering.

ICAR Institute of National Research Council, Rende, Cosenza, Italy
Elio Masciari received the PhD degree in computer science engineering at the University of Calabria, Italy. Currently, he is a senior researcher at the ICAR institute of the Italian National Research Council. His research activity is mainly focused on techniques for mining structured and unstructured data, XML query languages, and data streams.

DEIS Department, University of Calabria, Rende, Cosenza, Italy
Andrea Tagarelli graduated magna cum laude in computer engineering in 2001, and received the PhD degree in computer and systems engineering in 2006. He is an assistant professor of computer science at the University of Calabria, Italy. He was a research fellow at the Department of Computer Science and Engineering, University of Minnesota at Minneapolis, in 2007. His research interests include topics in knowledge discovery and text/data mining, information extraction, web and semistructured data management, uncertain data management, spatio-temporal databases, and applications in biomedicine.