A comparative study of two recent word spotting techniques in the run-length compressed domain | IEEE Conference Publication | IEEE Xplore

Abstract:

This paper presents a comparative study of two recent word spotting techniques ([1] and [2]) that operate directly in the run-length compressed domain. The first technique is based on partial decompression and limited use of OCR, whereas the second is completely decompression-less and OCR-less. Both techniques initially use the word bounding box ratio feature to match words in the database of compressed document images. For every matching test word, the strategy in the first model is to decompress and OCR the first two characters and match them against the keyword characters. If this match succeeds, the remaining characters of the test word are decompressed and OCRed, and finally matched against the keyword. The strategy in the second model is to extract run-based features, such as the number of run transitions and the corresponding correlation of runs along selected regions of the matching test word, and then match them against those of the specified keyword. The proposed methods work in the Run-Length Compressed Domain (RLCD) and can operate on CCITT Group 3 1D, CCITT Group 3 2D, and CCITT Group 4 2D compressed documents supported by the TIFF and PDF file formats. The efficacy of the proposed models is demonstrated through experimental results and comparative analysis.
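
As a rough illustration of the kind of features the abstract mentions, the following Python sketch computes a word bounding box ratio, a per-row count of run transitions, and a correlation between two transition profiles. It is not the method of [1] or [2]. The row layout (alternating white/black run lengths, white first), the tolerance value, the use of Pearson correlation, and all function names are assumptions made only for this example.

```python
import math
from typing import List

# Assumption: each row of a word image is a list of alternating run lengths,
# starting with a white run (which may have length 0).
Row = List[int]


def aspect_ratio(width: int, height: int) -> float:
    """Word bounding box ratio, taken here as width divided by height."""
    return width / height


def ratio_matches(keyword_ratio: float, test_ratio: float, tol: float = 0.1) -> bool:
    """Coarse first filter: ratios agree within a relative tolerance (hypothetical value)."""
    return abs(keyword_ratio - test_ratio) <= tol * keyword_ratio


def run_transitions(row: Row) -> int:
    """Number of colour changes along a row: non-empty runs minus one."""
    nonzero = [r for r in row if r > 0]
    return max(len(nonzero) - 1, 0)


def transition_profile(rows: List[Row]) -> List[int]:
    """Run-transition count for every row of a word region."""
    return [run_transitions(r) for r in rows]


def pearson(a: List[int], b: List[int]) -> float:
    """Pearson correlation of two equal-length profiles (a stand-in similarity measure)."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Constant profiles: treat identical profiles as a perfect match.
        return 1.0 if a == b else 0.0
    return cov / math.sqrt(var_a * var_b)


# Toy 8-pixel-wide, 3-row word region expressed as runs.
keyword_rows = [[0, 3, 2, 3], [2, 4, 2], [8]]
keyword_profile = transition_profile(keyword_rows)
```

In this sketch the bounding box ratio acts as a cheap filter, and the transition profiles are compared only for candidates that pass it, mirroring the two-stage matching the abstract describes.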
Date of Conference: 13-16 September 2017
Date Added to IEEE Xplore: 04 December 2017
Conference Location: Udupi, India

I. Introduction

In an attempt to move towards a paperless office, a large number of printed documents are being digitized and archived in image databases, digital libraries, and internet applications, with the intention of preserving these documents for long-term use, serving large groups of people, and facilitating e-governance applications [3]. Such a huge collection of document images poses a challenge when searching for relevant documents, even though searching is an important and frequently used operation. Since the archived documents are in image form, existing text processing (searching) algorithms cannot operate on them, which makes it necessary to provide a facility for searching relevant documents in the image form itself [3], [4]. In the literature, two important approaches have been proposed to address this issue, based on the concepts of Digital Image Processing (DIP) and Document Image Retrieval (DIR) [5]. The first approach uses digital image processing techniques that analyze the text areas in the document image and convert them into machine-readable ASCII text, thus making the text searchable with simple text processing algorithms. The DIP techniques employ suitable text segmentation algorithms and then use Optical Character Recognition (OCR) to bring the text contents into an editable ASCII form. However, OCR-based techniques are very sensitive to the noise and degradation present in the document image [3], [4], and the performance of OCR depends largely on the quality of the input image and on the segmentation algorithm applied. Because of these limitations, a DIR technique known as keyword (or word) spotting was introduced in [6] and later improved by many researchers, as reported in [5]. Keyword spotting is an OCR-less approach for locating specified keywords in a document image, and it works on the principle of image matching [5], [6].
