Text extraction from mixed-type documents is a crucial preprocessing stage in applications such as OCR. Unlike most previous work, the present technique handles text of varying orientation and size. It first identifies marks by contour following, then applies PCA (principal component analysis) to determine the direction of the main axis of each mark. Next, a nearest-neighbor technique finds the shortest distances between marks, and a feature vector built from the calculated mark dimensions and distances is fed into a SOFM (self-organizing feature map), which defines homogeneous mark clusters. The resulting cluster weights and variances are used to form a set of fuzzy rules, and a fuzzy classification scheme identifies each mark as a character or non-character. The technique correctly and quickly extracts text areas in a variety of mixed-type documents.
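As an illustration of the PCA and nearest-neighbor steps only, the following minimal sketch estimates the main-axis direction and extents of a single mark from its pixel coordinates, and the shortest distance from each mark to any other mark. It is not the authors' implementation: the function names `mark_features` and `nearest_neighbor_distances`, the use of mark centroids for distances, and the choice of NumPy are all assumptions made for this example, and the contour-following, SOFM, and fuzzy-classification stages are not shown.

```python
import numpy as np


def mark_features(points):
    """Hypothetical helper: PCA-based (length, width, angle) of one mark.

    points: iterable of (x, y) pixel coordinates belonging to the mark.
    The angle is the main-axis direction in radians, folded into [0, pi).
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the last column
    # of eigvecs is the direction of largest variance (the main axis).
    _, eigvecs = np.linalg.eigh(cov)
    main_axis = eigvecs[:, -1]
    minor_axis = eigvecs[:, 0]
    # Extent of the mark along each principal axis.
    length = np.ptp(centered @ main_axis)
    width = np.ptp(centered @ minor_axis)
    # An axis has no sign, so fold the angle into [0, pi).
    angle = np.arctan2(main_axis[1], main_axis[0]) % np.pi
    return length, width, angle


def nearest_neighbor_distances(centroids):
    """Hypothetical helper: distance from each mark centroid to its nearest other centroid.

    Assumes at least two centroids; uses a brute-force pairwise comparison.
    """
    c = np.asarray(centroids, dtype=float)
    diff = c[:, None, :] - c[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore each mark's distance to itself
    return dist.min(axis=1)
```

In a pipeline like the one described, the per-mark length, width, and nearest-neighbor distance could be concatenated into the feature vector passed to the clustering stage. Which quantities the paper actually uses, and how it measures distances between marks, is not specified in the abstract.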
Date of Conference: 8-11 Dec. 2008