Abstract:
The Google newspaper search program was launched on September 8, 2008. In this paper, we outline the technology pieces underlying this large and complex project. We have created a production pipeline that takes newspaper microfilms as input and emits individual news articles as output. These articles are then indexed and added to the content base so that they turn up in response to Google searches. Thus, in response to the Google query "Hitler death", we are able to show newspaper articles from the very day the event was reported, authentic and unbiased by the passage of time. Non-uniform illumination, significant noise, and tears and scratches in the microfilm images all pose special challenges for this project. The significant variation of layouts across newspapers and eras, and the variations in font size within a single page (which confuse the OCR engine), compound the difficulties. The project is ongoing; the initial launch included about 15 million news articles.
Date of Conference: 26-29 July 2009
Date Added to IEEE Xplore: 02 October 2009