Skip to Main Content
Many Web sites contain a large collection of "structured" Web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. Our goal is to automatically extract structured data from a collection of pages described above, without any human input like manually generated rules or training sets. Extracting structured data gives us greater querying power over the data and is useful in information integration systems. Our approach consists of two stages. In the first stage, the unknown template used to create the pages is deduced. In the second stage, the deduced template is used to extract the values. We focus on the first stage since it is more challenging. The full version contains formal definition of high occurrence correlation and our algorithm. We evaluated our approach by considering 9 real collections of pages.