Web content extraction is the task of extracting structured information from unstructured and semi-structured machine-readable documents. In most cases this involves processing human-language text by means of natural language processing (NLP). Recent work in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction. Information retrieval, by contrast, is driven by a user's query, and the retrieved information must then be extracted using web content extraction. The challenges of web page content extraction are growing. In this work, we study the problem of automatically extracting content from web pages. Much research has addressed this problem, but existing approaches have limitations: they do not scale to large numbers of web pages, and they depend on the page's markup language (HTML). Our proposed work aims to overcome these limitations. It deals with an information retrieval process in which a vision-based approach is applied to extract both images and text from web pages. Most research shows that when a page is presented to a user, spatial and visual features play a very important role because they help the user unconsciously divide the page into several semantic parts. Hence, the proposed work focuses on the primary visual features of a web page, and extraction is carried out on the basis of these features. This approach can achieve better performance than traditional methods.
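
The abstract does not specify the segmentation algorithm, so the following is only a minimal illustrative sketch of the general idea of dividing a page into semantic parts from spatial features rather than from HTML tags. It assumes the page has already been rendered and that each text or image block is available with its bounding box; the `Block` type, the `segment_by_vertical_gap` function, and the 30-pixel gap threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A rendered page element with its bounding box (hypothetical type)."""
    kind: str       # "text" or "image"
    x: float        # left edge in pixels
    y: float        # top edge in pixels
    width: float
    height: float
    content: str    # text content or image source


def segment_by_vertical_gap(blocks: List[Block],
                            gap_threshold: float = 30.0) -> List[List[Block]]:
    """Group blocks into visual segments separated by vertical whitespace.

    A new segment starts whenever the top of the next block lies more than
    gap_threshold pixels below the bottom of the current segment.
    The threshold is an assumed value chosen for illustration.
    """
    ordered = sorted(blocks, key=lambda b: (b.y, b.x))
    segments: List[List[Block]] = []
    bottom = None  # lowest bottom edge seen in the current segment
    for block in ordered:
        if bottom is None or block.y - bottom > gap_threshold:
            segments.append([block])
            bottom = block.y + block.height
        else:
            segments[-1].append(block)
            bottom = max(bottom, block.y + block.height)
    return segments


# Made-up layout: a header, a two-column body, and a footer.
blocks = [
    Block("image", 0, 0, 800, 100, "logo.png"),
    Block("text", 0, 110, 800, 20, "Site title"),
    Block("text", 0, 200, 500, 300, "Article body"),
    Block("image", 520, 200, 280, 150, "figure.png"),
    Block("text", 0, 600, 800, 40, "Footer links"),
]
segments = segment_by_vertical_gap(blocks)
for i, segment in enumerate(segments):
    print(i, [(b.kind, b.content) for b in segment])
```

Because the grouping relies only on positions and sizes, text and image blocks fall into the same segment when they are visually adjacent, regardless of how the underlying markup is nested. A real vision-based method would likely also use cues such as font, color, and horizontal separators.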