Identifying duplicated web pages is an essential step in a search engine, and the quality of this identification directly affects the engine's performance. This article studies and summarizes the basic processing steps and key technologies of duplicated web page identification in search engines. Based on a series of experiments, we analyze and compare the performance of several basic algorithms, and summarize their advantages and disadvantages. Finally, we propose using distributed computing frameworks such as Hadoop to identify duplicated web pages, in order to process the massive amount of Internet information handled by a search engine more efficiently.
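As an illustration of the kind of basic algorithm compared in such studies (the article's own algorithms are not detailed here), one common approach measures near-duplication by the Jaccard similarity of word shingles. The sketch below, with hypothetical helper names `shingles` and `jaccard`, shows the idea:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word n-grams) of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 1.0 means identical content."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two pages that differ only in their last two words.
page1 = "the quick brown fox jumps over the lazy dog near the river bank"
page2 = "the quick brown fox jumps over the lazy dog near the old bridge"

sim = jaccard(shingles(page1), shingles(page2))
# Pages whose similarity exceeds a chosen threshold (e.g. 0.9)
# would be treated as near-duplicates.
print(round(sim, 2))
```

In a distributed setting of the kind the article proposes, the shingle extraction maps naturally onto a MapReduce job: mappers emit fingerprints of each page's shingles, and reducers group pages that share many fingerprints.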