To address the slow indexing speed of serial algorithms for constructing Web page inverted indexes, and observing that merge sort fits the theory of divisible load scheduling in parallel and distributed systems, this paper proposes a new parallel algorithm based on triple sort-merge for constructing Web page inverted indexes. The algorithm distributes and parallelizes the two most time-consuming tasks in inverted index construction: parsing terms and sorting term postings. Each term is represented as a triple, and the time complexity of the algorithm is analyzed. Using the Java middleware ProActive, the paper also designs and implements a distributed parallel Web page indexer, P_Indexer, on cluster computing systems. The algorithm analysis and experimental results show that the parallel algorithm achieves high efficiency and good scalability.
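The general idea of sort-merge index construction can be illustrated with a minimal Python sketch. It is not the P_Indexer implementation: the abstract does not specify the fields of the triple, so the layout (term, doc_id, term_frequency) is an assumption, as are the function names, the tokenizer, and the use of local threads in place of cluster nodes managed by ProActive. Each worker parses its partition of documents into triples and sorts them locally; the sorted runs are then merged and grouped into postings lists.

```python
import heapq
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def parse_and_sort(docs):
    """Worker step: parse (doc_id, text) pairs into locally sorted triples.

    The triple layout (term, doc_id, term_frequency) is an assumption made
    for this sketch; the abstract does not define the fields.
    """
    triples = []
    for doc_id, text in docs:
        # Hypothetical tokenizer: lowercase alphanumeric runs.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        triples.extend((term, doc_id, tf) for term, tf in counts.items())
    triples.sort()  # orders by term, then doc_id
    return triples


def build_index(docs, workers=2):
    """Split documents across workers, then merge their sorted runs."""
    # Round-robin partitioning stands in for divisible load scheduling.
    partitions = [docs[i::workers] for i in range(workers)]
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(parse_and_sort, partitions))
    # k-way merge of the already sorted runs.
    merged = heapq.merge(*runs)
    # Group consecutive triples by term to form postings lists.
    return {
        term: [(doc_id, tf) for _, doc_id, tf in group]
        for term, group in groupby(merged, key=lambda t: t[0])
    }


DOCS = [
    (1, "web page index"),
    (2, "parallel index of web pages"),
    (3, "index index merge"),
]
INDEX = build_index(DOCS)
```

Here `INDEX["index"]` holds one posting per document containing the term, ordered by document ID. Because every run is sorted before merging, the final merge is a single linear pass over all triples, which is what makes the parse and sort phases straightforward to distribute.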