This paper proposes a novel method for detecting text similarity among Chinese research papers using the MapReduce paradigm. Our approach differs from state-of-the-art methods in two respects. First, we extract the key sentences from Chinese research papers using heuristic features and then generate 2-tuples of the form (document id, key phrase) as the representation of the documents. Second, we design a two-phase MapReduce algorithm to verify the effectiveness of the generated 2-tuples. For evaluation, we compare the proposed method with other approaches on a synthetic corpus. Experimental results reveal that our method substantially outperforms the state-of-the-art methods in running time while preserving the Jaccard similarity coefficient.
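The abstract does not specify how the MapReduce phases are organized, so the following is only an illustrative sketch of one common way to compute pairwise Jaccard similarity from (document id, key phrase) tuples, simulated in plain Python rather than on a MapReduce cluster. It assumes that the first phase groups document ids by key phrase and that the second phase counts the phrases shared by each document pair. The function names and this division into phases are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def phase1_group_by_phrase(tuples):
    """Assumed phase 1: group document ids by the key phrase they contain."""
    index = defaultdict(set)
    for doc_id, phrase in tuples:
        index[phrase].add(doc_id)
    return index


def phase2_count_shared(index):
    """Assumed phase 2: count the phrases shared by each pair of documents."""
    shared = defaultdict(int)
    for doc_ids in index.values():
        # Every pair of documents listed under one phrase shares that phrase.
        for a, b in combinations(sorted(doc_ids), 2):
            shared[(a, b)] += 1
    return shared


def jaccard_scores(tuples):
    """Return the Jaccard coefficient for every pair sharing at least one phrase."""
    unique = set(tuples)  # ignore repeated (doc_id, phrase) tuples
    sizes = defaultdict(int)
    for doc_id, _ in unique:
        sizes[doc_id] += 1
    shared = phase2_count_shared(phase1_group_by_phrase(unique))
    # |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|
    return {
        (a, b): n / (sizes[a] + sizes[b] - n)
        for (a, b), n in shared.items()
    }


sample = [("d1", "a"), ("d1", "b"), ("d2", "b"), ("d2", "c"), ("d3", "x")]
scores = jaccard_scores(sample)
```

In this toy input, d1 and d2 share one of three distinct phrases, so their coefficient is 1/3, and d3 shares nothing and appears in no pair. Pairs with no common phrase are never generated, which is the usual motivation for grouping by phrase before comparing documents.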