Text plagiarism is growing rapidly with the development of the Internet, and many plagiarism detection algorithms have been proposed. However, most algorithms focus on optimized one-to-one comparison rather than massive document comparison, and such algorithms suffer poor time performance when users conduct an exhaustive search over a huge set of documents. In this paper, we propose an optimized preprocessing model to detect similar text in massive document repositories. The model uses an efficient data structure called GDIC (Global Dictionary) for preprocessing. After filtering stop words, we choose pairs of documents to be inspected using two methods simultaneously, both built on the concept of common non-stop words but applying it in slightly different ways. The first method chooses pairs of documents with a high frequency of common non-stop words, while the second chooses pairs with a high proportion of common non-stop words. We evaluate the model experimentally: the proposed preprocessing model drastically reduces search time, to 64–87%, while sensitivity stands at 77–96%. With this model, GDIC generation accounts for a large proportion of the total detection time. In future work, we will optimize GDIC creation time to improve the performance of the entire system.
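The candidate-pair selection described above can be sketched as follows. This is a toy illustration under our own assumptions, not the paper's implementation: the function names, the tiny stop-word list, and the thresholds `min_common` (method 1, absolute count of shared non-stop words) and `min_ratio` (method 2, proportion relative to the smaller document) are all hypothetical.

```python
from itertools import combinations

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # illustrative subset

def build_gdic(docs):
    """Map each non-stop word to the set of documents containing it
    (a toy stand-in for the paper's GDIC / Global Dictionary)."""
    gdic = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()) - STOP_WORDS:
            gdic.setdefault(word, set()).add(doc_id)
    return gdic

def candidate_pairs(docs, min_common=2, min_ratio=0.5):
    """Select document pairs to inspect, combining both selection methods:
    method 1 keeps pairs whose count of common non-stop words is high,
    method 2 keeps pairs whose proportion of common non-stop words is high."""
    gdic = build_gdic(docs)
    # Count shared non-stop words per pair via the inverted index.
    common = {}
    for word, doc_ids in gdic.items():
        for a, b in combinations(sorted(doc_ids), 2):
            common[(a, b)] = common.get((a, b), 0) + 1
    # Non-stop vocabulary size of each document, for the proportion test.
    sizes = {d: sum(1 for ids in gdic.values() if d in ids) for d in docs}
    selected = set()
    for (a, b), n in common.items():
        ratio = n / min(sizes[a], sizes[b])
        if n >= min_common or ratio >= min_ratio:
            selected.add((a, b))
    return selected
```

Only pairs passing either filter would then go on to the expensive one-to-one comparison, which is where the reported reduction in search time would come from.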