Abstract:
Many Information Retrieval (IR) and Natural Language Processing (NLP) systems require textual similarity measurement in order to function, and do so with the help of similarity measures. Similarity measures behave differently: some that work well on highly similar texts do not always perform as well on highly dissimilar texts. In this paper, we evaluated the performance of eight popular similarity measures on four levels (degrees) of textual similarity using a corpus of plagiarised texts. The evaluation was carried out in the context of candidate selection for plagiarism detection. Performance was measured in terms of recall, and the best-performing similarity measure(s) for each degree of textual similarity were identified. Results from our experiments show that the performances of most of the measures were equal on highly similar texts, with the exception of Euclidean distance and Jensen-Shannon divergence, which performed more poorly. Cosine similarity and the Bhattacharyya coefficient performed best on lightly reviewed texts; on heavily reviewed texts, cosine similarity and Pearson correlation performed best and next best respectively. Pearson correlation had the best performance on highly dissimilar texts. The results also show the term-weighting methods and n-gram document representations that best optimise the performance of each similarity measure at a particular level of intertextual similarity.
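As a rough illustration of the kind of measures compared (a minimal sketch, not the paper's actual implementation; the n-gram tokenisation and weighting choices here are assumptions), two of the evaluated measures can be computed over word n-gram count vectors as follows:

```python
import math
from collections import Counter

def ngram_counts(text, n=1):
    """Raw term-frequency vector of word n-grams (a simple unweighted scheme)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 1.0 means identical direction."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bhattacharyya_coefficient(a, b):
    """Overlap of the two normalised term distributions; 1.0 means identical."""
    total_a, total_b = sum(a.values()), sum(b.values())
    keys = set(a) | set(b)
    return sum(math.sqrt((a[k] / total_a) * (b[k] / total_b)) for k in keys)
```

For candidate selection, such scores would be computed between a suspicious document and each source document, and the highest-scoring sources retained; both measures return 1.0 for identical texts and fall toward 0.0 as the texts diverge.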
Published in: 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)
Date of Conference: 12-14 November 2015
Date Added to IEEE Xplore: 01 August 2016
Conference Location: Lisbon, Portugal