Web Based Cross Language Semantic Plagiarism Detection

2 Author(s)
Chow Kok Kent and Naomie Salim, Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, Skudai, Malaysia

As the Internet helps us cross language and cultural borders, and with many translation tools freely available, cross-language plagiarism is bound to rise. Semantic plagiarism, in which a student reconstructs a sentence or replaces terms with their synonyms, also raises concern in the academic field. Both kinds of plagiarism are hard to detect because the plagiarised text and its source have different fingerprints, and the plagiarism detection tools currently available are not capable of detecting such cases. In this research, we propose a new approach to detecting both cross-language and semantic plagiarism. We take Bahasa Melayu as the language of the submitted document and English as the language of similar, possibly plagiarised source documents. The system first shortens the query document using a fuzzy swarm-based summarisation approach, on the view that the summary carries the most important keywords in the document. The summary is translated into English using the Google Translate Application Programming Interface (API), after which the words are stemmed and stop words are removed. The tokenised documents are then sent to the Google AJAX Search API to retrieve similar documents from the World Wide Web. We integrate the Stanford Parser and WordNet to determine the level of semantic similarity between the suspected document and the candidate source documents. The Stanford Parser assigns each term in a sentence a role such as noun, verb, or adjective; based on these roles, we represent each sentence in predicate form, and similarity between predicates is measured using information content values from the WordNet taxonomy. Our test dataset is built from two sets of Malay documents produced using different plagiarism techniques.
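The information-content similarity the abstract refers to can be sketched as follows. This is a minimal illustration of Resnik-style similarity, where sim(c1, c2) = IC of the lowest common subsumer and IC(c) = -log p(c); the tiny is-a taxonomy and frequency counts below are invented stand-ins for WordNet and its corpus statistics, not anything from the paper.

```python
import math

# Toy is-a taxonomy standing in for WordNet (child -> parent).
# All concept names here are illustrative assumptions.
PARENT = {
    "car": "vehicle", "bicycle": "vehicle",
    "vehicle": "artifact", "artifact": "entity",
    "dog": "animal", "cat": "animal", "animal": "entity",
}
# Invented corpus frequencies for the leaf concepts; counts propagate
# upward so that more general concepts are more probable.
FREQ = {"car": 40, "bicycle": 10, "dog": 30, "cat": 20}
TOTAL = sum(FREQ.values())

def ancestors(c):
    """Path from a concept up to the root, inclusive of the concept itself."""
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def subtree_count(c):
    """Total frequency of a concept: its own count plus all descendants'."""
    return sum(f for w, f in FREQ.items() if c in ancestors(w))

def ic(c):
    """Information content: -log p(concept)."""
    return -math.log(subtree_count(c) / TOTAL)

def resnik(c1, c2):
    """Resnik similarity: the IC of the most informative common subsumer."""
    common = [a for a in ancestors(c1) if a in ancestors(c2)]
    return max(ic(a) for a in common)
```

Under this measure, `resnik("car", "bicycle")` exceeds `resnik("car", "dog")`, because the two vehicles share a more specific (higher-IC) subsumer than the unrelated pair. In the full system each sentence is first reduced to a predicate of parser-assigned roles (noun, verb, adjective), and role-aligned terms would be scored with a measure of this kind against the real WordNet taxonomy.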
Our results show that the proposed semantic similarity measure achieves higher precision, recall, and F-measure than the conventional Longest Common Subsequence (LCS) approach.
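The LCS baseline compared against here can be sketched with the standard dynamic-programming recurrence. Operating on token lists and normalising by the shorter document's length are our assumptions, since the abstract does not specify how the LCS score is normalised.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(a, b):
    """Normalised LCS score in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))
```

This baseline also illustrates why the paper's semantic measure helps: replacing "chased" with the synonym "pursued" leaves a predicate-level comparison through WordNet largely unaffected, but immediately shortens the token-level common subsequence and lowers the LCS score.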

Published in:

2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (DASC)

Date of Conference:

12-14 Dec. 2011