Document dissemination is often limited to a "need to know" basis to better protect organizational trade secrets. Retrieving documents via a search engine that are off-topic to a user's predefined area of information need (task) is potentially a violation of access rights and is a concern to every private, commercial, and governmental organization. Such misuse, defined as "off-topic access to sensitive data by an authorized user", is the second most prevalent form of computer crime after viruses, per a recent Computer Security Institute/Federal Bureau of Investigation study. We present a content-based approach that clusters query results to detect off-topic searches. This approach supports higher detection precision than the state of the art. We propose multiple methods for picking the "good" clusters and evaluate their effect on the detection rate and precision. High detection precision is critical, as a false access violation accusation unfairly and inappropriately subjects the user to scrutiny. Our empirical results show that clustering query results can significantly reduce such false positives.
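To make the general idea concrete, the following is a minimal sketch of how clustering query results might feed an off-topic decision. It is not the method evaluated in this work: the abstract does not specify the clustering algorithm, the cluster selection methods, or any thresholds. The single-pass clustering, the size-based rule for picking "good" clusters, the bag-of-words cosine similarity, and every name and parameter below (`cluster_results`, `is_off_topic`, `sim_threshold`, `min_cluster_size`, `topic_threshold`) are illustrative assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_results(docs, sim_threshold=0.3):
    """Single-pass clustering (an assumption, not the paper's algorithm):
    each document joins its most similar cluster, or starts a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= sim_threshold:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def is_off_topic(profile_text, docs, min_cluster_size=2, topic_threshold=0.2):
    """Flag a query as off-topic when no "good" cluster of its results
    resembles the user's task profile. Here "good" means "has at least
    min_cluster_size members", which is a hypothetical selection rule."""
    if not docs:
        return False  # nothing retrieved, nothing to flag
    profile = vectorize(profile_text)
    clusters = cluster_results(docs)
    good = [c for c in clusters if len(c["members"]) >= min_cluster_size]
    if not good:
        good = clusters  # fall back to all clusters if none is large enough
    best = max(cosine(profile, c["centroid"]) for c in good)
    return best < topic_threshold
```

In this sketch, a stray irrelevant result ends up in a small cluster that the selection rule ignores, so it cannot by itself trigger or suppress a flag; only the dominant themes of the result set are compared with the profile. This is one plausible reading of why cluster selection could reduce false positives, under the assumptions stated above.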