Large scale Hamming distance query processing | IEEE Conference Publication | IEEE Xplore

Large scale Hamming distance query processing


Abstract:

Hamming distance has been widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming distance range query problems, where the goal is to find all strings in a database that are within a Hamming distance bound k from a query string. If k is fixed, we have a static Hamming distance range query problem. If k is part of the input, we have a dynamic Hamming distance range query problem. For the static problem, the prior art uses a large amount of memory due to its aggressive replication of the database. For the dynamic problem, as far as we know, there is no space- and time-efficient solution for arbitrary databases. In this paper, we first propose a static Hamming distance range query algorithm called HEngine_s, which addresses the space issue in prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming distance range query algorithm called HEngine_d, which addresses the limitation in prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngine_s uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngine_d processes queries 46 times faster than linear scan while using only 1.7 times more space.
Date of Conference: 11-16 April 2011
Date Added to IEEE Xplore: 16 May 2011
Conference Location: Hannover, Germany

I. Introduction

Given two equal-length strings $s$ and $t$, the Hamming distance between them, denoted $HD(s, t)$, is the number of positions where the two strings differ. For example, $HD(10110, 10011) = 2$. Several classic similarity search problems are defined for Hamming distance. In particular, there is the Hamming distance range query problem, which was first posed by Minsky and Papert in 1969 [22], where the goal is to find all strings in a database that are within a Hamming distance bound $k$ from a query string. A related problem is the $k$ nearest neighbor (KNN) problem, where the goal is to find the $k$ nearest neighbors in a given database with respect to Hamming distance to a given query string. Although these problems are simple and natural, finding fast and space-efficient algorithms for solving them remains a fundamental open problem where new results are still emerging.
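To make the two problem definitions concrete, the following is a minimal Python sketch of the linear-scan baseline, not of the HEngine algorithms themselves. The function names (`hamming_distance`, `range_query`, `knn_query`) and the representation of the database as a list of equal-length strings are assumptions made for illustration only.

```python
import heapq


def hamming_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s, t))


def range_query(database, query, k):
    """Linear scan: return every string within Hamming distance k of query."""
    return [s for s in database if hamming_distance(s, query) <= k]


def knn_query(database, query, k):
    """Linear scan: return the k strings closest to query, nearest first."""
    return heapq.nsmallest(k, database, key=lambda s: hamming_distance(s, query))


# Hypothetical toy database of 4-bit strings.
db = ["0000", "0001", "0111", "1111"]
within_one = range_query(db, "0000", 1)
two_nearest = knn_query(db, "0000", 2)
```

Both queries here inspect every string in the database, so their cost grows linearly with the database size; this is the cost that index-based approaches aim to avoid.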
