Similarity-based Retrieval in High Dimensional Data with Recursive Lists of Clusters: A Study Case with Natural Language Dictionaries

1 Author(s)
Barbosa, F. ; Dept. de Inf., UNL, Caparica

An important issue in similarity-based retrieval over high-dimensional data objects is the data representation. To use an indexing structure that can effectively handle large databases, it is essential to reduce the dimensionality of the data objects. Symbolic representation of the objects is a promising dimension-reduction technique, and it lets researchers draw on algorithms and techniques from the area of text retrieval. A similarity search engine finds the objects in some collection that are similar to a given object. Comparing the given object to every other object in a large database is prohibitively slow. If the objects can be placed in a metric space, the search can be sped up by comparing the query object to a restricted number of objects rather than to the entire database. If the objects are strings (text) and a "good" metric for comparing them exists, we get a metric space. Metric data structures are used to make similarity search in metric spaces efficient. We evaluate the performance of range queries in the recursive lists of clusters (RLC) metric data structure when the metric spaces are natural language dictionaries with the extended edit distance (EED). The study compares RLC with the Vp-tree data structure on six different dictionaries, which are characterized by the mean and the variance of their histograms of distances. The experimental results show that RLC performs well in all the tested cases and, in some of them, outperforms the Vp-tree. In addition, RLC is the only data structure that always keeps its good performance, whether the space dimension is lower or higher.
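As a rough illustration of why a metric lets a range query skip most comparisons, the following Python sketch filters a small word list with a single pivot and the triangle inequality. It is not the RLC or the Vp-tree described in the paper, and it uses the plain Levenshtein distance in place of the extended edit distance (EED); the function names, the word list, and the choice of pivot are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit-cost insert, delete and substitute.

    Assumption: stands in for the paper's extended edit distance (EED).
    """
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_pivot_table(words, pivot):
    """Precompute d(pivot, w) for every word (done once, at index time)."""
    return [(w, levenshtein(pivot, w)) for w in words]


def range_query(table, pivot, query, radius):
    """Return (matches, number of distance computations at query time)."""
    d_qp = levenshtein(query, pivot)
    computations = 1
    matches = []
    for word, d_pw in table:
        # Triangle inequality: d(q, w) >= |d(q, p) - d(p, w)|.
        # If that lower bound already exceeds the radius, w cannot match.
        if abs(d_qp - d_pw) > radius:
            continue
        computations += 1
        if levenshtein(query, word) <= radius:
            matches.append(word)
    return matches, computations


# Hypothetical toy dictionary and pivot, chosen only for illustration.
words = ["cat", "cart", "card", "dog", "elephant", "category"]
pivot = "cat"
table = build_pivot_table(words, pivot)
matches, computations = range_query(table, pivot, "car", 1)
```

With this toy input the query computes fewer distances than a linear scan over all six words would, because three candidates are discarded using only the precomputed pivot distances. Structures such as RLC and the Vp-tree apply the same pruning idea recursively over many reference objects.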

Published in:

Information Management and Engineering, 2009. ICIME '09. International Conference on

Date of Conference:

3-5 April 2009