Abstract:
Summary form only given. We address the problem of positional indexing in the natural language domain. A positional inverted index contains the positions of the words, so it can recover the original text file, which means that the original file does not need to be stored. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded with variable byte code. The inverted lists of single terms are combined into one inverted list that represents a backbone of the text file, since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stopwords, and variants of the indexed words. Huffman coding is used to encode the presentation layer. Our experiments show that PISI is not far from the standard positional inverted index in terms of search speed and, at the same time, consumes less memory. PISI also proved significantly faster in search than its close competitor FWCSA at the same level of memory consumption.

PISI undergoes all the usual procedures during the construction phase. The indexed text is case folded (all letters are reduced to lower case), stopped (so-called stopwords are omitted) and stemmed (all words are reduced to their stems using the Porter stemming algorithm). PISI uses its presentation layer (proposed by Fariña et al. [1]) to store the information lost during these procedures. The presentation layer contains one (possibly empty) slot for every word of the inverted list. The slot is composed of the Huffman codes of all non-alphanumeric words and all stopwords preceding the corresponding indexed word. We compared three different indexes in the experimental part: our PISI, the word-based self-index FWCSA proposed by Fariña et al. in [1], and the standard positional inverted index II.
The fastest instance of PISI with achieved compression ratio 42.91% proved to be 23 time...
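The gap encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common variable byte layout (7 data bits per byte, low-order group first, high bit set on the last byte of each number); the abstract does not specify the exact layout PISI uses, and the function names below are hypothetical.

```python
from typing import List


def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer with 7 data bits per byte.

    Assumed convention: low-order group first, high bit set on the final byte.
    """
    if n < 0:
        raise ValueError("variable byte code handles non-negative integers only")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # continuation byte, high bit clear
        n >>= 7
    out.append(n | 0x80)       # terminating byte, high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a concatenation of variable byte codes back into integers."""
    values: List[int] = []
    n, shift = 0, 0
    for b in data:
        if b & 0x80:           # final byte of the current number
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values


def encode_positions(positions: List[int]) -> bytes:
    """Store increasing word positions as variable-byte-coded gaps."""
    out = bytearray()
    prev = 0
    for p in positions:
        out += vbyte_encode(p - prev)  # gap to the previous occurrence
        prev = p
    return bytes(out)


def decode_positions(data: bytes) -> List[int]:
    """Recover absolute positions by summing the decoded gaps."""
    positions: List[int] = []
    total = 0
    for gap in vbyte_decode(data):
        total += gap
        positions.append(total)
    return positions


if __name__ == "__main__":
    encoded = encode_positions([3, 130, 131, 20000])
    print(len(encoded), decode_positions(encoded))
```

Small gaps fit in a single byte, which is why storing gaps rather than absolute positions tends to pay off for frequent words.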
Published in: 2016 Data Compression Conference (DCC)
Date of Conference: 30 March 2016 - 01 April 2016
Date Added to IEEE Xplore: 19 December 2016
ISBN Information:
Electronic ISSN: 1068-0314