6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)

24-24 Sept. 1999

  • 6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)

    Publication Year: 1999
    Freely Available from IEEE
  • X-tract: structure extraction from botanical textual descriptions

    Publication Year: 1999, Page(s):2 - 7
    Cited by:  Papers (1)

    Most available information today, both from printed books and digital repositories, is in the form of free-format texts. The task of retrieving information from these ever-growing repositories has become a challenge for information retrieval (IR) researchers. In some fields, such as botany and taxonomy, textual descriptions observe a set of rules and use a relatively limited vocabulary. This makes...
  • An efficient uniform-cost normalized edit distance algorithm

    Publication Year: 1999, Page(s):8 - 15
    Cited by:  Papers (6)

    A common model for computing the similarity of two strings X and Y, of lengths m and n respectively with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio...
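
    For reference, the following is a minimal Python sketch of the classical dynamic-programming edit distance under the uniform (unit) cost model described in this abstract. It is an illustration of the underlying model only, not the paper's algorithm for the normalized distance.

        def edit_distance(x, y):
            # d[i][j] = minimum number of insertions, deletions and
            # substitutions needed to turn x[:i] into y[:j]
            m, n = len(x), len(y)
            d = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                d[i][0] = i              # delete all of x[:i]
            for j in range(n + 1):
                d[0][j] = j              # insert all of y[:j]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    sub = 0 if x[i - 1] == y[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,        # deletion
                                  d[i][j - 1] + 1,        # insertion
                                  d[i - 1][j - 1] + sub)  # substitution / match
            return d[m][n]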
  • A fast algorithm on average for all-against-all sequence matching

    Publication Year: 1999, Page(s):16 - 23
    Cited by:  Papers (4)  |  Patents (1)

    We present an algorithm which attempts to align pairs of subsequences from a database of genetic sequences. The algorithm simulates the classical dynamic programming alignment algorithm over a suffix array of the database. We provide a detailed average case analysis which shows that the running time of the algorithm is subquadratic with respect to the database size. A similar algorithm solves the ...
  • The ADT proximity and text proximity problems

    Publication Year: 1999, Page(s):24 - 30

    Practical text proximity problems lead to the abstract data type proximity that handles close points in the plane. Different variants and implementations of proximity are proposed and tight complexity bounds based on information theory are derived. This problem is related to evaluating Boolean queries in large text retrieval (as in Web search engines) and to the "Sorting X+Y" problem.
  • An associative semantic model for text processing

    Publication Year: 1999, Page(s):31 - 37

    Natural language texts have an underlying structure that conveys an essential part of their information content. In order to better exploit text resources, this structure must be rendered explicit, which requires an automatic analysis based on local context and general world knowledge. The analysis must closely match the expectations of a typical reader. The paper presents a computational model th...
  • Spaghettis: an array based algorithm for similarity queries in metric spaces

    Publication Year: 1999, Page(s):38 - 46
    Cited by:  Papers (13)

    We present a new, pivot-based data structure with dynamic capabilities to insert/delete both database elements and pivots. This feature is useful for applications where the database is not stationary, and the pivots must be changed from time to time. The spaghettis data structure can be thought of as a "flat" representation of a tree; but unlike it, a full representation of the distances can be us...
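
    As background, here is a minimal Python sketch of the generic pivot-based filtering that this family of structures relies on: precomputed pivot distances and the triangle inequality eliminate candidates before any expensive distance evaluation. It illustrates the general idea only, not the spaghettis structure itself, whose contribution lies in how these precomputed distances are stored and updated.

        def build_pivot_table(db, pivots, dist):
            # Precompute dist(x, p) for every database element x and pivot p.
            return [[dist(x, p) for p in pivots] for x in db]

        def range_query(q, radius, db, pivots, table, dist):
            # Triangle inequality: if |dist(q, p) - dist(x, p)| > radius for
            # some pivot p, then dist(q, x) > radius and x can be discarded
            # without computing dist(q, x).
            qp = [dist(q, p) for p in pivots]
            hits = []
            for x, row in zip(db, table):
                if all(abs(a - b) <= radius for a, b in zip(qp, row)):
                    if dist(q, x) <= radius:      # verify the survivors directly
                        hits.append(x)
            return hits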
  • Circular contextual insertions/deletions with applications to biomolecular computation

    Publication Year: 1999, Page(s):47 - 54
    Cited by:  Papers (7)

    Insertions and deletions of small circular DNA strands into long linear DNA strands are phenomena that happen frequently in nature and thus constitute an attractive paradigm for biomolecular computing. The paper presents a new model for DNA-based computation that involves circular as well as linear molecules, and that uses the operations of insertion and deletion. After introducing the formal mode...
  • Bounds for parametric sequence comparison

    Publication Year: 1999, Page(s):55 - 62

    We consider the problem of computing a global alignment between two or more sequences subject to varying mismatch and indel penalties. We prove a tight 3(n/2π)^(2/3) + O(n^(1/3) log n) bound on the worst-case number of distinct optimum alignments for two sequences of length n as the parameters are varied. This refines an O(n^(2/3)) upper bound by D. Gusfield et al. (1994). Our lower bou...
  • Motif detection in protein sequences

    Publication Year: 1999, Page(s):63 - 72
    Cited by:  Patents (1)

    We use methods from data mining and knowledge discovery to design an algorithm for detecting motifs in protein sequences. Based on this approach, we have implemented a program called "GYM". The Helix-Turn-Helix Motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with ex...
  • A method of describing document contents through topic selection

    Publication Year: 1999, Page(s):73 - 80
    Cited by:  Papers (3)  |  Patents (1)

    Given a large hierarchical dictionary of concepts, the task of selection of the concepts that describe the contents of a given document is considered. The problem consists in proper handling of the top-level concepts in the hierarchy. As a representation of the document, a histogram of the topics with their respective contribution in the document is used. The contribution is determined by comparis...
  • An efficient method for in memory construction of suffix arrays

    Publication Year: 1999, Page(s):81 - 88
    Cited by:  Papers (14)

    The suffix array is a string-indexing structure and a memory-efficient alternative to the suffix tree. It has many advantages for text processing. We propose an efficient algorithm for sorting suffixes. We call this algorithm the two-stage suffix sort. One of our ideas is to exploit the specific relationships between adjacent suffixes. Our algorithm makes it possible to use the suffix array for mu...
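
    For context, the following Python sketch builds a suffix array the naive way (sorting the suffixes by direct comparison) and answers a lookup by binary search. It is a baseline illustration only; the paper's two-stage suffix sort is a substantially faster construction that this sketch does not reproduce.

        def build_suffix_array(text):
            # Naive construction: sort suffix start positions lexicographically.
            return sorted(range(len(text)), key=lambda i: text[i:])

        def find(text, sa, pattern):
            # Binary search for the first suffix whose prefix is >= pattern.
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                    lo = mid + 1
                else:
                    hi = mid
            if lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
                return sa[lo]                     # one occurrence of pattern
            return -1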
  • A unifying framework for compressed pattern matching

    Publication Year: 1999, Page(s):89 - 96
    Cited by:  Papers (11)  |  Patents (3)

    We introduce a general framework which captures the essence of compressed pattern matching for various dictionary-based compression methods, and propose a compressed pattern matching algorithm for the framework. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such comp...
  • A fast distributed suffix array generation algorithm

    Publication Year: 1999, Page(s):97 - 104
    Cited by:  Papers (1)

    We present a distributed algorithm for suffix array generation, based on the sequential algorithm of U. Manber and E. Myers (1993). The sequential algorithm is O(n log n) in the worst case and O(n log log n) on average, where n is the text size. Using p processors connected through a high bandwidth network, we obtain O((n/p) log log n) average time, which is an almost optimal speedup. Unlike previous algo...
  • Near optimal multiple sequence alignments using a traveling salesman problem approach

    Publication Year: 1999, Page(s):105 - 114
    Cited by:  Papers (1)

    We present a new method for the calculation of multiple sequence alignments (MSAs). The input to our problem is a set of n protein sequences. We assume that the sequences are related to each other and that there exists some unknown evolutionary tree that corresponds to the MSA. One advantage of our method is that the scoring can be done with reference to this phylogenetic tree, even though the tree stru...
  • Practical constructions of L-restricted alphabetic prefix codes

    Publication Year: 1999, Page(s):115 - 119

    Information retrieval systems use various search techniques such as B-trees, inverted files and suffix arrays to provide quick response. Many of these techniques rely on string comparison operations. If a record field is coded using Huffman codes (D.A. Huffman, 1952) in order to save storage space, the field must be decoded before performing any comparison. On the other hand, if the field is alpha...
  • Cross-domain approximate string matching

    Publication Year: 1999, Page(s):120 - 127
    Cited by:  Papers (1)  |  Patents (1)

    Approximate string matching is an important paradigm in domains ranging from speech recognition to information retrieval and molecular biology. We introduce a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain. This issue arises,...
  • A fast and space-economical algorithm for calculating minimum redundancy prefix codes

    Publication Year: 1999, Page(s):128 - 134
    Cited by:  Papers (1)

    The minimum redundancy prefix code problem is to determine, for a given list W = [w_1, ..., w_n] of n positive symbol weights, a list L = [l_1, ..., l_n] of n corresponding integer codeword lengths such that Σ_{i=1..n} 2^(-l_i) ≤ 1 and Σ_{i=1..n} w_i l_i is minimized. With the optimal list of codeword lengths, an optimal canonical code c...
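
    To make the problem statement concrete, this Python sketch computes minimum redundancy codeword lengths with the textbook heap-based Huffman construction; the resulting lengths satisfy the Kraft inequality and minimize the weighted sum above. It is offered only as a baseline, not as the fast, space-economical algorithm of the paper.

        import heapq

        def minimum_redundancy_lengths(weights):
            # Returns integer lengths l_i with sum(2**-l_i) <= 1 that minimize
            # sum(w_i * l_i), via standard Huffman tree merging.
            heap = [(w, i, [i]) for i, w in enumerate(weights)]
            heapq.heapify(heap)
            lengths = [0] * len(weights)
            tag = len(weights)                    # tie-breaker for equal weights
            while len(heap) > 1:
                w1, _, leaves1 = heapq.heappop(heap)
                w2, _, leaves2 = heapq.heappop(heap)
                for i in leaves1 + leaves2:
                    lengths[i] += 1               # merged leaves move one level deeper
                heapq.heappush(heap, (w1 + w2, tag, leaves1 + leaves2))
                tag += 1
            return lengths

        # Example: minimum_redundancy_lengths([5, 3, 1, 1]) == [1, 2, 3, 3]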
  • Linear time sorting of skewed distributions

    Publication Year: 1999, Page(s):135 - 140

    The article presents an efficient linear average time algorithm to sort lists of integers that follow skewed distributions. It also studies a particular case where the list follows Zipf's distribution, and presents an example application where the algorithm is used to reduce the time to build word-based Huffman codes.
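
    For comparison, here is a short Python sketch of counting sort, the simplest linear-time integer sort; it runs in O(n + k) time for keys in 0..k and is illustrative only, since the paper's algorithm instead targets skewed (e.g. Zipf-like) inputs whose key range may be large.

        def counting_sort(values):
            # Linear-time sort for non-negative integer keys: O(n + k),
            # where k is the largest key present.
            if not values:
                return []
            counts = [0] * (max(values) + 1)
            for v in values:
                counts[v] += 1
            out = []
            for key, c in enumerate(counts):
                out.extend([key] * c)
            return out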
  • Searching in metric spaces by spatial approximation

    Publication Year: 1999, Page(s):141 - 148
    Cited by:  Papers (13)  |  Patents (3)

    We propose a novel data structure to search in metric spaces. A metric space is formed by a collection of objects and a distance function defined among them, which satisfies the triangle inequality. The goal is, given a set of objects and a query, to retrieve those objects close enough to the query. The number of distances computed to achieve this goal is the complexity measure. Our data structure,...
  • Effects of term segmentation on Chinese/English cross-language information retrieval

    Publication Year: 1999, Page(s):149 - 157
    Cited by:  Papers (1)  |  Patents (5)

    The majority of recent Cross-Language Information Retrieval (CLIR) research has focused on European languages. CLIR problems that involve East Asian languages such as Chinese introduce additional challenges, because written Chinese texts lack boundaries between terms. The paper examines three Chinese segmentation techniques in combination with two variants of dictionary-based Chinese to English qu...
  • String-oriented databases

    Publication Year: 1999, Page(s):158 - 167
    Cited by:  Papers (2)  |  Patents (1)

    Relational databases and Datalog view each attribute as indivisible. This view, though useful in several applications, does not provide a suitable database paradigm for use in genetic, multimedia or scientific databases. Data in these applications are unstructured; querying on sub-strings of attribute values is often necessary. Moreover, due to imprecision and incompleteness in the data, approximat...
  • Contextual array splicing systems

    Publication Year: 1999, Page(s):168 - 175

    The concept of splicing is extended to arrays. A new method of splicing called Contextual Array Splicing is introduced which produces imperfect molecules throughout the structure. This model is capable of generating interesting patterns. We prove that if an array language is p-column strictly locally testable, then all arrays of column size p are constants. The concept of Mixed Splicing is also in...
  • Top-down extraction of semi-structured data

    Publication Year: 1999, Page(s):176 - 183
    Cited by:  Papers (3)  |  Patents (4)

    We propose an innovative approach to extracting semi-structured data from Web sources. The idea is to collect a couple of example objects from the user and to use this information to extract new objects from new pages or texts. We propose a top-down strategy that extracts complex objects, decomposing them into less complex objects until atomic objects have been extracted. Through experimentation, w...
  • CoBWeb-a crawler for the Brazilian Web

    Publication Year: 1999, Page(s):184 - 191
    Cited by:  Papers (4)  |  Patents (1)

    One of the key components of current Web search engines is the document collector. The paper describes CoBWeb, an automatic document collector whose architecture is distributed and highly scalable. CoBWeb aims at collecting large amounts of documents per time period while observing operational and ethical limits in the crawling process. CoBWeb is part of the SIAM (Information Systems in Mobile Com...