Research Issues in Data Engineering: Multi-lingual Information Management, 2003. RIDE-MLIM 2003. Proceedings. 13th International Workshop on

Date 10-11 March 2003

  • Exploiting multi-lingual text potentialities in EBMT systems

    Publication Year: 2003 , Page(s): 9 - 15

    Translating documents from a source to a target language is a repetitive activity. The attempt to automate such a difficult task has been a long-term scientific dream. Among the several types of approaches in machine translation (MT), one of the most promising paradigms is example-based machine translation (EBMT). An EBMT system translates by analogy, using past translations to translate other similar source-language material into the target language. In this paper, we introduce EXTRA (EXample-based TRanslation Assistant), a complete EBMT system that exploits some innovative ideas in information retrieval and multilingual text management to effectively and efficiently extract useful suggestions from past translations and present them to the translator. This work was developed jointly with the LOGOS group, a worldwide leader in multilingual document translation.

  • On database support for multilingual environments

    Publication Year: 2003 , Page(s): 23 - 30
    Cited by:  Patents (1)

    Global e-commerce and mass-outreach e-governance programs have brought into sharp focus the need for database systems to store and manipulate text data efficiently in a suite of natural languages. While some means of storing and querying multilingual data are provided by all current database systems, to the best of our knowledge, there has been no prior study of their functionality or efficiency in this regard. In this paper, we explore the multilingual support needed by the user community and what is currently provided by the popular database systems to satisfy these needs. Specifically, a comparison of multilingual features supported by the database systems is provided against a set of relevant parameters. Initial results from our performance study indicate that serious lacunae exist in the performance with respect to multilingual data. We propose a new data type and associated database system architecture components to make the performance of the database system language independent. Results from our initial implementation of the proposed methodology are encouraging, indicating the value of such an approach.

  • Semi-automatic indexing of documents with a multilingual thesaurus

    Publication Year: 2003 , Page(s): 31 - 38

    With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval of documents written in different languages. We present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval. We use special dictionaries and user interaction in order to resolve ambiguities and find adequate canonical terms in the language and an adequate abstract language-independent term. The abstract thesaurus is updated incrementally by new indexed documents and is used to search for documents using adequate terms.

  • Event information extraction using link grammar

    Publication Year: 2003 , Page(s): 16 - 22
    Cited by:  Papers (4)

    In this paper, we present a scheme for identifying instances of events and extracting information about them. The scheme can handle all events with which an action can be associated, which covers most types of events. Our system extracts semantic information from the syntactic structure that the link grammar system described by D. Sleator and D. Temperley (1991) assigns to any English sentence. The instances of events are identified by finding all sentences in the text where the verb that best represents the action in the event, or one of its synonyms/hyponyms, occurs as a main verb. Then, information about that instance of the event is derived using a set of rules which we have developed to identify the subject and object as well as the modifiers of all verbs and nouns in any English sentence, making use of the structure given by the link parser. The scheme was tested on the Reuters corpus and achieved recall and precision of up to 100%.

  • ABHIDHA: an extended WordNet for Indo Aryan languages

    Publication Year: 2003 , Page(s): 1 - 8

    A lexical knowledge base is an important component of any intelligent information processing system. The WordNet developed at the Cognitive Systems Laboratories at Princeton has served as a lexical reference system for natural language processing activities. The Indian-language-based activities at our institute, mainly in text-to-speech synthesis and natural language generation from iconic inputs, require the inclusion of additional features in the lexical reference system, such as phonology, word roots, and etymological information. Our initial efforts have been in Hindi and Bengali, but the commonality of Indo Aryan languages and the importance of these extra features lead us to believe that it is a worthwhile effort to build a WordNet containing these features for other Indo Aryan languages. In this paper, we discuss the issues relating to the structured design and development of a generalized extended WordNet for Indo Aryan languages, with special reference to Hindi and Bengali.

  • Creation of data resources and design of an evaluation test bed for Devanagari script recognition

    Publication Year: 2003 , Page(s): 55 - 61
    Cited by:  Papers (3)

    The Indian subcontinent has a large number of languages, dialects, and scripts, with the Devanagari script being the primary and most widely used of all the scripts. To date, much of the Devanagari optical character recognition (OCR) research has been restricted to a handful of groups. As a result, techniques have not yet been widely disseminated or evaluated independently, and automated evaluation tools are currently not available for lack of a standard representation of ground-truth and result data. A key reason for the absence of sustained research efforts in off-line Devanagari OCR appears to be the paucity of data resources. Ground-truthed data for words and characters, on-line dictionaries, corpora of text documents, and reliable, standardized statistical analyses and evaluation tools are currently lacking. The creation of such data resources will therefore provide a much-needed fillip to researchers working on Devanagari OCR. This paper describes a National Science Foundation sponsored project under the International Digital Libraries program to create data resources that will facilitate development of Devanagari OCR technology and provide a standardized test bed and evaluation tools for Devanagari script recognition.

  • An extensible approach to high-quality multilingual typesetting

    Publication Year: 2003 , Page(s): 62 - 67

    We propose to create and study a new model for the micro-typography part of automated multilingual typesetting. This new model will support quality typesetting for a number of modern and ancient scripts. The major innovations in the proposal are: the process is refined into four phases, each dependent on a multidimensional tree-structured context summarizing the current linguistic and cultural environment. The four phases are: preparing the input stream for typesetting; segmenting the stream into clusters (words); typesetting these clusters; and then recombining the clusters into a typeset text stream. The context is pervasive throughout the process; the algorithms used in each phase are context-dependent, as are the meanings of fundamental entities such as language, script, font and character.

  • Script-based classification of hand-written text documents in a multilingual environment

    Publication Year: 2003 , Page(s): 47 - 54
    Cited by:  Papers (6)

    Script-based text document classification is an important field of research in the context of multilingual textual document processing. However, none of the script identification techniques available in the literature so far consider handwritten documents. Variations in the writing style, character size, inter-line and inter-word spacings, etc. make the recognition process difficult and unreliable when these script identification algorithms, more specifically visual appearance based approaches, are applied directly to hand-written documents. Therefore, in this paper, we propose to preprocess the input document images so as to compensate for the variations due to writing style and thereby make them suitable for analysis on the basis of their visual appearances. Accordingly, we apply denoising, thinning, pruning, m-connectivity and text size normalization in sequence. Multi-channel Gabor filtering is used to extract texture features that characterize the visual appearances of the document images. Experimental results demonstrate the potential of our proposed method of script identification for hand-written text document classification.

  • Correlating summarization of a pair of multilingual documents

    Publication Year: 2003 , Page(s): 39 - 46

    With the emergence of an enormous number of documents in multiple languages, it is desirable to construct text mining methods that can compare them and highlight their similarities. In this paper, we explore the research issue of comparative summarization for a pair of multilingual documents. A bipartite graph based algorithm is proposed to correlate textual content against sources in various languages. The algorithm aligns the (sub)topics of a pair of multilingual documents and summarizes their correlation by sentence extraction. A pair of documents in different languages is modeled with a weighted bipartite graph. A mutual reinforcement principle is applied to identify a dense subgraph of the weighted bipartite graph. Sentences corresponding to the subgraph are correlated well in textual content and convey the dominant shared topic of the pair of documents. As a further enhancement, a bi-clustering algorithm can first be used to partition the bipartite graph into several clusters, each containing sentences from the two documents. These clusters correspond to shared subtopics, and the above mutual reinforcement principle can be applied to extract topic sentences within each subtopic group.

  • Proceedings. Thirteenth International Workshop on Research Issues in Data Engineering: Multi-lingual Information Management. RIDE-MLIM 2003 (IEEE Cat. No. 03TH8687)

    Publication Year: 2003

    The following topics are dealt with: NLP (natural language processing) technologies for MLIM (multi-lingual information management); system issues in MLIM; and multilingual text processing.

  • Author index

    Publication Year: 2003 , Page(s): 68
    Freely Available from IEEE