2011 International Conference on Asian Language Processing (IALP)

Date: 15-17 Nov. 2011

Displaying Results 1 - 25 of 82
  • [Front cover]

    Page(s): C1
  • [Title page i]

    Page(s): i
  • [Title page iii]

    Page(s): iii
  • [Copyright notice]

    Page(s): iv
  • Table of contents

    Page(s): v - ix
  • Message from the General Chair

    Page(s): x
  • Message from the Program Chairs

    Page(s): xi
  • Message from the Local Organizing Chair

    Page(s): xii
  • Conference Committees

    Page(s): xiii
  • Program Committee

    Page(s): xiv - xv
  • Invited Talks

    Page(s): xvi - xx

More than 6000 living languages are spoken in the world today, and the majority of them are concentrated in Asia. Every language has its own specific acoustic and linguistic characteristics that require special modeling techniques. This talk presents our recent experiences in building automatic speech recognition (ASR) systems for the Indonesian, Thai and Chinese languages. For Indonesian, we are building a spoken-query information retrieval (IR) system. To address the large variation in the pronunciation of proper nouns and English words, we have applied proper-noun-specific adaptation in acoustic modeling and rule-based English-to-Indonesian phoneme mapping. For Thai, since the written form has no word boundaries, we have proposed a new method for automatically creating word-like units from a text corpus, and to recognize spoken-style utterances we have applied topic and speaking-style adaptation to the language model. In spoken Chinese, long organization names are often abbreviated, and abbreviated utterances cannot be recognized if the abbreviations are not in the dictionary. We have proposed a new method for automatically generating Chinese abbreviations, and by expanding the vocabulary with the generated abbreviations we have significantly improved the performance of voice search. The talk also covers several recent research activities for the Japanese language.

  • A Simplified-Traditional Chinese Character Conversion Model Based on Log-Linear Models

    Page(s): 3 - 6

With the growth of exchange activities among the four cross-strait regions, correctly converting between Traditional Chinese (TC) and Simplified Chinese (SC) has become more and more important. Numerous one-to-many mappings and differences in term usage make conversion from SC to TC especially difficult. This paper proposes a novel simplified-to-traditional Chinese character conversion model based on log-linear models, which integrates features such as language models and lexical semantic consistency weights. When estimating the lexical semantic consistency weights, cross-language word-based semantic spaces are used. Experimental results show that the proposed model achieves better performance.

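The log-linear combination described in the abstract can be sketched as a weighted sum of feature scores per candidate. The feature functions and weights below are hypothetical stand-ins; the paper's actual features (language model score, lexical semantic consistency weight) are only named, not specified.

```python
# A minimal sketch of log-linear candidate scoring, not the paper's actual
# model. Each feature function and weight here is a hypothetical stand-in.
def score(candidate, features, weights):
    # Log-linear model: the score is a weighted sum of feature values.
    return sum(w * f(candidate) for f, w in zip(features, weights))

def convert(candidates, features, weights):
    # Choose the traditional-character candidate with the highest score.
    return max(candidates, key=lambda c: score(c, features, weights))
```

In use, each SC character with a one-to-many mapping yields a small candidate set of TC characters, and `convert` picks the best-scoring one.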
  • Improving Chinese Dependency Parsing with Self-Disambiguating Patterns

    Page(s): 7 - 10

To alleviate the data sparseness problem in dependency parsing, most previous studies used features constructed from large-scale auto-parsed data. Unlike previous work, we propose a new approach that improves dependency parsing with context-free dependency triples (CDTs) extracted using self-disambiguating patterns (SDPs). The use of SDPs avoids dependence on a baseline parser and makes it possible to explore the influence of different types of substructures one by one. Additionally, taking the available CDTs as seeds, a label propagation process tags a large number of unlabeled word pairs as CDTs. Experiments show that, when CDT features are integrated into a maximum spanning tree (MST) dependency parser, the new parser improves significantly over the baseline MST parser. Comparative results also show that CDTs with dependency relation labels perform much better than CDTs without them.

  • Joint Decoding for Chinese Word Segmentation and POS Tagging Using Character-Based and Word-Based Discriminative Models

    Page(s): 11 - 14

For the Chinese word segmentation and POS tagging problem, both character-based and word-based discriminative approaches can be used. Experiments show that the two approaches make different errors and can complement each other. In this paper, we propose a joint decoding model that combines character-based and word-based models using a multi-beam search algorithm. Experimental results show that the joint decoding model outperforms both baseline models.

  • Natural Language Grammar Induction of Indonesian Language Corpora Using Genetic Algorithm

    Page(s): 15 - 18

Grammar induction is a machine learning process for learning a grammar from corpora. This paper discusses grammar induction for Indonesian language corpora using a genetic algorithm. The grammar production rules are modeled as chromosomes, and the fitness function counts how many sentences can be parsed. The data used are Indonesian fairy tales such as "Bawang Merah Bawang Putih" and "Malin Kundang". The paper explains in detail the steps of each process carried out for natural language grammar problems.

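A parse-count fitness of the kind the abstract describes can be sketched with a CYK recognizer. The paper's chromosome encoding is not given; here a grammar is a hypothetical (lexicon, rules) pair in Chomsky normal form, and the example words are illustrative only.

```python
from itertools import product

# Sketch of a "how many sentences parse" fitness function, assuming a
# CNF grammar: lexicon maps word -> set of nonterminals, rules maps a
# pair (B, C) -> set of A for every production A -> B C.
def cyk_recognize(words, lexicon, rules, start="S"):
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, c in product(table[i][k], table[k][j]):
                    table[i][j] |= rules.get((b, c), set())
    return start in table[0][n]

def fitness(grammar, corpus):
    # Fitness = fraction of corpus sentences the grammar can parse.
    lexicon, rules = grammar
    return sum(cyk_recognize(s.split(), lexicon, rules) for s in corpus) / len(corpus)
```

The genetic algorithm would then mutate and recombine rule sets, keeping chromosomes whose fitness over the fairy-tale corpus is highest.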
  • Error-Driven Adaptive Language Modeling for Chinese Pinyin-to-Character Conversion

    Page(s): 19 - 22

The performance of Chinese Pinyin-to-character conversion degrades severely when the characteristics of the training and conversion data differ. Because natural language is highly variable and uncertain, it is impossible to build a complete, general language model that suits all tasks. Traditional adaptive MAP models mix task-independent data with task-dependent data using a mixture coefficient, but we can never predict what style of language users will produce or what new domains will appear. This paper presents a statistical error-driven adaptive language modeling approach for Chinese Pinyin input systems. The model is adapted incrementally whenever an error occurs during Pinyin-to-character conversion, and it significantly improves the conversion rate.

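Error-driven incremental adaptation of the general sort described above can be sketched as interpolating a fixed base model with a cache updated only on user corrections. The class, its parameters, and the interpolation scheme are assumptions for illustration, not the paper's actual model.

```python
from collections import Counter

# Hypothetical sketch: a fixed base bigram model interpolated with an
# error-driven cache. Only corrected (i.e. previously mis-converted)
# sentences feed the cache, so adaptation happens exactly when errors occur.
class AdaptiveBigram:
    def __init__(self, base_prob, lam=0.9):
        self.base_prob = base_prob   # function (prev, word) -> probability
        self.pair_counts = Counter()
        self.prev_counts = Counter()
        self.lam = lam               # interpolation weight for the base model

    def learn_from_error(self, corrected_words):
        # Called only when the user corrects a mis-converted sentence.
        for prev, word in zip(corrected_words, corrected_words[1:]):
            self.pair_counts[(prev, word)] += 1
            self.prev_counts[prev] += 1

    def prob(self, prev, word):
        cache = (self.pair_counts[(prev, word)] / self.prev_counts[prev]
                 if self.prev_counts[prev] else 0.0)
        return self.lam * self.base_prob(prev, word) + (1 - self.lam) * cache
```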
  • Theoretical Framework of Mongolian Word Segmentation Specification for Information Processing

    Page(s): 23 - 25

Establishing a contemporary Mongolian word segmentation specification for information processing is of great significance for the standardization of information processing, the compatibility of different systems, the sharing of corpora, grammatical analysis, and POS tagging. This paper studies the framework of Mongolian word segmentation, including guidelines, formulating principles, styles, the scope of segmentation units, the foundation of the specification, and its structure, laying the theoretical foundation for the specification.

  • Research on the Uyghur Information Database for Information Processing

    Page(s): 26 - 29

Although "grammatical rule + dictionary" is the traditional pattern for natural language processing, it can be hard to explain word combinations in language this way. If all word combinations were entered into a database, both the grammar and the information system would be simplified. Based on a review of the "little grammar in the big word storehouse" approach, this paper discusses the necessity, methods, and principles of establishing a phrase information database for the Uyghur language.

  • Sentence Boundary Detection in Colloquial Arabic Text: A Preliminary Result

    Page(s): 30 - 32

Recently, natural language processing tasks are increasingly conducted over online content. This poses a special problem for applications in Arabic. Online Arabic content is usually written in informal colloquial Arabic, which is ill-structured and lacks linguistic standardization. In this paper, we investigate sentence boundary detection, a preliminary step toward successful NLP processing. Since informal Arabic lacks basic linguistic rules, we establish a list of commonly used punctuation marks after extensively studying a large amount of informal Arabic text. We then evaluate the correct usage of these punctuation marks as sentence delimiters, yielding a preliminary accuracy of 70%.

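Delimiter-based sentence splitting of the kind evaluated above can be sketched in a few lines. The paper's actual punctuation list is not given; the delimiter set below (Latin marks plus the Arabic question mark) is an assumption for illustration only.

```python
import re

# Hypothetical delimiter set; the paper's studied punctuation list is unknown.
DELIMITERS = ".!?\u061F"  # '.', '!', '?', and Arabic question mark '؟'

def split_sentences(text):
    # Split on runs of delimiter characters and drop empty fragments.
    parts = re.split("[" + re.escape(DELIMITERS) + "]+", text)
    return [p.strip() for p in parts if p.strip()]
```

The 70% preliminary accuracy reported suggests that, in informal Arabic, such marks are used inconsistently enough that a raw split like this mis-segments roughly a third of boundaries.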
  • A Study of the Classification and Arrangement Rule of Uygur Morphemes for Information Processing

    Page(s): 33 - 36

In processing a modern Uygur corpus, it is necessary to study word-class marking at the word level of modern Uygur language data. Since the classification of morphemes serves word-class marking, this article classifies Uygur morphemes by their functions and lists all their classifications and arrangement rules.

  • Graph-Based Language Model of Long-Distance Dependency

    Page(s): 37 - 40

In natural language processing and related fields, classic text representation methods seldom consider the role of word order and long-distance dependencies in the semantic representation of texts. In this paper, we discuss the current state and problems of statistical language models, especially the head-driven statistical language model and Head-driven Phrase Structure Grammar (HPSG). We then briefly introduce the development and realization of long-distance dependency language models. Finally, a graph-based long-distance dependency language model is proposed.

  • BASRAH: Arabic Verses Meters Identification System

    Page(s): 41 - 44

In this paper, we present BASRAH, a system that automatically identifies the meter of Arabic verse, an operation that normally requires a certain level of human expertise. BASRAH uses the numerical prosody method, which depends on a verse coding derived from the general concept of al-Khalil's feet using the two primary units (cord = 2 and peg = 3). Tested on thousands of old and modern Arabic verses, BASRAH proved to be an efficient tool for helping inexperienced users determine the meters of Arabic verses.

  • WordNet Editor to Refine Indonesian Language Lexical Database

    Page(s): 47 - 50

This paper describes an approach for editing the Indonesian language lexical database, especially the noun category and its relations. The purpose of the editor is to refine the Indonesian lexical database developed in our previous research. The editor's visualization uses a graph library with some modifications and additions. Furthermore, the editor is web-based so that everyone can participate in improving the database; an administrator role accepts or rejects the changes suggested by members. We believe this editing approach can also be used to improve WordNets developed for other languages.

• Two Ontological Approaches to Building an Integrated Semantic Network for Yami ka-Verbs

    Page(s): 51 - 54

This paper describes a proposed ontological language processing system for integrating two semantic sets for a group of important verbs with the prefix ka- in Yami, an Austronesian language of Taiwan. The two semantic sets represent two different classification approaches: one follows the concepts and rules of WordNet, and the other uses the metaphors in Yami indigenous knowledge. The ontologies are used for classification and semantic integration, and the results of the implementation are used to build the Yami lexical database. The paper illustrates how the methodology and framework used in classifying Yami can be applied to Austronesian language processing.

  • Issues with the Unergative/Unaccusative Classification of the Intransitive Verbs

    Page(s): 55 - 58

The paper abandons a strict two-way sub-classification of intransitive verbs into unaccusative and unergative for Hindi and instead plots their distribution in a diffusion chart. The diagnostic tests that Bhatt (2003) applied to Hindi data are ranked by how reliably they attribute the correct sub-class to verbs. The diffusion chart shows that a tripartite classification handles intransitive verbs better than the classical binary approach: (1) verbs that take an animate subject and are compatible with adverbs of volitionality; (2) verbs that take an animate subject but are not compatible with adverbs of volitionality; and (3) verbs that take an inanimate subject. The classification is of immense advantage for various NLP tasks such as machine translation and natural language generation.
