2010 4th International Universal Communication Symposium (IUCS)

Date: 18-19 Oct. 2010

Displaying Results 1 - 25 of 84
  • [Front cover]

    Page(s): c1
    PDF (892 KB)
    Freely Available from IEEE
  • [Front and back cover]

    Page(s): c1 - c4
    PDF (1156 KB)
    Freely Available from IEEE
  • [Title page]

    Page(s): 1
    PDF (1803 KB)
    Freely Available from IEEE
  • [Copyright notice]

    Page(s): 1
    PDF (23 KB)
    Freely Available from IEEE
  • Table of contents

    Page(s): xi - xix
    PDF (85 KB)
    Freely Available from IEEE
  • Pseudo natural language vs. controlled natural language

    Page(s): K-1 - K-3
    PDF (74 KB) | HTML

    Natural language is an indispensable means of communication, but it is also a serious barrier to communication, in particular between humans and computers. Efforts to overcome this barrier go back a long time. One of these efforts was to design controlled natural languages (CNLs), which are subsets of natural languages, yet are easy for non-native users to use and can be processed by computers. The first results on CNL for practical use were published in 2006-2007. On the other hand, we have designed and used another kind of limited natural language, the Pseudo Natural Language (PNL), since the late eighties; it has been put into use successfully and has shown its ability to help overcome the language barrier. This talk compares CNL with PNL in detail and discusses future directions of this area.

  • Information analysis technology for Universal Communication

    Page(s): K-4 - K-5
    PDF (47 KB) | HTML

    The Internet functions as a social infrastructure and stores a huge amount of information; people use it whenever they need information. Its primary role is changing from digital content sharing to knowledge sharing. The NICT Knowledge Clustered Group is researching and developing information analysis technology for “Universal Communication”. In order to realize universal communication, we have to overcome several barriers, and information analysis is an essential technology for overcoming the “Information Quality” barrier. The overabundance that makes it difficult to extract the most useful information from the Internet is called the information explosion. The amount of consumer-generated media (CGM) is increasing particularly rapidly, but its content may be neither valuable nor trustworthy, and ordinary web search engines cannot evaluate it. However, if we can evaluate the credibility of information, credible digital content will become as easily accessible as any other knowledge on the Internet. We believe that information analysis technology for finding higher-quality information in Web content enables us to overcome the “Information Quality” barrier. The research of the NICT Knowledge Clustered Group focuses on the analysis of information credibility criteria and the development of a knowledge cluster system. The former addresses the credibility criteria of web content using natural language processing (NLP) technology; the latter aims to develop a knowledge cluster system that manages knowledge spread over enormous numbers of sites.

  • New resources trigger new technologies

    Page(s): K-6 - K-7
    PDF (53 KB) | HTML

    Two decades ago, large-scale corpora, as new language resources, brought forth a paradigm shift marked by the revival of empiricism. Now, however, some researchers, including those who started that revival, have begun to ask “what should they (the next generation of students) do when most of the low-hanging fruit has been pretty much picked over?” or to predict that the weird state of computational linguistics without general linguistics should be brought to an end. The author anticipates that a new adjustment of the paradigm is approaching: once again, new language resources will trigger new NLP technologies. What will the new language resources look like? Resources like HowNet will soon be brought into full play. Corpora helped us achieve shallow practice; HowNet, instead, will take us deeper and thus may help us reach the high-hanging fruit. After a brief overview of HowNet, the author gives an overall demonstration of three HowNet-based application tools, all closely related to immediate potential demands: (1) Text-CT (or Text-X-ray), which can show all the senses of each word and expression of a text, rather than merely word strings or, at most, their POS tags; (2) Sense-Colony-Tester, which works on the basis of a sense colony activator and is able to measure the sense colony testing value of each sense in the text; and (3) Morphological Decomposer, which can be used to deal with various types of OOVs by decomposing their morphological formation, in both English and Chinese, and extracting their meanings.

  • Author index

    Page(s): A-1 - A-3
    PDF (193 KB)
    Freely Available from IEEE
  • Joint tokenization, parsing, and translation

    Page(s): 1
    PDF (28 KB) | HTML

    Summary form only given. Natural language processing is all about ambiguities. In machine translation, tokenization and parsing mistakes caused by segmentation and structural ambiguities potentially introduce translation errors. A well-known solution is to provide more alternatives by using compact representations such as lattices and forests. In this talk, I will introduce a technique that goes beyond lattices and forests by integrating tokenization, parsing, and translation in one system, so that the three can interact with and benefit each other in a discriminative framework. Experimental results show that such integration significantly improves tokenization and translation performance.

  • Domain adaptation for statistical machine translation in development corpus selection

    Page(s): 2 - 7
    PDF (205 KB)

    The performance of a statistical machine translation (SMT) system is affected by its model parameters (e.g., the weights of feature functions), which are usually tuned on a development corpus. Most research to date has focused on algorithms for tuning the parameters, whereas the selection of the development corpus has received little discussion. It stands to reason that parameters tuned on a well-chosen corpus will improve translation performance. Instead of exploring new algorithms, this paper aims to select a development corpus for parameter tuning according to the test set. We cast this problem as domain adaptation and propose two methods, based on an information retrieval (IR) technique and a text clustering (TC) technique, respectively. Experimental results show that both methods yield more stable tuning performance than subjective selection of the development corpus.

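    As a rough, non-authoritative illustration of the IR-based selection described above, the sketch below ranks a pool of candidate development sentences by TF-IDF cosine similarity to the test set; the function name, the pool/test inputs, and the use of scikit-learn are assumptions rather than the authors' implementation.

    ```python
    # Illustrative sketch: pick the development sentences most similar to the test
    # set via TF-IDF cosine similarity (the paper's exact IR method may differ).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def select_dev_corpus(candidate_sents, test_sents, size=1000):
        vec = TfidfVectorizer()
        cand_mat = vec.fit_transform(candidate_sents)      # candidate dev pool
        test_vec = vec.transform([" ".join(test_sents)])   # test set as one query
        scores = cosine_similarity(cand_mat, test_vec).ravel()
        ranked = scores.argsort()[::-1][:size]             # most similar first
        return [candidate_sents[i] for i in ranked]

    # dev = select_dev_corpus(pool_of_source_sentences, test_source_sentences, size=500)
    ```
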
  • Discriminative reranking for SMT using various global features

    Page(s): 8 - 14
    PDF (182 KB) | HTML

    In this paper, we propose using various global features for discriminative reranking in an SMT framework. We employ an online large-margin training algorithm for structured-output support vector machines based on the margin infused relaxed algorithm (MIRA). Besides the standard features, such as the decoder's scores, source and target sentences, alignments, and part-of-speech tags, we include sentence-type probabilities, posterior probabilities, and back-translation features for reranking. These features have proved useful in other statistical machine translation approaches, but this is the first attempt to apply them to reranking. Our experimental results on the 160K-sentence BTEC corpus show an improvement of 1-4 BLEU percentage points on Japanese/Chinese-to-English translation.

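    A minimal sketch of the reranking step described above, assuming each n-best candidate carries a dictionary of (global) feature values: the reranker simply returns the candidate with the highest weighted feature sum. The feature names and weights are invented for illustration, and the online MIRA weight training itself is not shown.

    ```python
    # Minimal reranking sketch: each candidate carries decoder and global feature
    # values; the reranker picks the candidate with the highest weighted sum.
    def rerank(nbest, weights):
        def score(cand):
            return sum(weights.get(f, 0.0) * v for f, v in cand["features"].items())
        return max(nbest, key=score)

    nbest = [
        {"hyp": "he goes to school", "features": {"decoder_score": -4.2,
         "sent_type_prob": 0.7, "posterior": 0.31, "back_trans": -5.0}},
        {"hyp": "he go to school",   "features": {"decoder_score": -4.0,
         "sent_type_prob": 0.4, "posterior": 0.22, "back_trans": -6.3}},
    ]
    weights = {"decoder_score": 1.0, "sent_type_prob": 2.0,
               "posterior": 1.5, "back_trans": 0.5}
    print(rerank(nbest, weights)["hyp"])   # -> "he goes to school"
    ```
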
  • Head- and relation-driven tree-to-tree translation using phrases in a monolingual corpus

    Page(s): 15 - 22
    PDF (271 KB)

    We propose an extension of context-based machine translation (CBMT) that incorporates a syntactic transfer approach to deal with distant language pairs such as Japanese and English. Our method uses a tree structure in which each node is a head and each edge is a dependency with a relation between heads. We retrieve partial trees from a monolingual corpus using a bilingual dictionary to generate candidate translation phrases, and build a tree by overlapping their heads. The word order of a verb and its elements is decided based on a structured monolingual corpus in the target language. In our experiment on Japanese-to-English patent translation, human evaluation showed that our method outperformed phrase-based and hierarchical phrase-based statistical machine translation.

  • Structuring and manipulating hand-drawn concept maps

    Page(s): 23
    PDF (29 KB) | HTML

    Concept maps are an important tool for knowledge organization, representation, and sharing. Most current concept map tools do not fully support hand-drawn concept map creation and manipulation, largely because of the lack of methods for recognizing hand-drawn concept maps. We propose a structure recognition method: our algorithm extracts the node blocks and link blocks of a hand-drawn concept map by combining dynamic programming and graph partitioning, and then builds the concept-map structure by relating the extracted nodes and links. We also introduce a structure-based intelligent manipulation technique for hand-drawn concept maps. Evaluation shows that our method achieves high structure recognition accuracy in real time, and that the intelligent manipulation technique is efficient and effective. Based on these techniques, a note-taking tool, `IdeaNote', has been developed; it supports both natural note taking and efficient note editing. Besides this research on hand-drawn concept maps, I will also introduce our work on other ink-computing techniques and the systems we have developed.

  • A geometric approach to approximate continuous k-median query

    Page(s): 24 - 31
    PDF (304 KB) | HTML

    We revisit the classic k-median problem in the continuous distributed model. Rapid advances in electronic miniaturization, wireless communication, and positioning technologies have made pervasive applications of this model possible. Data sets acquired in the continuous distributed model are updated automatically and continuously, and are typically distributed over a wide area. The sequence of k-medians at successive time stamps forms a k-median series, called the continuous k-median. Our main idea is to transform the continuous k-median problem into a continuous k-median query, which applies a selection operation to the continuous k-median; because the result of this selection is a subset of the k-median series, time and communication efficiency can be achieved in the continuous distributed model. The continuous k-median query provides an insightful view of data sets along the time dimension and applies to various settings such as location-based services and sensor network monitoring. In this paper, the time efficiency of the continuous k-median query is first studied in a centralized paradigm, where an efficient indicator function is designed to suppress unnecessary re-evaluations. Communication efficiency is then addressed in a distributed paradigm, where a geometric approach is applied to suppress unnecessary communication between nodes. Our approach distinguishes itself in two aspects. First, the indicator function is built on the aggregate distribution of the data sets instead of the prevailing safe regions of individual objects, so time efficiency can be achieved. Second, a geometric approach is explored so that a single local node can trigger a re-evaluation, so communication efficiency can be obtained. Experiments empirically demonstrate the time and communication efficiency of our approach on various data sets.

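    A toy, single-process sketch of the continuous k-median query idea, assuming 2D points and Manhattan distance: a greedy medoid heuristic stands in for the exact k-median computation, and a simple cost-drift test stands in for the paper's aggregation-distribution-based indicator function and geometric method.

    ```python
    # Sketch of a continuous k-median query with a cost-based indicator function:
    # the k medians are re-evaluated only when the cost of the current medians on
    # the new snapshot drifts beyond a tolerance.  Points are assumed to be 2D
    # tuples and each snapshot is assumed to have at least k points.
    def cost(points, medians):
        return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medians)
                   for p in points)

    def greedy_k_median(points, k):
        medians = []
        for _ in range(k):                         # pick medoids one at a time
            best = min((p for p in points if p not in medians),
                       key=lambda p: cost(points, medians + [p]))
            medians.append(best)
        return medians

    def continuous_k_median(snapshots, k, tol=0.1):
        medians, base = None, None
        for points in snapshots:                   # one snapshot per time stamp
            if medians is None or abs(cost(points, medians) - base) > tol * base:
                medians = greedy_k_median(points, k)   # re-evaluate only when needed
                base = cost(points, medians)
            yield medians
    ```
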
  • A relation-based services management mechanism for service computing

    Page(s): 32 - 39
    PDF (252 KB) | HTML

    In this paper, we propose a service management mechanism based on a relational model, in which all services to be managed are represented as relations and operations on services are relational operations expressed in SQL. To define the model, we extract information from Web Services Description Language (WSDL) descriptions. To use SQL for service management, we introduce a new operator that checks assignability. With this model, we can manage services through relational operations such as service invocation, service collaboration, and service composition. It also becomes easy to combine services with data in relational databases, since users can access both via a single SQL query. We present an example implementation built on the Knowledge Cluster System, a service computing environment we are developing.

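    A toy sketch of the relational view of services described above: WSDL-derived operations are stored as rows, and a plain SQL self-join finds candidate compositions. The schema, the sqlite3 back end, and the use of simple type equality in place of the paper's assignability operator are all assumptions made for illustration.

    ```python
    # Toy sketch: represent WSDL-derived service operations as relational rows and
    # use SQL to find candidate compositions (output of one operation feeding the
    # input of another).  Plain type equality stands in for the assignability
    # operator; the schema and data are invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE services
                    (service TEXT, operation TEXT, in_type TEXT, out_type TEXT)""")
    conn.executemany("INSERT INTO services VALUES (?, ?, ?, ?)", [
        ("GeoService",     "lookupCity",  "ZipCode",  "City"),
        ("WeatherService", "getForecast", "City",     "Forecast"),
        ("MailService",    "sendReport",  "Forecast", "Receipt"),
    ])

    # Candidate two-step compositions: s1's output type matches s2's input type.
    rows = conn.execute("""SELECT s1.service, s1.operation, s2.service, s2.operation
                           FROM services s1 JOIN services s2
                           ON s1.out_type = s2.in_type""").fetchall()
    for r in rows:
        print(r)
    ```
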
  • Services in the Cloud Computing era: A survey

    Page(s): 40 - 46
    PDF (161 KB) | HTML

    Cloud Computing has become a well-known buzzword. As a brand-new infrastructure for offering services, Cloud Computing systems have many advantages over traditional service provision, such as reduced upfront investment, predictable performance, high availability, near-infinite scalability, and strong fault tolerance, and are consequently being pursued by most major IT companies, such as Google, Amazon, Microsoft, and Salesforce.com. Based on their dominance in traditional service provision and their accumulated capital, most of these companies have had earlier opportunities to adapt their services to this new environment, namely Cloud Computing systems. On the other hand, a large number of new companies have been spawned, offering competitive services that rely on those Cloud Computing systems. In terms of what they provide, this paper divides these services into six categories: Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS), Identity and Policy Management as a Service (IPMaaS), Network as a Service (NaaS), and Infrastructure as a Service (IaaS). A detailed analysis of these services is provided, along with the companies that offer each service category.

  • HMM based speech synthesis with Global Variance Training method

    Page(s): 47
    PDF (32 KB) | HTML

    Although Hidden Markov Model (HMM) based speech synthesis has been shown to perform well, several factors still degrade the quality of synthesized speech: the vocoder, model accuracy, and over-smoothing. Experimental results show that over-smoothing in the frequency domain mainly affects the quality of synthesized speech, whereas over-smoothing in the time domain can nearly be ignored. Time-domain over-smoothing is generally caused by model-structure accuracy problems, and frequency-domain over-smoothing by training-algorithm accuracy problems; ML-estimation-based parameter training causes perceptual distortion in speech synthesis. The talk introduces a Global Variance (GV) based training method into the HTS training framework. The new method tries to enlarge the variance of the generated spectrum and F0. Experiments show that the method improves synthesis performance in both voice quality and expressiveness.

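    GV-based training modifies the training and generation criterion itself; as a rough illustration of the intended effect (counteracting the reduced variance of over-smoothed spectra and F0), the sketch below merely rescales a generated parameter trajectory so that its per-dimension variance matches a target global variance.

    ```python
    # Illustrative variance-expansion post-filter: rescale each dimension of a
    # generated parameter trajectory (frames x dims) so its variance matches a
    # target global variance, keeping the mean unchanged.  This only mimics the
    # effect of GV-based training; it is not the HTS training algorithm itself.
    import numpy as np

    def match_global_variance(traj, target_var, eps=1e-8):
        mean = traj.mean(axis=0)
        cur_var = traj.var(axis=0)
        scale = np.sqrt(target_var / (cur_var + eps))
        return mean + (traj - mean) * scale

    # traj: generated mel-cepstral trajectory; target_var: GV measured on natural speech
    # enhanced = match_global_variance(traj, target_var)
    ```
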
  • A preliminary exploration on tone error detection in Mandarin based on clustering

    Page(s): 48 - 51
    PDF (105 KB) | HTML

    This paper addresses tone error detection for Mandarin Computer Assisted Language Learning (CALL) systems. A novel approach based on clustering is proposed, and the selection of different contextual tonal factors, including Uni-tone, LBi-tone, and RBi-tone, is explored. Experimental results show that the proposed approach is feasible, obtaining an Equal Error Rate (EER) of 18.75% with LBi-tone, a 2.35% reduction compared to Uni-tone, on a real corpus of nonnative speakers of Mandarin. The effects of contextual tonal factors on disyllabic words are also investigated.

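    Since the result above is reported as an Equal Error Rate, a small sketch of how an EER can be computed from detector scores may be helpful; the threshold sweep and the toy score arrays are assumptions, not the authors' evaluation protocol.

    ```python
    # Minimal EER sketch: sweep a threshold over detector scores for erroneous and
    # correct tones, and report the point where the false-acceptance and
    # false-rejection rates are (approximately) equal.  Scores are invented.
    import numpy as np

    def equal_error_rate(error_scores, correct_scores):
        # Higher score means "flag as a tone error".
        thresholds = np.sort(np.concatenate([error_scores, correct_scores]))
        best_gap, eer = 2.0, 1.0
        for t in thresholds:
            far = np.mean(correct_scores >= t)   # correct tones wrongly flagged
            frr = np.mean(error_scores < t)      # real tone errors missed
            if abs(far - frr) < best_gap:
                best_gap, eer = abs(far - frr), (far + frr) / 2
        return eer

    errors  = np.array([0.9, 0.8, 0.75, 0.4])    # detector scores on true tone errors
    correct = np.array([0.1, 0.3, 0.35, 0.7])    # detector scores on correct tones
    print(equal_error_rate(errors, correct))     # 0.25 for these toy scores
    ```
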
  • Korean pronunciation variation modeling with probabilistic Bayesian networks

    Page(s): 52 - 57
    PDF (139 KB) | HTML

    In the Korean language, a large proportion of word units are pronounced differently from their written forms, owing to the language's agglutinative and highly inflective nature and its severe phonological phenomena and coarticulation effects. This paper reports on an ongoing study of Korean pronunciation modeling, in which the mapping between phonemic and orthographic units is modeled by a Bayesian network (BN). The advantage of this graphical-model framework is that the probabilistic relationships between these symbols, as well as additional knowledge sources, can be learned in a general and flexible way, so various knowledge sources from different domains can easily be incorporated. In this preliminary study, we start with a simple topology in which the additional knowledge includes only the preceding and succeeding contexts of the current phonemic unit. The proposed BN pronunciation model is applied in our syllable-based Korean large-vocabulary continuous speech recognition (LVCSR) system, where the recognition task is built as a serial architecture composed of two independent parts. The first part performs standard hidden Markov model (HMM) based recognition of phonemic syllable units of the actual pronunciation (surface forms); in this way, the lexicon and out-of-vocabulary rate can be kept small while avoiding high acoustic confusability. In the second part, the system transforms the phonemic syllable surface forms into the desired Korean orthographic recognition units (eumjeol) using the proposed BN pronunciation model. Experimental results show that the proposed BN model maps the phonemic syllable surface forms to eumjeol transcriptions with more than 97% accuracy on average. It also helps to enhance our Korean LVCSR system, giving about a 25.53% absolute improvement on average over baseline orthographic syllable recognition.

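    A heavily simplified, count-based approximation of the context-dependent surface-to-orthography mapping described above; the one-to-one syllable alignment, the backoff rule, and the data structures are assumptions, and the sketch does not reproduce the paper's actual Bayesian-network formulation.

    ```python
    # Simplified sketch: estimate P(orthographic syllable | surface syllable, left,
    # right context) by counting aligned training triples, backing off to the
    # context-free mapping.  Assumes surface and orthographic sequences are
    # syllable-aligned one-to-one.
    from collections import Counter, defaultdict

    ctx_counts = defaultdict(Counter)    # (left, surface, right) -> Counter(orth)
    uni_counts = defaultdict(Counter)    # surface -> Counter(orth)

    def train(aligned):                  # aligned: list of (surface_seq, orth_seq)
        for surf, orth in aligned:
            padded = ["<s>"] + list(surf) + ["</s>"]
            for i, o in enumerate(orth):
                left, cur, right = padded[i], padded[i + 1], padded[i + 2]
                ctx_counts[(left, cur, right)][o] += 1
                uni_counts[cur][o] += 1

    def decode(surface_seq):
        padded = ["<s>"] + list(surface_seq) + ["</s>"]
        out = []
        for i in range(len(surface_seq)):
            left, cur, right = padded[i], padded[i + 1], padded[i + 2]
            counts = ctx_counts.get((left, cur, right)) or uni_counts.get(cur)
            out.append(counts.most_common(1)[0][0] if counts else cur)
        return out
    ```
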
  • Improving spontaneous English ASR using a joint-sequence pronunciation model

    Page(s): 58 - 61
    PDF (131 KB) | HTML

    The performance of English automatic speech recognition systems decreases when recognizing spontaneous speech, mainly because of the multiple pronunciation variants occurring in the utterances. Previous approaches address the multiple-pronunciation problem by modeling the alteration of the pronunciation at a phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not yet been considered. In this paper, we attempt to recover the original word sequence from the spontaneous phoneme sequence by applying a joint-sequence pronunciation model. In this way, the whole word sequence and its effect on the alteration of the phonemes are taken into consideration. Moreover, the system learns not only the phoneme transformation but also the mapping from phonemes to words directly. In this preliminary study, the phonemes are first recognized with the present recognition system, and the joint-sequence pronunciation variation model then maps from the phoneme to the word level. Our experiments use the Buckeye corpus of spontaneous speech. The results show that the proposed method consistently improves word accuracy over the conventional recognition system; the best system achieves up to a 12.1% relative improvement over the baseline speech recognizer.

  • Storage and index support for data intensive web applications

    Page(s): 62 - 68
    PDF (366 KB) | HTML

    In this paper, a system named DisGR (Distributed Graph Repository), designed and developed to support Chinese Web-related research, is introduced. The system is based on a graph data model, TGM (Tagged Graph Model), designed for representing Web data, especially forum and BBS data. DisGR supports the query language TGM-L, which targets analytical tasks over TGM data. For high scalability and availability, DisGR is designed for clusters with a shared-nothing architecture. DisGR has several characteristics, such as column-based storage, descriptive-language support, and flexible user-defined-function support. DisGR differs from other database systems with similar purposes in three respects. First, the catalog is maintained by a set of servers connected via a DHT overlay. Second, signatures with different granularities are used for data distribution and query optimization. Last but not least, updates are supported via timestamps and regular reorganization.

  • Materialized view maintenance in columnar storage for massive data analysis

    Page(s): 69 - 76
    PDF (379 KB) | HTML

    Data-intensive computing has become a buzzword nowadays, and current data for operational processing and historical data for massive analysis are often separated into two systems. How to keep the historical data for analysis (often kept as materialized views) consistent with their data sources (often the operational databases) is the main problem that must be solved. In this paper, we propose a novel method for maintaining data consistency between the data located in the two systems. Two basic operators for consistency maintenance (insertion and deletion) are provided, along with their implementations in a column-oriented storage environment on a large-scale data analysis platform for efficient processing. Two data consistency models (an eventual consistency model and a timeline-based consistency model) are proposed to trade off data consistency against processing efficiency. Our extensive experimental evaluation demonstrates the efficiency and effectiveness of the proposed methods.

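    To make the insertion and deletion operators concrete, the sketch below incrementally maintains a simple aggregate materialized view (per-group COUNT and SUM) from a stream of insert/delete deltas; the view definition and the delta format are invented for illustration and are not the paper's column-store implementation.

    ```python
    # Sketch of incremental materialized-view maintenance for an aggregate view
    # (per-group COUNT and SUM), driven by insert/delete deltas from the source.
    from collections import defaultdict

    view = defaultdict(lambda: [0, 0.0])    # group -> [count, sum]

    def apply_delta(op, group, value):
        cnt, total = view[group]
        if op == "insert":
            view[group] = [cnt + 1, total + value]
        elif op == "delete":
            view[group] = [cnt - 1, total - value]
            if view[group][0] == 0:
                del view[group]             # drop empty groups from the view

    for delta in [("insert", "cn", 3.0), ("insert", "cn", 5.0),
                  ("insert", "jp", 2.0), ("delete", "cn", 3.0)]:
        apply_delta(*delta)

    print(dict(view))    # {'cn': [1, 5.0], 'jp': [1, 2.0]}
    ```
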
  • Optimization of multi-join query processing within MapReduce

    Page(s): 77 - 83
    PDF (286 KB) | HTML

    MapReduce is a programming model usually applied to processing large-scale data. Many tasks can be implemented under the framework, such as data processing for search engines and machine learning. However, there is no efficient support for the join operation in current implementations of MapReduce. Prior work has studied Map-Reduce-Merge for the join operator; however, because of the time cost of the Reduce phase, we argue it is better to omit the Reduce procedure, along with the cost it brings, for the join implementation. In this paper, we design and implement a join algorithm on relational data in a MapReduce environment, and we present a method for the join operator over many relations. We conduct a series of experiments to verify the effectiveness and efficiency of the proposed methods.

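    The reduce-free join described above can be read as a map-side (replicated) hash join: the smaller relations are loaded into in-memory hash tables available to every mapper, and the large relation is joined record by record inside the map function. The sketch below simulates one such mapper in plain Python; the relation names and schemas are invented.

    ```python
    # Sketch of a reduce-free, map-side multi-way join: small relations are held in
    # in-memory hash tables keyed on the join attribute, and each record of the
    # large relation is joined inside the map function, so no Reduce phase is needed.
    def build_index(relation, key):
        index = {}
        for row in relation:
            index.setdefault(row[key], []).append(row)
        return index

    def map_join(big_row, users_by_id, items_by_id):
        # Emits one joined record per matching (user, item) pair; unmatched
        # records of the big relation are dropped (inner join).
        for u in users_by_id.get(big_row["user_id"], []):
            for it in items_by_id.get(big_row["item_id"], []):
                yield {**big_row, **u, **it}

    users = build_index([{"user_id": 1, "name": "ann"}], "user_id")
    items = build_index([{"item_id": 7, "title": "clock"}], "item_id")
    clicks = [{"user_id": 1, "item_id": 7, "ts": 100}]

    for row in clicks:                      # the "map" over the large relation
        for joined in map_join(row, users, items):
            print(joined)
    ```
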
  • Searching XML data by SLCA on a MapReduce cluster

    Page(s): 84 - 89
    PDF (585 KB) | HTML

    XML keyword search is a popular research topic, and the Smallest Lowest Common Ancestor (SLCA) concept is fundamental to XML keyword search algorithms. With the rapid growth of XML data on the Internet, we are confronted with big-data issues, and managing massive XML data has become a new research direction. Conventional centralized data management technologies are limited in efficiency, throughput, and maintenance cost. The MapReduce framework is a recent trend for processing large-scale data: it runs on clusters built from commodity machines and overcomes the limitations above through parallel computation. In this paper, we provide an SLCA-based keyword search implementation for large-scale XML data sets on a MapReduce cluster. The main steps of our implementation are XML data partitioning, parsing and sorting, index setup, and SLCA computation. We conduct experiments to evaluate the effectiveness of the proposed method.

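    For reference, the core SLCA step can be computed from the Dewey labels of the nodes containing each keyword: take the lowest common ancestors of matching node pairs and keep only those with no other candidate as a proper descendant. The single-machine sketch below shows that step only; the Dewey lists are invented and the distribution over a MapReduce cluster is not shown.

    ```python
    # Single-machine sketch of the core SLCA step in XML keyword search: nodes are
    # identified by Dewey labels (tuples of child positions), the LCA of two labels
    # is their longest common prefix, and an SLCA is a candidate LCA that has no
    # other candidate as a proper descendant.
    def lca(a, b):
        prefix = []
        for x, y in zip(a, b):
            if x != y:
                break
            prefix.append(x)
        return tuple(prefix)

    def slca(list1, list2):
        candidates = {lca(a, b) for a in list1 for b in list2}
        return [c for c in candidates
                if not any(d != c and d[:len(c)] == c for d in candidates)]

    # Dewey labels of nodes containing keyword 1 and keyword 2 (illustrative).
    k1 = [(1, 1, 1), (1, 2, 3)]
    k2 = [(1, 1, 2), (1, 3)]
    print(sorted(slca(k1, k2)))   # [(1, 1)] - deepest common containers only
    ```
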