
2008 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08)

Date: 19-22 Oct. 2008


Displaying Results 1 - 25 of 73
  • TE4AV: Textual Entailment for Answer Validation

    Publication Year: 2008 , Page(s): 1 - 8

    The textual entailment (TE) task consists of discovering unidirectional semantic inferences between the meanings of two text snippets. Taking advantage of this, in this paper we propose using a TE system as an answer validation (AV) engine to improve the performance of question answering (QA) systems and to help humans assess QA systems' outputs. To achieve these aims, and in order to assess the overall performance of our TE system and its application to QA tasks, two evaluation environments are presented: pure entailment and QA-response evaluation. The former uses the corpus and methodology of the PASCAL Recognizing Textual Entailment challenges, whereas for the latter we use the data provided by the Answer Validation Exercise competition within the Cross-Language Evaluation Forum. The system, the evaluation environments, and the experiments developed are discussed throughout the paper.
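    To make the idea concrete, the sketch below shows the general shape of entailment-based answer validation: build a hypothesis from the question plus a candidate answer, and accept the answer only if the supporting snippet "entails" it. The word-overlap score and the threshold are illustrative baseline assumptions, not the authors' TE system.

```python
import re

def tokens(text: str) -> set:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def entailment_score(text: str, hypothesis: str) -> float:
    """Toy TE baseline: fraction of hypothesis words covered by the text."""
    hyp = tokens(hypothesis)
    return len(hyp & tokens(text)) / len(hyp) if hyp else 0.0

def validate_answer(snippet: str, question: str, answer: str,
                    threshold: float = 0.9) -> bool:
    """Accept the candidate answer only if the snippet entails the
    hypothesis formed from question + answer."""
    return entailment_score(snippet, f"{question} {answer}") >= threshold

snippet = "Madrid is the capital and largest city of Spain."
print(validate_answer(snippet, "capital of Spain", "Madrid"))     # True
print(validate_answer(snippet, "capital of Spain", "Barcelona"))  # False
```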

  • Automatic clustering of part-of-speech for vocabulary divided PLSA language model

    Publication Year: 2008 , Page(s): 1 - 7

    PLSA is one of the most powerful language models for adapting to target speech. The vocabulary-divided PLSA language model (VD-PLSA) shows higher performance than the conventional PLSA model because it can be adapted to the target topic and the target speaking style individually. However, the entire vocabulary must be manually divided into three categories (topic, speaking style, and general). In this paper, an automatic method for clustering parts-of-speech (POS) is proposed for VD-PLSA. Several corpora with different styles are prepared, and the distance between corpora in terms of POS is calculated. A "general tendency score" and a "style tendency score" for each POS are computed from these distances, and all POS are divided into the three categories using the two scores and appropriate thresholds. Experimental results showed that the proposed method formed appropriate clusters, and VD-PLSA with the acquired categories gave the highest performance among all models compared. We also applied VD-PLSA to a large-vocabulary continuous speech recognition system: it improved recognition accuracy for documents with a lower out-of-vocabulary ratio, while accuracy on the remaining documents was unchanged or slightly decreased.
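    As a rough illustration of the clustering idea (my reading of the abstract, with invented score definitions, thresholds, and numbers), one can measure how much a POS tag's relative frequency varies across style-distinct corpora: stable tags look "general", volatile ones look style- or topic-specific.

```python
from statistics import mean, pstdev

# Toy relative frequencies of each POS tag across four style-distinct
# corpora (invented numbers, for illustration only).
pos_freq = {
    "common_noun": [0.22, 0.21, 0.23, 0.22],   # stable across styles
    "filler":      [0.01, 0.09, 0.02, 0.10],   # style-dependent
    "topic_noun":  [0.15, 0.02, 0.14, 0.03],   # topic-dependent
}

def variation(freqs):
    """Coefficient of variation: a stand-in 'tendency score'."""
    m = mean(freqs)
    return pstdev(freqs) / m if m else 0.0

for pos, freqs in pos_freq.items():
    cv = variation(freqs)
    category = "general" if cv < 0.3 else "style/topic-specific"
    print(f"{pos:12s} cv={cv:.2f} -> {category}")
```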

  • A comparison of Relational Databases and information retrieval libraries on Turkish text retrieval

    Publication Year: 2008 , Page(s): 1 - 8
    Cited by:  Papers (2)

    The present work compares the text retrieval performance of relational databases and IR systems on a TREC-like test collection for Turkish. The effects of language-specific preprocessing and of different query lengths are investigated for the different retrieval systems. The results show that language-specific preprocessing improves retrieval performance for all systems, and that relational databases are slower with longer queries.

  • To extract Ontology attribute value automatically based on WWW

    Publication Year: 2008 , Page(s): 1 - 7
    Cited by:  Papers (1)

    Attribute values are among the most important information for describing an ontology; however, little research has been done on attribute value extraction so far. This paper proposes a method for extracting ontology attribute values automatically from the WWW. First, a seed-set-based method is described in which the selection of relevant sentences containing attribute values and the extraction of the attribute values themselves interact, so that the target attribute value set can be extracted and expanded by exploiting the redundancy of the WWW. Second, we construct the seed set automatically instead of by hand. Finally, we build hierarchical clusters of the candidate attribute values to obtain more accurate and complete results. Experiments have been carried out to measure precision and recall, and the automatically enriched ontology information is applied to Web page content extraction to demonstrate its usefulness.

  • Bootstrapping word alignment by automatically generated bilingual dictionary

    Publication Year: 2008 , Page(s): 1 - 7

    This paper presents a new approach to improving word alignment. Building a bilingual dictionary is one of the main applications of word alignment; however, there has been little research on using a bilingual dictionary to improve word alignment itself. There are two bottlenecks: first, a large bilingual dictionary is hard to obtain; second, the usual way of using a bilingual dictionary does not exploit it fully. We designed a bootstrapping algorithm to overcome these bottlenecks and achieved good results.
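    The following is a hedged sketch of one way such a bootstrap could work (an assumption-laden reading of the abstract, not the authors' algorithm): anchor alignments with the current dictionary, harvest frequently co-occurring unaligned word pairs as new entries, and iterate until the dictionary stops growing.

```python
from collections import Counter

def align_with_dict(src, tgt, dictionary):
    """Anchor alignments for word pairs the dictionary already covers."""
    return [(i, j) for i, s in enumerate(src) for j, t in enumerate(tgt)
            if (s, t) in dictionary]

def harvest(bitext, dictionary, min_count=2):
    """Count co-occurrences of still-unaligned words; frequent pairs
    become candidate dictionary entries."""
    counts = Counter()
    for src, tgt in bitext:
        anchored = align_with_dict(src, tgt, dictionary)
        used_s = {i for i, _ in anchored}
        used_t = {j for _, j in anchored}
        for i, s in enumerate(src):
            if i in used_s:
                continue
            for j, t in enumerate(tgt):
                if j not in used_t:
                    counts[(s, t)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}

def bootstrap(bitext, seed):
    """Grow the dictionary until a fixed point is reached."""
    dictionary = set(seed)
    while True:
        new = harvest(bitext, dictionary) - dictionary
        if not new:
            return dictionary
        dictionary |= new

bitext = [(["the", "cat"], ["le", "chat"]),
          (["the", "dog"], ["le", "chien"]),
          (["a", "cat"], ["un", "chat"])]
print(bootstrap(bitext, {("the", "le")}))  # learns ("cat", "chat")
```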

  • Semantic keywords-based duplicated web pages removing

    Publication Year: 2008 , Page(s): 1 - 7
    Cited by:  Patents (1)

    Because many duplicated web pages exist on the web, search engines need to find and remove them, both to save processing time and hardware resources and to ensure that users get results without many replicas. In this paper, we propose a method for finding and removing duplicated Chinese web pages for a search engine. We first describe a scheme based on semantic keywords combined with sentence overlap, and then present an implemented prototype, with experimental results suggesting that the prototype works well under proper settings.
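    A minimal sketch of the flavor of duplicate test described above, combining keyword-set similarity with sentence-level overlap; the weights and threshold are placeholders, not the paper's tuned values.

```python
def jaccard(a: set, b: set) -> float:
    """Set overlap: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicate(keywords1, keywords2, sents1, sents2,
                   w_kw=0.6, w_sent=0.4, threshold=0.8) -> bool:
    """Combine keyword similarity and sentence overlap into one score."""
    kw_sim = jaccard(set(keywords1), set(keywords2))
    sent_sim = jaccard(set(sents1), set(sents2))
    return w_kw * kw_sim + w_sent * sent_sim >= threshold

page_a = ({"nlp", "search", "dedup"}, {"s1", "s2", "s3"})
page_b = ({"nlp", "search", "dedup"}, {"s1", "s2", "s4"})
print(near_duplicate(*page_a, *page_b))  # True: 0.6*1.0 + 0.4*0.5 = 0.8
```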

  • Recognizing location names from Chinese texts based on Max-Margin Markov Network

    Publication Year: 2008 , Page(s): 1 - 7
    Cited by:  Papers (1)

    This paper presents a novel method for recognizing location names in Chinese texts based on the max-margin Markov network (M3Net), chosen for its ability to exploit very high-dimensional feature spaces (via the kernel trick) while handling structured data, in contrast to support vector machines (SVMs) and conditional random fields (CRFs). In our model, the character itself, character-based part-of-speech (POS) tags, whether a character appears in a table of words characteristic of location names, and context information are extracted as features. The F-measure reaches 90.57% with a first-order M3Net, better than that of either SVM or CRFs in an open test on the MSRA dataset.

  • RCSUM: To build a summarization system directly generating summaries with evaluation metrics

    Publication Year: 2008 , Page(s): 1 - 6

    Several automated metrics have been adopted by the Document Understanding Conference (DUC) in recent years, as they offer great advantages in the area of summarization. However, there are still no evaluation metrics that can be used directly to generate summaries, mainly because human reference summaries are indispensable to these metrics. Here we report RCSUM, a summarization system developed by our group that generates summaries directly from evaluation metrics. RCSUM builds on ROUGE-C, a fully automated evaluation metric recently developed in our group, which applies ROUGE in an entirely reference-free manner while preserving the ability to distinguish good summaries from bad ones. Experiments conducted on the 2001 to 2005 DUC data showed that, evaluated by ROUGE-2 as well as ROUGE-SU4, our system performed very well.

  • Toward a Robust data fusion for document retrieval

    Publication Year: 2008 , Page(s): 1 - 8

    This paper describes an investigation of signal-boosting techniques for post-search data fusion, where the quality of the retrieval results involved in fusion may be low or diverse. The effectiveness of data fusion in such situations depends on the ability of the fusion technique to boost the signals from relevant documents and to reduce the effect of noise, which often comes from low-quality retrieval results. Our studies on the Malach spoken document collection and the HARD collection demonstrate that CombMNZ, the most widely used data fusion method, lacks this ability. We therefore developed two signal-boosting mechanisms on top of CombMNZ, resulting in two new fusion methods, WCombMNZ and WCombMWW. To examine their effectiveness, we conducted experiments on the Malach and HARD document collections. Our results show that the new methods can significantly outperform CombMNZ when combining retrieval results of low and diverse quality. When combining retrieval results of similar quality, the scenario in which CombMNZ is most often applied, the two new methods still often obtain better, sometimes significantly better, fusion results.
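    CombMNZ itself is standard and easy to state in code: sum a document's normalized scores across runs, then multiply by the number of runs that retrieved it. The weighted variant below only gestures at the shape of WCombMNZ (per-run quality weights); the paper's actual formulas may differ.

```python
from collections import defaultdict

def comb_mnz(runs):
    """runs: list of {doc_id: normalized_score} dicts, one per system."""
    fused = defaultdict(float)
    hits = defaultdict(int)
    for run in runs:
        for doc, score in run.items():
            fused[doc] += score
            hits[doc] += 1
    return {doc: fused[doc] * hits[doc] for doc in fused}

def weighted_comb_mnz(runs, weights):
    """Boost signals from runs believed to be of higher quality
    (a guessed shape for WCombMNZ, not the paper's formula)."""
    fused = defaultdict(float)
    hits = defaultdict(int)
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            fused[doc] += w * score
            hits[doc] += 1
    return {doc: fused[doc] * hits[doc] for doc in fused}

run_a = {"d1": 0.9, "d2": 0.4}
run_b = {"d1": 0.8, "d3": 0.7}
print(sorted(comb_mnz([run_a, run_b]).items(), key=lambda x: -x[1]))
# d1 appears in both runs, so its summed score is doubled.
```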

  • Semantic similarity computation based on HowNet2008

    Publication Year: 2008 , Page(s): 1 - 5
    Cited by:  Papers (2)

    Semantic similarity is a fundamental concept, widely researched and used in natural language processing. By analyzing the definitions of concepts in HowNet2008, this paper proposes a new method of semantic similarity calculation. Concepts are classified into three classes: simple, complex, and combined. For each class we design a different method, transforming the similarity calculation of concepts into the similarity calculation of sememes. Sememe similarity is computed from the hyponymy relations in the sememe tree. Experiments show that the new approach is effective for similarity calculation and outperforms conventional approaches.
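    A common way to compute sememe similarity from a hyponymy tree, shown below, is to let similarity decay with path distance, sim = alpha / (alpha + d); this follows the widely used Liu-and-Li-style formulation for HowNet, which may differ in detail from the paper's method.

```python
def depth_path(node, parent):
    """Path from a sememe up to the root, following parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def sememe_similarity(s1, s2, parent, alpha=1.6):
    """Similarity decays with the distance between sememes in the tree."""
    p1, p2 = depth_path(s1, parent), depth_path(s2, parent)
    ancestors1 = {n: i for i, n in enumerate(p1)}
    for j, n in enumerate(p2):
        if n in ancestors1:                 # lowest common ancestor
            dist = ancestors1[n] + j
            return alpha / (alpha + dist)
    return 0.0

# Toy hyponymy tree: child -> parent.
parent = {"dog": "animal", "cat": "animal",
          "animal": "entity", "rock": "entity"}
print(sememe_similarity("dog", "cat", parent))   # close: 1.6/(1.6+2)
print(sememe_similarity("dog", "rock", parent))  # farther apart
```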

  • Answer extraction based on query expansion in Chinese question answering system

    Publication Year: 2008 , Page(s): 1 - 4

    When a natural language question is used to retrieve documents, query expansion is a key factor affecting retrieval performance. After analyzing traditional query expansion methods, this paper puts forward a query expansion method based on set theory for answer document retrieval. To verify the validity of the method, a similarity calculation between the question and candidate answer sentences is proposed for extracting the answer, in which major and minor query expansion terms are given different weights. The experimental results show a substantial performance improvement.
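    The weighting idea can be illustrated with a small sketch: score each candidate sentence by its coverage of major and minor expansion terms, with the minor terms discounted. The weights below are placeholders, not the paper's values.

```python
import re

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def sentence_score(sentence, major_terms, minor_terms,
                   w_major=1.0, w_minor=0.4):
    """Weighted coverage of expansion terms by a candidate sentence."""
    words = tokenize(sentence)
    hit_major = sum(1 for t in major_terms if t in words)
    hit_minor = sum(1 for t in minor_terms if t in words)
    denom = w_major * len(major_terms) + w_minor * len(minor_terms)
    return (w_major * hit_major + w_minor * hit_minor) / denom if denom else 0.0

major = {"capital", "france"}
minor = {"city", "paris", "seat"}  # hypothetical expansion terms
print(sentence_score("Paris is the capital city of France.", major, minor))
# -> 0.875
```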

  • Emotion recognition from blog articles

    Publication Year: 2008 , Page(s): 1 - 8
    Cited by:  Papers (1)

    Suicide among college students has become a worldwide phenomenon, and the problem has grown more and more severe because of complex and intense competition. With the popularization of the Internet and the development of information processing technologies, many people have established their own blogs to write down their experiences and express their feelings. It would be very helpful if computers could automatically recognize the emotions expressed in blog pages; teachers or psychological consultants could then monitor the affective state of college students and take preventive measures against depression when necessary. Owing to advances in affective computing and natural language processing, researchers worldwide have begun to pay more attention to emotion recognition in NLP. This paper outlines the approach we have developed to construct a blog emotion-recognition system, based on the lexical content of words and the structural characteristics of blog articles. For computing the emotion of an article, two methods are proposed, and their experimental results are compared and analyzed. Finally, the implications of the results for future research directions are discussed.

  • Resources for Nepali Word Sense Disambiguation

    Publication Year: 2008 , Page(s): 1 - 5
    Cited by:  Papers (1)

    Word sense disambiguation (WSD) is the process of identifying the proper meaning of a word that has multiple meanings. It is regarded as one of the most challenging problems in natural language processing (NLP). Nepali also has words with multiple meanings, giving rise to the WSD problem. In this paper, we investigate the impact of NLP resources such as a morphological analyzer (MA) and a machine-readable dictionary (MRD) on ambiguity resolution. Our results show that WSD accuracy improves with the availability of such resources. The Lesk algorithm is used to solve the WSD problem with a sample Nepali WordNet containing a few sets of Nepali nouns, and the system can disambiguate these nouns only. The system was tested on a small data set with a limited number of nouns; accuracy ranged between 50% and 70% depending on the sample data provided. When the same data was processed with manual morphological analysis, accuracy was considerably higher (80%).
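    Simplified Lesk, the algorithm the paper applies, picks the sense whose dictionary gloss overlaps most with the word's context; the sketch below uses toy English glosses as stand-ins for entries a Nepali MRD or WordNet would supply.

```python
def simplified_lesk(context_words, senses):
    """senses: {sense_id: gloss string}; returns the best-matching sense."""
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense_id, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense_id, overlap
    return best

senses = {
    "bank#1": "financial institution that accepts deposits and lends money",
    "bank#2": "sloping land beside a body of water such as a river",
}
print(simplified_lesk("he deposits money in the institution".split(), senses))
# -> bank#1
```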

  • NP tree matching for English-Chinese translation of patent titles

    Publication Year: 2008 , Page(s): 1 - 7

    This paper proposes an NP tree matching method for English-Chinese translation of patent titles. First, a bilingual example database of patent titles is built; English parse trees produced by an English parser form an NP tree database. An input patent title is first parsed into a tree, and the NP tree database is then searched for NP trees matching the input tree. If similar NP trees exist, HowNet is used to select the best one by calculating word semantic similarity, and the final translation is obtained by calculating the cohesion of candidate words. If no similar NP trees exist, subtrees matching the input NP tree are searched for and translations are generated by recursive subtree substitution. Experimental results under BLEU evaluation show that our method outperforms the Pharaoh baseline.

  • The formalization of ‘temporal adverbials +ZHE imperfective’ sentences

    Publication Year: 2008 , Page(s): 1 - 8

    The ‘ZHE’ imperfective has long been a difficult problem in linguistic research. However, up to now, few studies have addressed the semantic meanings of ‘temporal adverbials + ZHE imperfective’ sentences and their formalization. This paper addresses the formalization of the ‘ZHE’ imperfective and its combinations with temporal adverbials through automatic parsing using CTT (Copenhagen Tree Tracer). The paper first shows, through a corpus-based study, that the ‘temporal adverbials + ZHE imperfective’ structure has three types of viewpoints. It then reveals their formalized structures, and finally presents the formalizations of the ‘ZHE’ imperfective and its combinations with temporal adverbials.

  • Automatic identification of non-anaphoric anaphora in spoken dialog

    Publication Year: 2008 , Page(s): 1 - 6

    Identification of non-anaphoric anaphora is an important step towards full anaphora resolution. In this paper, we present an automatic identification approach for this task. We propose several novel features based on dependency grammar, surrounding words, and their POS tags; all features are extracted automatically using a part-of-speech (POS) tagger and a dependency parser. Our experiments use a commonly available dialogue corpus, Trains-93, with several machine learning algorithms, including CME, CRF, and SVM. Results show that, compared to the approaches used in previous work, our algorithm is simpler and achieves higher accuracy.

  • Measuring word polysemousness and sense granularity at a language level

    Publication Year: 2008 , Page(s): 1 - 7

    Word sense acquisition and distinction are key issues for both lexicography and lexical semantic processing. However, it is quite difficult to acquire word senses automatically and then evaluate the results against lexica, since different lexica often embody different decisions about word sense distinction and granularity. In this paper, we put forward the idea of measuring word polysemousness and sense granularity at the language level. Two methods, MECBC and TIEM, are first employed to extract Chinese word senses from corpora. Automatic mapping of the word senses to the lexica and evaluation of the results are then devised and realized. Our experiments show a rather good fit between our results and the lexica for Chinese word polysemousness at the whole-language level. Comparison of sense granularity between different lexical semantic resources can hence be made on a sound basis.

  • Exploiting salient word dependency for Chinese NP identification: A study on classifier noun phrase

    Publication Year: 2008 , Page(s): 1 - 7

    NP identification is a challenging subtask of NLP. The reported literature mainly focuses on base noun phrases and maximal-length noun phrases, treating identification as a sequence labeling problem. In this paper, unlike existing perspectives, we concentrate on a special subcategory of Chinese NPs, the classifier noun phrase (CNP), and present a new approach that uses salient word dependencies, such as classifier-noun collocations, for CNP identification. The experimental results are encouraging. Our study shows that salient relations between words should be fully utilized in NP identification as well as in other NLP applications.

  • Automated essay scoring using set of literary sememes

    Publication Year: 2008 , Page(s): 1 - 5

    Automatic essay scoring (AES) systems are important research tools for many educational studies. Much research indicates that AES systems should analyze the semantic characteristics of an essay and include more such features when scoring. This paper makes an assumption: some concepts, which can be regarded as literary concepts, are used only by skillful writers. However, extracting literary concepts is difficult because training corpora are small. This work uses a semantic network tool to overcome the problem: the concepts in an essay are transformed into sememes, and literary concepts are likewise transformed into literary sememes. We introduce a method that scores an essay using the literary sememes it contains. Experimental results show that the accuracy of the proposed method on Chinese essays is comparable to that achieved by several current English AES systems.
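    As a toy illustration of the scoring idea (the sememe mapping and the linear scale below are invented, not the paper's model), the share of an essay's sememes that fall in a "literary" set can feed a simple score:

```python
def literary_ratio(essay_sememes, literary_set):
    """Fraction of the essay's sememes that are 'literary'."""
    if not essay_sememes:
        return 0.0
    hits = sum(1 for s in essay_sememes if s in literary_set)
    return hits / len(essay_sememes)

def score_essay(essay_sememes, literary_set, max_score=6):
    """Map the literary-sememe ratio onto a 0..max_score scale
    (placeholder scaling, for illustration only)."""
    ratio = literary_ratio(essay_sememes, literary_set)
    return round(max_score * min(1.0, 4 * ratio))

literary = {"elegant", "metaphor", "serene"}
print(score_essay(["tree", "serene", "run", "metaphor"], literary))  # -> 6
```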

  • Integrate statistical model and lexical knowledge for Chinese multiword chunking

    Publication Year: 2008 , Page(s): 1 - 8

    Multiword chunking is a shallow parsing technique designed to recognize the external constituent and internal relation tags of chunks in a sentence. In this paper, we propose a new solution to this problem. We design a new relation tagging scheme to represent different intra-chunk relations and run several feature engineering experiments to select the best baseline statistical model. We also apply outside knowledge from a large-scale lexical relationship knowledge base to improve parsing performance. Integrating all of these techniques, we develop a new Chinese MWC parser. Experimental results show that its parsing performance greatly exceeds that of a rule-based parser trained and tested on the same data set.

  • Utterance templates merging in automaton-based dialogue systems

    Publication Year: 2008 , Page(s): 1 - 7

    Many dialog systems implemented in industry are based on a state automaton. Most of these systems rely on predefined messages, where a message is an ordered set of utterance templates, to produce the output message. The automaton-based dialog manager computes the correspondence of a particular predefined message to a given user request. However, a rather common event in a dialog system's workflow is the dialog manager selecting multiple messages in response to a single user request. To produce an appropriate output message, the templates of all selected messages need to be merged and restructured. In this paper, we introduce a natural language generation (NLG) module into the automaton-based dialog system to perform this task.

  • A chunk-based reordering model for phrase-based SMT systems

    Publication Year: 2008 , Page(s): 1 - 6

    This paper proposes a novel reordering model based on the reordering of source-language chunks. The model is used as a preprocessing step for phrase-based translation models and integrates well with them. At the same time, because the model is chunk-based, syntactic information can be taken into account during reordering without requiring a full parse of the source sentence. Two experiments were carried out, and the results show that the proposed model can greatly improve the performance of a phrase-based statistical machine translation (SMT) system.

  • Improving Latent Semantic Indexing with concepts mapping based on domain ontology

    Publication Year: 2008 , Page(s): 1 - 6

    The "curse of dimensionality" is a common problem in information retrieval. It has been verified that when points in a vector space are projected onto a random subspace of suitably high dimension, the distances between the points are approximately preserved. Although such a random projection can be used to reduce the dimension of the document space, it does not bring together semantically related documents. Latent Semantic Indexing (LSI) projects documents from the high-dimensional term space into a lower-dimensional LSI space using singular value decomposition (SVD), both reducing the dimensionality of the document space and bringing together semantically related documents. But the computation time of the SVD is a bottleneck because of the high dimensionality of the documents. In this paper, a novel dimension reduction method for improving LSI is presented. A term-to-concept projection matrix based on a domain ontology is created, and documents are projected into a lower-dimensional concept space by this matrix. The LSI pre-computation is then performed not on the original term-by-document matrix but on the lower-dimensional concept-by-document matrix, at great computational savings. Experiments indicate that this method improves the efficiency of LSI without disturbing similarity judgments between documents.
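    The pipeline is easy to sketch with NumPy: project the term-by-document matrix into a smaller concept space via an ontology-derived term-to-concept matrix, then run the SVD there. The random matrices below merely stand in for real data and a real ontology mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_docs, n_concepts, k = 5000, 800, 300, 50

A = rng.random((n_terms, n_docs))        # term-by-document matrix
P = rng.random((n_concepts, n_terms))    # term-to-concept projection
P /= P.sum(axis=1, keepdims=True)        # normalize rows

C = P @ A                                # concept-by-document matrix
# The SVD now runs on a 300 x 800 matrix instead of 5000 x 800.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T     # documents in k-dim LSI space

# Cosine similarity between the first two documents in LSI space.
d0, d1 = docs_k[0], docs_k[1]
print(d0 @ d1 / (np.linalg.norm(d0) * np.linalg.norm(d1)))
```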

  • A question answering system based on VerbNet frames

    Publication Year: 2008 , Page(s): 1 - 8

    Precise answers are now essential for a question answering system, because of the large amount of free text on the Internet. Aiming at high precision, we propose a question answering system supported by case grammar theory and based on VerbNet frames. It extracts syntactic, thematic, and semantic information from the question to filter out unmatched sentences at the semantic level and to extract the answer chunk (a phrase or word that answers the question) from the answer sentence. VerbNet is applied to detect the verb frames in the question and in candidate sentences, so that syntactic and thematic as well as semantic information can be obtained. Our question answering system works especially well for factoid questions. The experiments show that our approach filters out semantically unmatched sentences effectively and therefore ranks the correct answer(s) higher in the result list.

  • The effects of high quality translations of named entities in cross-language information exploration

    Publication Year: 2008 , Page(s): 1 - 8

    Named entities (NEs) are the expressions in human languages that explicitly link notations in language to entities in the real world. They play an important role in cross-language information retrieval (CLIR) because most user requests have been found to contain NEs, and the majority of out-of-vocabulary terms are NEs; missing their translations therefore has a significant impact on retrieval effectiveness. In this paper, we examine the effect of high-quality NE translations in event-driven information exploration, where NEs are even more common. Focusing on NE translations obtained using information extraction (IE) techniques, we conducted several experiments on TDT test collections. Our results demonstrate that NEs and their translations play critical roles in improving CLIR effectiveness, and that using high-quality NE translations obtained by IE techniques has a positive impact on CLIR.
