Identifying similar information across a set of related documents is a common problem in many NLP applications. In this paper, the similarity between two Chinese text units is determined from multiple features extracted from those units: word statistical features, part-of-speech features, semantic features, word density features, and text discourse structure features. In addition, a statistical method based on a logistic regression model is proposed to fuse these features automatically and compute the similarity between text paragraphs. An experiment comparing this method with two widely used methods demonstrates the effectiveness of the approach.
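As an illustration of the fusion step, the following is a minimal sketch of how a logistic regression model can combine several per-feature similarity scores into a single similarity value in (0, 1). It is not the authors' implementation: the function names (`sigmoid`, `fused_similarity`, `fit_logistic`), the toy feature values, the learning rate, and the plain batch gradient descent training loop are all assumptions made for the example. Each pair of text units is assumed to be already represented by five scores in [0, 1], one per feature family named above.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fused_similarity(features, weights, bias):
    # Weighted sum of the per-feature scores, squashed into (0, 1).
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit_logistic(samples, labels, lr=0.5, epochs=2000):
    # Batch gradient descent on the mean logistic (cross-entropy) loss.
    # lr and epochs are illustrative values, not tuned settings.
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = fused_similarity(x, weights, bias) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Hypothetical training pairs. Column order (assumed): word statistics,
# part of speech, semantics, word density, discourse structure.
samples = [
    [0.9, 0.8, 0.7, 0.6, 0.8],  # labeled similar
    [0.8, 0.7, 0.9, 0.7, 0.6],  # labeled similar
    [0.2, 0.3, 0.1, 0.2, 0.3],  # labeled not similar
    [0.1, 0.2, 0.2, 0.3, 0.1],  # labeled not similar
]
labels = [1, 1, 0, 0]

weights, bias = fit_logistic(samples, labels)

# Score a new, unseen pair of text units.
new_pair = [0.85, 0.75, 0.8, 0.65, 0.7]
print(fused_similarity(new_pair, weights, bias))
```

The learned weights play the role of the automatic fusion: features that separate similar from dissimilar pairs well in the training data receive larger weights, so no feature weighting has to be set by hand.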