Skip to Main Content
With the exponentially growing volume of digital documents and internet content, it becomes very challenging to locate right information when desired. We heavily rely on search engines but most existing search tools are key-word based and they often return search results with low precision and recall. The emerging semantic tagging technology provides an automatic way to generate semantic tags from text. It has drawn more and more interest from text mining research communities. It is critical to study how to utilize semantic tags to improve text mining including clustering, which helps users to enhance their experience of searching and browsing documents. Unfortunately, most previous works on text clustering merely based on content information. A few recent researches take user-generated tags into account, however user generated tags are often noisy, inconsistent, redundant and lack of semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework to generate semantic tags for given documents and then utilize the semantic tags to improve text clustering. Different from the previous works, we represent a document by a combination of domains and high quality noun phrases. We investigate the performance of our methods in two different datasets and the results are evaluated by normalized mutual information. Experiment results demonstrated that our proposed method substantially outperformed the traditional Term Frequency-Inverse Document Frequency (TF-IDF) term vector based clustering. We find that incorporating semantic information into document representation is critical to improve the performance of text clustering.