Doc2Cube: Allocating Documents to Text Cube Without Labeled Data

Abstract:

The data cube is a cornerstone architecture in multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck in constructing a text cube is automatically putting millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards; it is too expensive to allocate documents manually or to rely on massive amounts of labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided instead of any labeled data. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating the information from label names to other terms and enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.

Published in: 2018 IEEE International Conference on Data Mining (ICDM)

Date of Conference: 17-20 November 2018

Date Added to IEEE Xplore: 30 December 2018

DOI: 10.1109/ICDM.2018.00169

Conference Location: Singapore

I. Introduction

A text cube is a multidimensional data structure in which text documents reside, where the dimensions correspond to multiple aspects (e.g., topic, time, location) of the corpus. Text cube analysis has been demonstrated to be a powerful text analytics tool for a wide spectrum of applications in bioinformatics, healthcare, and business intelligence. For example, by organizing a news corpus into a three-dimensional topic-time-location cube, decision makers can easily browse the corpus and retrieve desired articles with simple queries (e.g., (Sports, 2017, USA)). Any text mining primitive, e.g., sentiment analysis, can then be applied to the retrieved data to gain useful insights. As another example, one can organize a corpus of biomedical research papers into a neat cube structure based on different facets (e.g., disease, gene, protein). Such a text cube allows people to easily identify relevant papers in biomedical research and acquire useful information for disease treatment.

Fig. 1: Text cube construction on a news corpus with three dimensions: topic, location, and time. Each document needs to be assigned one label in each of the three dimensions.
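
To make the cell-query idea from the introduction concrete, here is a minimal sketch of a topic-time-location cube in Python. It assumes the documents have already been allocated to cells and only illustrates the lookup side, not Doc2Cube itself. The class name TextCube, its methods, and the document IDs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# A cell is identified by one label per dimension: (topic, time, location).
Cell = Tuple[str, str, str]


class TextCube:
    """Toy text cube: maps each cell to the IDs of the documents in it."""

    def __init__(self) -> None:
        self._cells: Dict[Cell, List[str]] = defaultdict(list)

    def add(self, doc_id: str, topic: str, time: str, location: str) -> None:
        # Each document gets exactly one label in each dimension.
        self._cells[(topic, time, location)].append(doc_id)

    def query(
        self,
        topic: Optional[str] = None,
        time: Optional[str] = None,
        location: Optional[str] = None,
    ) -> List[str]:
        # None acts as a wildcard, so a query can fix any subset of dimensions.
        wanted = (topic, time, location)
        hits: List[str] = []
        for cell, docs in self._cells.items():
            if all(w is None or w == c for w, c in zip(wanted, cell)):
                hits.extend(docs)
        return sorted(hits)


# Hypothetical, hand-assigned documents (not data from the paper).
cube = TextCube()
cube.add("d1", "Sports", "2017", "USA")
cube.add("d2", "Sports", "2017", "China")
cube.add("d3", "Politics", "2017", "USA")
cube.add("d4", "Sports", "2016", "USA")

full_cell = cube.query("Sports", "2017", "USA")       # one fully specified cell
usa_2017 = cube.query(time="2017", location="USA")    # topic left as a wildcard
```

Leaving a dimension unspecified aggregates over it, which is what makes browsing by any combination of topic, time, and location cheap once the allocation step is done.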
