Abstract:
Enterprises often have a large number of databases and other sources of tabular data whose columns are full of domain-specific jargon (e.g., alpha-numeric codings, undeclared abbreviations) that usually requires domain experts to decode. Because unique jargon words and alpha-numeric codes are absent from the vocabularies of pre-trained language models such as Wiki2Vec [21], such models cannot readily encode the cell semantics. We propose a deep-learning-based framework, ideally suited for a serverless computing environment, that 1) uses a new tokenization method, called Cell-Masking, 2) encodes the semantics of cells into contextual embeddings that exploit the locality features of tabular data, called Cell2Vec, and 3) provides a supervised learning solution in the form of an attention-based neural network, called TableNN, that classifies cell entries into predefined column classes. We apply the proposed method to three publicly available datasets of varying sizes from different industries. Cell-Masking yields an order-of-magnitude lower loss value and the quickest convergence for cell-embedding generation. For Cell2Vec, we demonstrate that including row and column context improves embedding quality, with better loss-curve convergence and a 5.4% accuracy improvement on the BTS dataset [3].
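The abstract does not spell out how row and column context is gathered; as a purely illustrative sketch (the function name, context-windowing scheme, and example table below are assumptions, not the paper's actual Cell2Vec implementation), one way to pair each cell with its row and column neighbours as training context, masking out the target cell itself, is:

```python
# Hypothetical sketch of the row/column "locality" context idea behind
# Cell2Vec. The paper's actual Cell-Masking/Cell2Vec details are not given
# in the abstract, so everything below is an illustrative assumption.

def cell_contexts(table):
    """For each cell, collect its row and column neighbours as context,
    excluding (masking) the target cell itself."""
    pairs = []
    n_rows, n_cols = len(table), len(table[0])
    for r in range(n_rows):
        for c in range(n_cols):
            target = table[r][c]
            row_ctx = [table[r][j] for j in range(n_cols) if j != c]
            col_ctx = [table[i][c] for i in range(n_rows) if i != r]
            pairs.append((target, row_ctx + col_ctx))
    return pairs

# Example: a tiny table with jargon-like codes
table = [["A1X", "ORD-7", "NY"],
         ["B2Y", "ORD-9", "SF"]]
pairs = cell_contexts(table)
# Each pair is (target cell, surrounding row + column cells), which could
# then feed a skip-gram-style embedding objective over cell tokens.
```

Such (target, context) pairs could then be consumed by any word2vec-style training loop, with the row/column neighbours playing the role of the context window.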
Date of Conference: 15-18 December 2021
Date Added to IEEE Xplore: 13 January 2022