The thousands of specialized structured file formats in use today present a substantial barrier to the free exchange of information between application programs. We consider the problem of deducing basic features of a given file format, such as its whitespace characters, bracketing delimiter symbols, and self-delimiter characters, from one or more example files. We demonstrate that, given sufficiently large example files, these basic features can typically be identified.
Published in: Third IEEE International Conference on Data Mining (ICDM 2003)
Date of Conference: 19-22 Nov. 2003