The diffusion of the World Wide Web (WWW) on the Internet, and the consequent increase in the production and exchange of textual information, demand the development of effective retrieval systems. The typical textual document on the WWW is defined in HTML (HyperText Markup Language), in which the document is structured into subparts by means of tags. In this paper, an approach to the indexing of HTML documents is proposed, based on the assumption that tags give the text different levels of importance with respect to the document content. A significance degree of an index term can then be computed by weighting the term's occurrences according to the “importance” associated with the tags in which they appear. In this way, the numeric significance degree of a term takes into account the author's explicit indications of the term's varying importance within the document.
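As an illustration of the idea stated in the abstract, the following is a minimal sketch of tag-weighted term scoring. It is not the paper's method: the tag weights in `TAG_WEIGHTS`, the default weight, the choice of the highest-weighted enclosing tag, and the normalization to the interval [0, 1] are all assumptions made here for the sake of a runnable example, and the names `TagWeightedIndexer` and `significance` are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical importance weights per tag (assumed; the abstract gives no values).
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "h2": 0.6, "b": 0.4, "p": 0.2}
DEFAULT_WEIGHT = 0.1  # assumed weight for text inside tags not listed above


class TagWeightedIndexer(HTMLParser):
    """Accumulates a weighted occurrence count for each term."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: an occurrence takes the highest weight among enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[term] += weight


def significance(html):
    """Return term -> degree in [0, 1], scaled by the top-scoring term."""
    indexer = TagWeightedIndexer()
    indexer.feed(html)
    indexer.close()  # flush any buffered text
    top = max(indexer.scores.values(), default=0.0)
    if top == 0.0:
        return {}
    return {term: score / top for term, score in indexer.scores.items()}


if __name__ == "__main__":
    page = (
        "<html><head><title>Fuzzy retrieval</title></head>"
        "<body><h1>Fuzzy indexing</h1><p>retrieval of documents</p></body></html>"
    )
    for term, degree in sorted(significance(page).items(), key=lambda kv: -kv[1]):
        print(f"{term}: {degree:.2f}")
```

With these assumed weights, a term that appears in both the title and a heading ranks above a term that appears the same number of times in body paragraphs only, which is the behaviour the abstract describes.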
Published in:
Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, 1996 (Volume 1)
Date of Conference: 8-11 Sep 1996