Automated extraction of non <h>-tagged headers in webpages by decision trees

Abstract:

Guideline #5.2a of JIS X 8341-3 recommends that authors “represent headings with heading elements instead of difference in font size, etc”. Thus, when checking webpage accessibility, headers that are not tagged with heading tags (<h1>-<h6>) should be extracted as problems. In this paper, we propose a method for this extraction. Our idea is to let a machine learning method automatically derive extraction rules from problem instances on the web. We define 26 attributes of HTML elements for deriving the rules. The values of these attributes are calculated by parsing the HTML source of the webpage. The accuracy of our method was evaluated by 10-fold cross-validation on data we collected from the web. The accuracy was 85-88% on average in terms of F-measure. Non <h>-tagged image headers were discriminated slightly better than non <h>-tagged text headers.
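
As a rough illustration of the pipeline the abstract outlines (parse the HTML, compute attributes per element, apply a rule, score with F-measure), here is a minimal Python sketch. It is not the authors' implementation: the abstract does not list the 26 attributes, so the four attributes below (`bold_style`, `sets_font_size`, `has_image`, `text_length`) are hypothetical, and `looks_like_header` is a hand-written stand-in for a rule that the paper would learn with a decision tree.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ElementAttributes(HTMLParser):
    """Collect a few illustrative (assumed) attributes per non-heading element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.rows = []   # one attribute dict per closed non-heading element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void elements have no end tag; only record images on the parent.
            if tag == "img" and self.stack:
                self.stack[-1]["has_image"] = True
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        self.stack.append({
            "tag": tag,
            "bold_style": "font-weight:bold" in style,
            "sets_font_size": "font-size:" in style,
            "has_image": False,
            "text": "",
        })

    def handle_data(self, data):
        # Text is credited to the innermost open element only.
        if self.stack:
            self.stack[-1]["text"] += data

    def handle_endtag(self, tag):
        # Mismatched end tags are ignored in this sketch.
        if self.stack and self.stack[-1]["tag"] == tag:
            row = self.stack.pop()
            row["text"] = row["text"].strip()
            row["text_length"] = len(row["text"])
            if tag not in HEADING_TAGS:
                self.rows.append(row)


def looks_like_header(row):
    """Hand-written stand-in for a learned decision-tree rule (assumption)."""
    styled = row["bold_style"] or row["sets_font_size"]
    return styled and 0 < row["text_length"] <= 60


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


html = (
    '<div><p style="font-weight: bold; font-size: 18px">News</p>'
    "<p>Body text here.</p><h2>Real heading</h2></div>"
)
parser = ElementAttributes()
parser.feed(html)
parser.close()
predictions = [looks_like_header(row) for row in parser.rows]
```

In this example only the first `<p>`, which is styled like a heading but not tagged as one, is flagged; the properly tagged `<h2>` is excluded from the candidates. In a setup closer to the paper's, the attribute rows would be fed to a decision-tree learner and `f_measure` would be averaged over the 10 cross-validation folds.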

Published in: SICE Annual Conference 2011

Date of Conference: 13-18 September 2011

Date Added to IEEE Xplore: 27 October 2011

Conference Location: Tokyo, Japan
