This paper proposes a hidden Markov model (HMM) based on semantic blocks. Semantic blocks are segmented from the information extracted from various websites, exploiting the semi-structured nature of that information. The model uses the semantic block, rather than the word, as the basic element of an observation sequence in order to improve the accuracy and efficiency of the transition matrix. It also improves the observation probability distribution and the accuracy of the estimated state sequence by adopting a “voting strategy” and modifying the Viterbi algorithm. Experimental results show that the new model and algorithms achieve satisfactory recall and precision for web information extraction.
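For orientation, the following is a minimal sketch of the standard (unmodified) Viterbi decoder applied to a sequence of block-level observations. It is not the modified algorithm or the voting strategy proposed in the paper, whose details are not given in this abstract. The state names, block labels, probabilities, and the `floor` smoothing parameter are all hypothetical values chosen for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Textbook Viterbi decoding in log space.

    Returns the most likely state sequence for `observations`.
    Probabilities below `floor` are clamped to avoid log(0); this
    clamping is an illustrative assumption, not part of the paper.
    """
    if not observations:
        return []

    def lg(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + lg(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: hidden states are target fields, and each
# observation is one semantic block (here reduced to a block-type label)
# instead of a single word.
states = ["title", "author", "abstract"]
start_p = {"title": 0.8, "author": 0.1, "abstract": 0.1}
trans_p = {
    "title": {"title": 0.1, "author": 0.8, "abstract": 0.1},
    "author": {"title": 0.05, "author": 0.25, "abstract": 0.7},
    "abstract": {"title": 0.05, "author": 0.05, "abstract": 0.9},
}
emit_p = {
    "title": {"heading_block": 0.8, "name_block": 0.1, "paragraph_block": 0.1},
    "author": {"heading_block": 0.1, "name_block": 0.8, "paragraph_block": 0.1},
    "abstract": {"heading_block": 0.05, "name_block": 0.05, "paragraph_block": 0.9},
}
blocks = ["heading_block", "name_block", "paragraph_block", "paragraph_block"]

decoded = viterbi(blocks, states, start_p, trans_p, emit_p)
print(decoded)  # ['title', 'author', 'abstract', 'abstract']
```

Because each observation here is a whole block and not a word, the observation sequence for a page is much shorter than its word sequence, so the decoder performs fewer transition steps per page.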