Recent research has demonstrated the strong performance of hidden Markov models applied to information extraction, the task of populating database slots with corresponding phrases from text documents. Training data drawn from different sources often differs in format even when the content is similar. In previous work on information extraction, all of the training data is pooled to learn the hidden Markov model parameters. The pooled data is therefore a mixture of several components, which makes it difficult for statistical learning techniques to find optimal model parameters. We present a new algorithm that uses hidden Markov models for information extraction based on multiple templates. It first clusters the training data into multiple templates according to format, then learns the model structure parameters from the clustered training data and the model emission probability parameters from the initial, unclustered training data. Experimental results show that the new algorithm outperforms the original one, which does not cluster the training data into multiple templates, in both precision and recall.
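The two-stage training described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how formats are clustered or how parameters are estimated, so the sketch makes several assumptions. Each training document is assumed to be a list of `(token, label)` pairs, documents are grouped by exact match on their field order as a stand-in for the clustering step, and probabilities are plain maximum-likelihood counts without smoothing. Emissions are estimated from all documents, which is one reading of "initial training data". All function names (`format_signature`, `cluster_by_format`, `train_multi_template`, and so on) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby


def format_signature(doc):
    """Field order of a document: its label sequence with repeated runs collapsed.

    Assumption: field order is used here as a simple proxy for 'format'.
    """
    labels = [label for _, label in doc]
    return tuple(key for key, _ in groupby(labels))


def cluster_by_format(docs):
    """Group documents with an identical signature; each group is one template."""
    templates = defaultdict(list)
    for doc in docs:
        templates[format_signature(doc)].append(doc)
    return dict(templates)


def normalize(counts):
    """Turn a Counter into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def transition_probs(docs):
    """Maximum-likelihood label-to-label transition probabilities."""
    counts = defaultdict(Counter)
    for doc in docs:
        labels = [label for _, label in doc]
        for prev, nxt in zip(labels, labels[1:]):
            counts[prev][nxt] += 1
    return {state: normalize(c) for state, c in counts.items()}


def emission_probs(docs):
    """Maximum-likelihood token emission probabilities per label."""
    counts = defaultdict(Counter)
    for doc in docs:
        for token, label in doc:
            counts[label][token] += 1
    return {state: normalize(c) for state, c in counts.items()}


def train_multi_template(docs):
    """Structure parameters per template, emission parameters from all data."""
    templates = cluster_by_format(docs)
    transitions = {sig: transition_probs(group) for sig, group in templates.items()}
    emissions = emission_probs(docs)  # shared across templates
    return transitions, emissions


# Toy data (made up for illustration): two documents share one field order,
# the third uses a different one.
docs = [
    [("Smith", "author"), ("HMMs", "title"), ("1999", "year")],
    [("Jones", "author"), ("Parsing", "title"), ("2001", "year")],
    [("Tagging", "title"), ("Lee", "author"), ("2003", "year")],
]
transitions, emissions = train_multi_template(docs)
```

On the toy data this yields two templates, each with its own transition table, while the emission table for `author` is shared and draws on all three documents. A practical system would likely need a softer clustering criterion and smoothed estimates; both are left out here to keep the sketch short.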