Web Information Extraction Based on Block Importance Model and 2D Conditional Random Fields

Journal of Nanjing University (Natural Sciences), 2014, Vol. 50, Issue 1: 79–.

Wu Qin, Hu Lijuan, Liang Jiuzhen*

Online: 2014-01-16 Published: 2014-01-16
About the authors: School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, China
Funding: National Natural Science Foundation of China (61202312, 61170121)

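Abstract

Summary: Page segmentation reduces the unit of Web information extraction from the whole page to the individual block. Combining the strengths of the block importance model and two-dimensional conditional random fields, this paper proposes a method for extracting Web object information. The method uses the block importance model to label each page block with an importance level, which filters out a large amount of topic-irrelevant information and locates the information to be extracted more precisely. Compared with the traditional linear conditional random fields model, the two-dimensional model better fits the two-dimensional structure of page blocks and improves extraction accuracy. Experimental results show that the method works well for Web object information extraction.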
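The structural difference between the two model families can be illustrated by counting which neighbor dependencies each one keeps. The sketch below is only an illustration, not the authors' model: it assumes page blocks sit on a regular rows × cols grid (real segmentations need not be regular), and the function names `linear_chain_edges`, `grid_edges`, and `vertical_edges_lost` are hypothetical.

```python
def linear_chain_edges(rows, cols):
    """Edges kept after flattening the grid row by row into one sequence."""
    order = [(r, c) for r in range(rows) for c in range(cols)]
    return {(order[i], order[i + 1]) for i in range(len(order) - 1)}


def grid_edges(rows, cols):
    """Edges of a 2D lattice: each cell links to its right and lower neighbor."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.add(((r, c), (r, c + 1)))
            if r + 1 < rows:
                edges.add(((r, c), (r + 1, c)))
    return edges


def vertical_edges_lost(rows, cols):
    """Same-column neighbor pairs present in the lattice but absent from the chain."""
    chain = linear_chain_edges(rows, cols)
    return {
        e for e in grid_edges(rows, cols)
        if e[0][1] == e[1][1] and e not in chain
    }
```

For a 3 × 3 grid, the flattened chain keeps 8 edges while the lattice has 12, and all 6 vertical neighbor pairs are missing from the chain. These are the dependencies a two-dimensional model can represent directly.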

Abstract: Traditional methods for Web information extraction are mainly based on the linear conditional random fields model, which first converts the two-dimensional structure of a Web page into a one-dimensional sequence and then extracts information. One shortcoming of such methods is that they overlook the two-dimensional dependencies among Web objects, which may reduce the effectiveness of Web information extraction. In this paper, a Web information extraction method is proposed based on the block importance model and two-dimensional conditional random fields (2D CRFs). The proposed method first applies the vision-based page segmentation algorithm to divide the Web page into blocks. A support vector machine is then used to learn the importance of the blocks, and unimportant blocks are removed so that Web objects can be identified more accurately. Finally, the 2D CRFs model is applied to extract information from the Web page. Compared with the linear conditional random fields model, the 2D CRFs model better fits the two-dimensional structure of Web page blocks, which improves the accuracy of Web information extraction. Experiments on two datasets show that the proposed method performs well on Web information extraction.
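The block-filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a linear kernel, so that prediction reduces to the sign of w · x + b, and the features (relative area, link density), the weights, and the names `Block`, `decision_value`, and `keep_important` are all hypothetical. A real system would learn the parameters from labeled blocks.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    features: tuple  # hypothetical features: (relative area, link density)


def decision_value(features, weights, bias):
    """Linear decision function w . x + b, as used by a linear SVM at prediction time."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def keep_important(blocks, weights, bias):
    """Keep blocks whose decision value is positive; drop the rest."""
    return [b for b in blocks if decision_value(b.features, weights, bias) > 0]


# Hand-set, illustrative parameters (assumption); not learned from data.
WEIGHTS = (2.0, -3.0)
BIAS = -0.5

blocks = [
    Block("Product name and price", (0.6, 0.1)),
    Block("Navigation bar", (0.1, 0.9)),
    Block("Footer links", (0.05, 0.8)),
]
kept = keep_important(blocks, WEIGHTS, BIAS)
```

With these illustrative values only the first block survives the filter, so a later labeling stage would see fewer, more relevant blocks.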

Wu Qin, Hu Lijuan, Liang Jiuzhen*. Web information extraction based on block importance model and 2D conditional random fields[J]. Journal of Nanjing University (Natural Sciences), 2014, 50(1): 79–.

