Journal of Nanjing University (Natural Sciences) ›› 2014, Vol. 50 ›› Issue (1): 79.
Wu Qin (吴秦), Hu Lijuan (胡丽娟), Liang Jiuzhen (梁久祯)*
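Abstract: Web page segmentation reduces the unit of Web information extraction from the whole page to the individual block. Combining the strengths of the block importance model and two-dimensional conditional random fields (2D CRFs), this paper proposes a method for extracting Web object information. The method uses the block importance model to label each page block with an importance level, which filters out a large amount of topic-irrelevant information and locates the information to be extracted more accurately. Compared with the traditional linear-chain CRF model, the 2D CRF model is better suited to the two-dimensional structure of page blocks and effectively improves extraction accuracy. Experimental results show that the method performs well on Web object information extraction.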
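The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Block` type, the integer importance scale, the `min_importance` threshold, and the grid coordinates are all hypothetical, and the importance labels are supplied by hand rather than predicted by a trained block importance model. The sketch only shows how low-importance blocks might be filtered out and how the remaining blocks form the horizontal and vertical neighbour pairs that a two-dimensional model would consider.

```python
from dataclasses import dataclass


@dataclass
class Block:
    row: int         # hypothetical grid row of the block on the page
    col: int         # hypothetical grid column of the block on the page
    text: str
    importance: int  # hypothetical label: higher means more relevant to the topic


def filter_blocks(blocks, min_importance=3):
    """Keep only blocks whose importance label reaches the (assumed) threshold."""
    return [b for b in blocks if b.importance >= min_importance]


def grid_edges(blocks):
    """Return right-neighbour and lower-neighbour pairs among the kept blocks.

    These are the two kinds of adjacency a 2D grid model looks at,
    in contrast to the single left-to-right chain of a linear model.
    """
    positions = {(b.row, b.col) for b in blocks}
    edges = []
    for r, c in sorted(positions):
        if (r, c + 1) in positions:
            edges.append(((r, c), (r, c + 1)))  # horizontal edge
        if (r + 1, c) in positions:
            edges.append(((r, c), (r + 1, c)))  # vertical edge
    return edges


blocks = [
    Block(0, 0, "navigation bar", 1),
    Block(0, 1, "product title", 4),
    Block(0, 2, "product image", 3),
    Block(1, 0, "advertisement", 1),
    Block(1, 1, "price", 3),
]
kept = filter_blocks(blocks)
edges = grid_edges(kept)
```

In a full system, the labelling of the kept blocks would be done jointly over these edges by a trained 2D CRF; the sketch stops at building the neighbourhood structure.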