Journal of Nanjing University (Natural Sciences) ›› 2010, Vol. 46 ›› Issue (5): 542–551.


 Clustering and visualization of web search results*

 Zhao Hua-Jun, Zhong Cai-Ming, Li Wen, Wang Rui-Zhi, Miao Duo-Qian**

  • Online: 2015-04-02 Published: 2015-04-02
  • About author: (Key Laboratory of Embedded System and Service Computing, Ministry of Education of China, Department of Computer Science and Technology, Tongji University, Shanghai, 201804, China)
  • Funding: National Natural Science Foundation of China (60475019, 60970061), Specialized Research Fund for the Doctoral Program (20060247039)

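Abstract (Chinese version):  Search engines have become the most commonly used tools for information retrieval on the Internet. Mainstream search engines return results ranked by relevance to the user's query, and because natural language exhibits both synonymy and polysemy, users find it hard to express their intent clearly and often spend a long time picking the topics they care about from the result list. To address this, web page clustering is applied to the titles and snippets of the results, which are then presented visually as trees and graphs so that users get a fast, complete, and intuitive view of the search results, markedly improving the search experience. On this basis a prototype web page clustering system, ECE (effective clustering engine), was designed. Experimental results show that the algorithm produces readable clusters with fairly high clustering accuracy.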

Abstract:  Search engines are now the most common tools for information retrieval on the Internet. However, limitations such as low search coverage and the dynamic nature of web pages explain why users' search experience has seen no breakthrough in recent years. Leading search engines return a long list of records sorted by relevance to the query, and synonymy and polysemy make it hard for users to express their intent, so they spend much time selecting the web pages that interest them. This paper aims to improve the search experience with data analysis techniques: web search results are clustered and visualized, and the clusters are grouped according to certain criteria so that users can quickly locate the information they want. Suffix-tree data structures are widely used in string processing and text compression. A clustering algorithm based on the suffix tree readily identifies the phrases shared among web pages and can therefore be used to cluster them. It improves clustering efficiency because it does not compute similarities between pairs of documents, it assigns meaningful labels to the clusters to enhance readability, and, combined with visualization, it improves end users' search experience. A prototype system named effective clustering engine (ECE) has been built on this approach. Experiments indicate that the algorithm is efficient and that the clustering results are readable and accurate.
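
The shared-phrase idea in the abstract can be illustrated with a minimal Python sketch. It is not the ECE implementation, and for brevity it uses a dictionary of word n-grams in place of a true suffix tree. The function name `base_clusters`, the parameters `max_len` and `min_docs`, the scoring rule, and the sample snippets are all assumptions made for illustration. The point it shows is that snippets are grouped by the phrases they share, so no pairwise document similarity is ever computed, and each group's label is the shared phrase itself.

```python
from collections import defaultdict


def base_clusters(snippets, max_len=3, min_docs=2):
    """Group snippets by shared word phrases (a stand-in for suffix-tree nodes)."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        # Each word n-gram (up to max_len words) plays the role of a suffix-tree path.
        for start in range(len(words)):
            last = min(start + max_len, len(words))
            for end in range(start + 1, last + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)

    clusters = []
    for phrase, docs in phrase_docs.items():
        if len(docs) >= min_docs:
            # Hypothetical score: longer phrases covering more snippets rank higher.
            score = len(docs) * len(phrase)
            clusters.append((score, " ".join(phrase), sorted(docs)))
    # Highest score first; ties broken alphabetically by label.
    clusters.sort(key=lambda c: (-c[0], c[1]))
    return clusters


# Made-up snippets for an ambiguous query.
snippets = [
    "jaguar car dealer",
    "jaguar car review",
    "jaguar big cat",
    "big cat habitat",
]
top = base_clusters(snippets)
for score, label, docs in top[:3]:
    print(score, label, docs)
```

On these snippets the two top-ranked groups are labelled "big cat" and "jaguar car", which separate the two senses of the query. A real suffix tree would find the same shared phrases without the `max_len` cap and in time linear in the input length.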
