Journal of Nanjing University (Natural Sciences) ›› 2018, Vol. 54 ›› Issue (6): 1132–1140. doi: 10.13232/j.cnki.jnju.2018.06.009


Research on session segmentation of web search query logs based on naive Bayes

Sun Mei1*, Zhang Sen2, Nie Peiyao2,3, Nie Xiushan2

  1. School of Public Finance and Taxation, Shandong University of Finance and Economics, Ji'nan, 250014, China; 2. School of Computer Science and Technology, Shandong University of Finance and Economics, Ji'nan, 250014, China; 3. School of Information and Intelligence Engineering, Sanya University, Sanya, 572022, China
  • Accepted: 2018-08-06 Online: 2018-12-01 Published: 2018-12-01
  • Contact: Sun Mei, E-mail: llpwgh@163.com
  • Supported by:
    Humanities and Social Sciences Research Project of the Ministry of Education of China (15YJAZH042)

Abstract: With the rapid development of the Internet, web search query log analysis has become key to improving the performance of web search engines and to analyzing user search behavior, and session segmentation is an important step in that analysis. A session in a web search query log is a group of queries that a user issues with the same or a similar intent within a period of time; it is also the basic unit commonly used when processing query logs. The session segmentation methods in common use are based mainly on the time interval between queries: queries issued within a certain period of time are treated as one session. This approach is simple to implement, but its accuracy is low, so it cannot serve applications that require highly accurate session segmentation. We therefore propose a new method: session segmentation of web search query logs based on naive Bayes. The method recasts session segmentation as the problem of deciding whether a query is a session boundary. We analyze three important factors that affect session segmentation: the time interval between queries, the semantics of the queries, and the words added or removed between adjacent queries. To compute the semantic similarity of queries, we adopt the word-vector representation used in deep learning and propose the Query2Vector model, which represents each query as a vector so that similarities between queries can be calculated. We then apply the naive Bayes method to classify whether each query is a session boundary. Finally, a series of experiments verifies the effectiveness of the method; the results show that it is more precise and reliable than other commonly used methods.

Key words: web search, session segmentation, naive Bayes, time interval, query item semantics

CLC number:

  • TP391
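
The boundary-classification idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 300-second gap cut-off, the 0.3 similarity threshold, the feature names, the `BoundaryNB` class and the toy training pairs are all assumptions made for illustration, and token-set Jaccard overlap stands in for the Query2Vector similarity, whose details are not given on this page. Each pair of adjacent queries is mapped to three discrete features (time gap, similarity, word change), and a categorical naive Bayes classifier with Laplace smoothing decides whether the second query starts a new session.

```python
import math
from collections import Counter, defaultdict


def features(prev, cur):
    """Map two adjacent (timestamp_seconds, query) records to three discrete features."""
    (t_prev, q_prev), (t_cur, q_cur) = prev, cur
    a, b = set(q_prev.split()), set(q_cur.split())
    # Assumption: a 5-minute cut-off separates "short" from "long" gaps.
    gap_bin = "short" if t_cur - t_prev < 300 else "long"
    # Assumption: Jaccard overlap is a stand-in for vector-based semantic similarity.
    jaccard = len(a & b) / len(a | b) if a | b else 0.0
    sim_bin = "similar" if jaccard >= 0.3 else "dissimilar"
    if a == b:
        change = "same"
    elif a < b:
        change = "added"      # the user appended words to the previous query
    elif b < a:
        change = "removed"    # the user dropped words from the previous query
    else:
        change = "other"
    return (gap_bin, sim_bin, change)


class BoundaryNB:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_count = Counter(y)
        self.n = len(y)
        self.counts = defaultdict(Counter)  # (class, feature index) -> value counts
        self.values = defaultdict(set)      # feature index -> values seen in training
        for x, label in zip(X, y):
            for i, v in enumerate(x):
                self.counts[(label, i)][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, x):
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = math.log(self.class_count[c] / self.n)  # log prior
            for i, v in enumerate(x):
                num = self.counts[(c, i)][v] + 1
                den = self.class_count[c] + len(self.values[i])
                lp += math.log(num / den)                # log likelihood of feature i
            if lp > best_lp:
                best, best_lp = c, lp
        return best


def segment(log, model):
    """Split a time-ordered query log into sessions at predicted boundaries."""
    sessions = [[log[0]]]
    for prev, cur in zip(log, log[1:]):
        if model.predict(features(prev, cur)):
            sessions.append([cur])
        else:
            sessions[-1].append(cur)
    return sessions


# Toy labelled pairs (hypothetical data): True means the second query opens a new session.
train = [
    ((0, "python tutorial"), (40, "python tutorial pdf"), False),
    ((0, "cheap flights"), (90, "cheap flights paris"), False),
    ((0, "naive bayes"), (60, "bayes"), False),
    ((0, "weather nanjing"), (4000, "football scores"), True),
    ((0, "hotel sanya"), (2000, "java stream api"), True),
    ((0, "car rental"), (700, "pasta recipe"), True),
]
model = BoundaryNB().fit([features(p, c) for p, c, _ in train], [y for _, _, y in train])

log = [
    (0, "session segmentation"),
    (30, "session segmentation naive bayes"),
    (5000, "train tickets jinan"),
    (5050, "train tickets jinan sanya"),
]
sessions = segment(log, model)
print([len(s) for s in sessions])  # [2, 2]
```

A pure time-interval baseline corresponds to using only the first feature; the sketch shows how the other two signals enter the same decision as additional conditionally independent terms.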