Journal of Nanjing University (Natural Science), 2018, Vol. 54, Issue (6): 1132-1140. doi: 10.13232/j.cnki.jnju.2018.06.009
Sun Mei1*, Zhang Sen2, Nie Peiyao2,3, Nie Xiushan2
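Abstract: With the rapid development of the Internet, web query log analysis has become key to improving search engine performance and understanding user search behavior, and session segmentation is an important step in that analysis. The most common approach segments sessions by the time interval between queries: queries issued within a given period are treated as one session. This approach is simple to implement but not very accurate, so it falls short in applications that demand precise session segmentation. We therefore propose a new session segmentation method for web query logs based on naive Bayes. The method recasts session segmentation as deciding whether each query is a session boundary. It analyzes three important factors that affect segmentation: the time interval between queries, the semantics of the queries, and the words added or removed between adjacent queries. A naive Bayes classifier then labels each query as a session boundary or not. Finally, experiments verify the effectiveness of the method.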
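The boundary-classification idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the feature definitions, so everything below is an assumption made for illustration. The time thresholds (300 s and 1800 s), the use of word-overlap (Jaccard) as a stand-in for the semantic feature, the whitespace tokenization, the hand-made training rows, and the names `pair_features`, `TinyNaiveBayes`, and `segment` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def pair_features(prev, curr):
    """Map two adjacent (timestamp_seconds, query) records to 3 discrete features."""
    (t0, q0), (t1, q1) = prev, curr
    gap = t1 - t0
    # Assumed thresholds: 5 minutes and 30 minutes.
    gap_bucket = "short" if gap < 300 else ("medium" if gap < 1800 else "long")
    w0, w1 = set(q0.split()), set(q1.split())  # assumes pre-tokenized queries
    union = w0 | w1
    jaccard = len(w0 & w1) / len(union) if union else 0.0
    # Word overlap is only a placeholder for a real semantic similarity.
    if jaccard >= 0.5:
        sim_bucket = "high"
    elif jaccard > 0:
        sim_bucket = "mid"
    else:
        sim_bucket = "low"
    if w0 < w1:        # strict subset: words were added
        change = "add"
    elif w1 < w0:      # strict superset: words were removed
        change = "remove"
    else:
        change = "other"
    return (gap_bucket, sim_bucket, change)

class TinyNaiveBayes:
    """Categorical naive Bayes with Laplace smoothing."""

    def fit(self, X, y, alpha=1.0):
        self.alpha = alpha
        self.n = len(y)
        self.class_counts = Counter(y)
        self.counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.values = defaultdict(set)      # feature index -> values seen
        for x, label in zip(X, y):
            for i, v in enumerate(x):
                self.counts[(i, label)][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, x):
        scores = {}
        for c, nc in self.class_counts.items():
            s = math.log(nc / self.n)  # log prior
            for i, v in enumerate(x):
                k = len(self.values[i] | {v})
                s += math.log((self.counts[(i, c)][v] + self.alpha) / (nc + self.alpha * k))
            scores[c] = s
        return max(scores, key=scores.get)

def segment(log, model):
    """Split a time-ordered query log into sessions at predicted boundaries."""
    if not log:
        return []
    sessions = [[log[0]]]
    for prev, curr in zip(log, log[1:]):
        if model.predict(pair_features(prev, curr)):
            sessions.append([curr])      # curr starts a new session
        else:
            sessions[-1].append(curr)
    return sessions

# Hand-made toy training rows (label True = session boundary); not real data.
train = [
    (("short", "high", "add"), False),
    (("short", "high", "remove"), False),
    (("short", "mid", "other"), False),
    (("medium", "high", "add"), False),
    (("long", "low", "other"), True),
    (("medium", "low", "other"), True),
    (("long", "low", "other"), True),
    (("short", "low", "other"), True),
]
model = TinyNaiveBayes().fit([x for x, _ in train], [y for _, y in train])

log = [
    (0, "naive bayes"),
    (60, "naive bayes classifier"),
    (4060, "flight tickets"),
    (4100, "cheap flight tickets"),
]
sessions = segment(log, model)
print([[q for _, q in s] for s in sessions])
```

On this toy log the classifier places a single boundary before "flight tickets", where the gap is long and the queries share no words. Combining the three features lets a short gap with no word overlap still count as a boundary, which a pure time threshold cannot do.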