Journal of Nanjing University (Natural Sciences) ›› 2017, Vol. 53 ›› Issue (6): 1072.
公冶小燕1,3,林培光1,2*,任威隆1,张 晨1,张春云1
Gongye Xiaoyan1,3, Lin Peiguang1,2*, Ren Weilong1, Zhang Chen1, Zhang Chunyun1
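Abstract: Extracting the topic of a piece of information is a basic task for quickly locating user needs. Topic-word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurring word pairs and combines this nonlinearly with word frequency, part of speech, and word position. It then builds a document-by-co-occurring-word matrix from the word weights and constructs a Latent Semantic Analysis (LSA) model. Using the Singular Value Decomposition (SVD) of the LSA model, the method maps the document-by-co-occurring-word matrix into a latent semantic space, which both reduces the dimensionality of the data and yields a low-dimensional document similarity matrix. Finally, k-means clustering is applied to the document similarity matrix, and within each cluster the few co-occurring word pairs with the largest weights are selected as the topic words for that class of documents. Compared with experiments that extract topic words based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurring words, the algorithm improves accuracy by 19% and 10%, respectively.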
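The first step of the pipeline, selecting co-occurring word pairs by mutual information, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes standard pointwise mutual information (PMI) estimated from document-level co-occurrence. The function name `cooccurring_pairs`, the `min_pmi` threshold, and the toy corpus are hypothetical and not taken from the paper. The later steps (nonlinear weighting, SVD, k-means) are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurring_pairs(docs, min_pmi=0.0):
    """Return {(w1, w2): pmi} for word pairs whose document-level PMI exceeds min_pmi.

    Assumption: two words "co-occur" when they appear in the same document.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for tokens in docs:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    pairs = {}
    for (w1, w2), n12 in pair_df.items():
        # PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities
        # estimated as fractions of documents.
        pmi = math.log2(n12 * n_docs / (word_df[w1] * word_df[w2]))
        if pmi > min_pmi:
            pairs[(w1, w2)] = pmi
    return pairs


# Toy corpus (hypothetical): each document is a list of already-segmented words.
docs = [
    ["topic", "model", "text"],
    ["topic", "model", "cluster"],
    ["image", "pixel", "cluster"],
    ["image", "pixel", "text"],
]
pairs = cooccurring_pairs(docs)
for pair, score in sorted(pairs.items()):
    print(pair, round(score, 3))
```

In this toy corpus only the pairs that appear together more often than chance would predict pass the threshold. In the method described above, such pairs would then be weighted together with word frequency, part of speech, and position before the document-by-co-occurring-word matrix is built.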