
[1] Gongye Xiaoyan, Lin Peiguang*, Ren Weilong, et al. A method of extracting subject words based on improved TF-IDF algorithm and co-occurrence words[J]. Journal of Nanjing University (Natural Sciences), 2017, 53(6): 1072. [doi:10.13232/j.cnki.jnju.2017.06.009]

A method of extracting subject words based on improved TF-IDF algorithm and co-occurrence words

Journal of Nanjing University (Natural Sciences) [ISSN: 0469-5097 / CN: 32-1169/N]

Volume:
53
Issue:
2017, No. 6
Page:
1072
Column:
Publication date:
2017-12-01

Article Info

Title:
A method of extracting subject words based on improved TF-IDF algorithm and co-occurrence words
Author(s):
Gongye Xiaoyan(1,3); Lin Peiguang(1,2)*; Ren Weilong(1); Zhang Chen(1); Zhang Chunyun(1)
1. School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, 250014, China;
2. School of Computer Science and Technology, Shandong University, Ji’nan, 250002, China;
3. School of Software Engineering, Qufu Normal University, Qufu, 273165, China
Keywords:
co-occurrence words; mutual information; Latent Semantic Analysis (LSA); Singular Value Decomposition (SVD); Term Frequency-Inverse Document Frequency (TF-IDF)
Classification number:
TP311
DOI:
10.13232/j.cnki.jnju.2017.06.009
Document code:
A
Abstract:
Extracting the topics of information is a basic task for quickly locating user needs. Subject-word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurrence word pairs and combines it nonlinearly with word frequency, part of speech, and word position. It then builds a document-co-occurrence-word matrix from these weights and fits a Latent Semantic Analysis (LSA) model. Using the Singular Value Decomposition (SVD) of the LSA model, the document-co-occurrence-word matrix is mapped into the latent semantic space, which both reduces the data dimensionality and yields a low-dimensional document similarity matrix. Finally, the document similarity matrix is clustered with k-means, and within each cluster the few co-occurrence word pairs with the largest word weights are selected as the subject words of that class of articles. Compared with subject-word extraction based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurrence words, the proposed algorithm improves accuracy by 19% and 10%, respectively.
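
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the weighting formula, so the sketch uses document-level pointwise mutual information alone as the pair weight, leaving out word frequency, part of speech, and position. It also clusters the SVD embeddings directly rather than a document similarity matrix, and it uses a tiny k-means seeded with the first k documents. All function names, the threshold, and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np


def cooccurrence_pmi(docs):
    """Document-level pointwise mutual information for each co-occurring word pair."""
    n_docs = len(docs)
    word_df, pair_df = Counter(), Counter()
    for doc in docs:
        vocab = sorted(set(doc))
        word_df.update(vocab)                   # documents containing each word
        pair_df.update(combinations(vocab, 2))  # documents containing each pair
    return {
        (a, b): math.log(n_ab * n_docs / (word_df[a] * word_df[b]))
        for (a, b), n_ab in pair_df.items()
    }


def build_matrix(docs, pmi, threshold=0.0):
    """Document x pair matrix. Assumption: weight = PMI when the pair is present."""
    pairs = sorted(p for p, score in pmi.items() if score > threshold)
    matrix = np.zeros((len(docs), len(pairs)))
    for i, doc in enumerate(docs):
        words = set(doc)
        for j, (a, b) in enumerate(pairs):
            if a in words and b in words:
                matrix[i, j] = pmi[(a, b)]
    return pairs, matrix


def lsa_embed(matrix, rank):
    """Project documents onto the top `rank` singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]


def kmeans(points, k, iters=10):
    """Plain Lloyd iterations. Assumption: the first k points seed the centers."""
    centers = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


def top_pairs(pairs, matrix, labels, n=1):
    """For each cluster, the n pairs with the largest summed weight."""
    topics = {}
    for c in sorted(set(labels.tolist())):
        totals = matrix[labels == c].sum(axis=0)
        topics[c] = [pairs[j] for j in np.argsort(-totals)[:n]]
    return topics


# Toy corpus of already-tokenized documents (hypothetical data).
docs = [
    ["stock", "market", "price"],
    ["goal", "match", "team"],
    ["stock", "market", "bank"],
    ["goal", "match", "coach"],
]
pmi = cooccurrence_pmi(docs)
pairs, matrix = build_matrix(docs, pmi)
labels = kmeans(lsa_embed(matrix, rank=2), k=2)
topics = top_pairs(pairs, matrix, labels)
print(labels.tolist(), topics)
```

On this toy corpus the 4 x 10 document-pair matrix is reduced to two latent dimensions before clustering. A real corpus would need word segmentation, a tuned PMI threshold, and a more careful k-means initialization.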


Memo:
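Funding: Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042); Key Project of Teaching Reform Research for Undergraduate Universities in Shandong Province (2015Z058)
Received: 2017-09-28
*Corresponding author, E-mail: linpg@sdufe.edu.cn
Last Update: 2017-11-27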