Journal of Nanjing University (Natural Sciences) ›› 2017, Vol. 53 ›› Issue (6): 1072–.


A method of extracting subject words based on improved TF-IDF algorithm and co-occurrence words

Gongye Xiaoyan1,3, Lin Peiguang1,2*, Ren Weilong1, Zhang Chen1, Zhang Chunyun1

  • Online: 2017-11-27 Published: 2017-11-27
  • Author affiliations: 1. School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, 250014, China;
    2. School of Computer Science and Technology, Shandong University, Ji’nan, 250002, China;
    3. School of Software Engineering, Qufu Normal University, Qufu, 273165, China
  • Funding:
    Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042); Key Project of Teaching Reform Research for Undergraduate Colleges and Universities of Shandong Province (2015Z058)
    Received: 2017-09-28
    *Corresponding author, E-mail: linpg@sdufe.edu.cn

Abstract: Extracting the topics of information is a basic task for quickly locating user needs. Subject-word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurring word pairs and combines this non-linearly with word frequency, part of speech, and word position. It then builds a document-by-co-occurrence-word matrix from the word weights and constructs a Latent Semantic Analysis (LSA) model. Through the Singular Value Decomposition (SVD) of the LSA model, the document-by-co-occurrence-word matrix is mapped into the latent semantic space, which both reduces the dimensionality of the data and yields a low-dimensional document similarity matrix. Finally, the document similarity matrix is clustered with k-means, and within each cluster the few co-occurring word pairs with the largest word weights are selected as the subject words for that class of documents. In experiments comparing against subject-word extraction based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurrence words, the algorithm improves accuracy by 19% and 10%, respectively.
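As an illustration of the first step only, identifying co-occurring word pairs with mutual information, the following is a minimal sketch. It assumes pointwise mutual information (PMI) estimated from document-level co-occurrence counts. The abstract does not give the exact formula or threshold, so the function name `cooccurring_pairs`, the `min_pmi` parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurring_pairs(docs, min_pmi=0.0):
    """Score word pairs by PMI over documents (an assumed variant of
    mutual information; not necessarily the paper's exact formula).

    docs: list of token lists, one per document (tokenization assumed done).
    Returns {(w1, w2): pmi} for pairs whose PMI exceeds min_pmi.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), n_both in pair_df.items():
        # PMI = log( p(w1, w2) / (p(w1) * p(w2)) ), with document fractions as probabilities
        pmi = math.log(n_both * n_docs / (word_df[w1] * word_df[w2]))
        if pmi > min_pmi:
            scores[(w1, w2)] = pmi
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    ["stock", "market", "bank"],
    ["stock", "market", "price"],
    ["football", "match", "goal"],
    ["football", "match", "bank"],
]
pairs = cooccurring_pairs(docs)
```

Here `pairs` keeps pairs such as `("market", "stock")` that appear together more often than independence would predict, and drops pairs such as `("bank", "market")` whose PMI is zero. In the method described in the abstract, such pair scores would then be combined with word frequency, part of speech, and position before the LSA and k-means stages.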
