1. School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, 250014, China; 2. School of Computer Science and Technology, Shandong University, Ji’nan, 250002, China; 3. School of Software Engineering, Qufu Normal University, Qufu, 273165, China
Published
2017-11-27
Issue Date
2017-11-27
Abstract
Extracting information topics is a fundamental task for quickly locating what users need. Keyword extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. For word weighting, mutual information is first used, together with a non-linear combination of word frequency, part of speech, and word position, to determine the co-occurrence word pairs. An LSA (Latent Semantic Analysis) model is then built from the reconstructed document-co-occurrence matrix. Applying SVD (Singular Value Decomposition) to the LSA model maps the document-word space to the latent semantic space, which both reduces the data dimensionality and yields a low-dimensional document similarity matrix. Finally, our approach clusters the document similarity matrix with k-means and selects the few co-occurrence words with the largest mutual information as the keywords of the article. Compared with subject-word extraction methods based on an improved TF-IDF (Term Frequency-Inverse Document Frequency) algorithm or on co-occurrence words, our approach improves accuracy by 19% and 10%, respectively.
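The final selection step, ranking co-occurring word pairs by mutual information, can be illustrated with a minimal sketch. It assumes pointwise mutual information (PMI) estimated from sentence-level co-occurrence counts; the abstract does not state which estimator or window the authors use, and the function name `top_cooccurrence_pairs` and its parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def top_cooccurrence_pairs(sentences, k=3):
    """Rank word pairs by pointwise mutual information (PMI).

    Assumption: `sentences` is a list of token lists, and two words
    co-occur when they appear in the same sentence. This is only an
    illustration, not the weighting scheme of the paper.
    """
    n = len(sentences)
    word_count = Counter()  # number of sentences containing each word
    pair_count = Counter()  # number of sentences containing both words
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))

    scored = []
    for (a, b), c_ab in pair_count.items():
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), probabilities as counts / n
        pmi = math.log2(c_ab * n / (word_count[a] * word_count[b]))
        scored.append(((a, b), pmi))

    # Highest PMI first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

Calling `top_cooccurrence_pairs(tokenized_sentences, k=5)` returns up to five `(pair, score)` tuples. In the pipeline described above, such a ranking would be applied within each k-means cluster after the LSA step, which this sketch leaves out.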
Gongye Xiaoyan1,3, Lin Peiguang1,2*, Ren Weilong1, Zhang Chen1, Zhang Chunyun1.
A method of extracting subject words based on improved TF-IDF algorithm and co-occurrence words[J]. Journal of Nanjing University(Natural Sciences), 2017, 53(6): 1072