A document clustering approach based on t-distributed stochastic neighbor embedding

Journal of Nanjing University (Natural Sciences), 2019, Vol. 55, Issue 2: 264–271. doi: 10.13232/j.cnki.jnju.2019.02.011

Xu Xiufang¹*, Xu Sen¹, Hua Xiaopeng¹, Xu Jing¹, Gao Jun¹,², An Jing³

1. School of Information Engineering, Yancheng Institute of Technology, Yancheng 224051, China; 2. Jiangsu Key Laboratory of Media Design and Software Technology (Jiangnan University), Wuxi 214122, China; 3. School of Mechanical Engineering, Yancheng Institute of Technology, Yancheng 224051, China

Accepted: 2018-12-10 Online: 2019-04-01 Published: 2019-03-31
Contact: Xu Xiufang E-mail: xxf@ycit.cn
Funding:
National Natural Science Foundation of China (61105057, 61375001), Natural Science Foundation of Jiangsu Province (BK20151299), Jiangsu Province "333 Project", Natural Science Research Project of Jiangsu Higher Education Institutions (18KJB520050), Open Project of the Jiangsu Key Laboratory of Media Design and Software Technology (Jiangnan University) (18ST0201), Jiangsu Universities "Qing Lan Project"

Abstract

Document data are typically high-dimensional, sparse, and massive, which poses a great challenge to traditional clustering algorithms. This paper proposes a document clustering approach based on t-distributed stochastic neighbor embedding (t-SNE). First, the high-dimensional document vectors are embedded into a low-dimensional space via t-SNE, so that documents with low similarity in the high-dimensional space are mapped to points that lie far apart, while documents with high similarity are mapped to points that lie close together. Then a traditional clustering algorithm is applied to the coordinates of the mapped points in the low-dimensional space to obtain the final clustering result. Experiments on several benchmark document datasets verify the effectiveness of the proposed approach.

Keywords: cluster analysis, document clustering, dimension reduction, stochastic neighbor embedding, clustering algorithm
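The two stages described in the abstract can be illustrated with a small standard-library sketch. It is not the authors' implementation: the four toy term-count vectors, the two hand-picked 2-D layouts, and all function names are made up for illustration, and a single fixed Gaussian bandwidth `sigma` is assumed in place of the per-point perplexity calibration that t-SNE normally performs. The sketch computes the cost t-SNE minimizes (the KL divergence between high-dimensional affinities P and low-dimensional Student-t affinities Q), shows that a layout keeping similar documents close scores lower than one that separates them, and then runs a plain k-means on the low-dimensional coordinates.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_affinities(X, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij over the original vectors.

    Assumption: one fixed bandwidth `sigma` for every point; t-SNE proper
    tunes a per-point bandwidth to match a target perplexity.
    """
    n = len(X)
    cond = []
    for i in range(n):
        w = [0.0 if i == j else math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])  # conditional p_{j|i}
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) affinities q_ij over map points."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def kl_divergence(P, Q):
    """KL(P || Q), the cost that t-SNE minimizes over the map points."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0)

def kmeans(points, centers, iters=10):
    """Plain Lloyd iterations starting from caller-supplied centers."""
    centers = list(centers)
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Hypothetical term-count vectors: documents 0-1 share one vocabulary, 2-3 another.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
P = high_dim_affinities(docs)

# Two hand-picked candidate layouts (not produced by an optimizer).
faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]   # similar docs adjacent
scrambled = [(0.0, 0.0), (5.0, 0.0), (0.1, 0.0), (5.1, 0.0)]  # similar docs far apart

cost_faithful = kl_divergence(P, low_dim_affinities(faithful))
cost_scrambled = kl_divergence(P, low_dim_affinities(scrambled))

# Second stage: cluster the low-dimensional coordinates.
labels = kmeans(faithful, [faithful[0], faithful[-1]])
print(cost_faithful < cost_scrambled, labels)
```

In practice the layout would come from gradient descent on this cost rather than being chosen by hand; a library implementation such as scikit-learn's `TSNE` followed by `KMeans` is one plausible way to run the full pipeline on real document vectors.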

CLC number: TP391

Xu Xiufang, Xu Sen, Hua Xiaopeng, Xu Jing, Gao Jun, An Jing. A document clustering approach based on t-distributed stochastic neighbor embedding[J]. Journal of Nanjing University (Natural Sciences), 2019, 55(2): 264–271.

