Journal of Nanjing University (Natural Science), 2019, Vol. 55, Issue (2): 264-271. doi: 10.13232/j.cnki.jnju.2019.02.011
Xu Xiufang1*, Xu Sen1, Hua Xiaopeng1, Xu Jing1, Gao Jun1,2, An Jing3
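Abstract: Text data are high-dimensional, sparse, and massive in volume, which poses great challenges to traditional clustering algorithms. This paper proposes a text clustering method based on t-Distributed Stochastic Neighbor Embedding (t-SNE). First, t-SNE embeds the high-dimensional text data into a low-dimensional space, so that texts with low similarity in the high-dimensional space are mapped to points far apart, while texts with high similarity are mapped to points close together. Then a traditional clustering algorithm is applied to the coordinates of the mapped points in the low-dimensional space to obtain the final clustering result. Experiments on several benchmark text datasets verify the effectiveness of the method.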
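The two-stage pipeline described in the abstract (embed with t-SNE, then cluster the low-dimensional points) can be sketched in plain Python as follows. This is only an illustrative sketch under several assumptions that the abstract does not confirm: documents are represented as TF-IDF vectors, k-means stands in for the "traditional clustering algorithm", and t-SNE is simplified to a fixed Gaussian bandwidth `sigma` with plain gradient descent (no perplexity search, momentum, or early exaggeration). The function names `tfidf_vectors`, `tsne`, and `kmeans` and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into L2-normalized TF-IDF rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * (math.log(n / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tsne(rows, dims=2, sigma=0.5, steps=300, lr=1.0, seed=0):
    """Simplified t-SNE (assumption: fixed bandwidth, plain gradient descent)."""
    rng = random.Random(seed)
    n = len(rows)
    # Conditional similarities p(j|i) from a Gaussian kernel in the input space.
    cond = []
    for i in range(n):
        w = [0.0 if i == j else
             math.exp(-sq_dist(rows[i], rows[j]) / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([x / total for x in w])
    # Symmetrized joint probabilities p_ij.
    p = [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
         for i in range(n)]
    # Small random initial layout in the low-dimensional space.
    y = [[rng.gauss(0, 1e-2) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        # Student-t kernel (one degree of freedom) between mapped points.
        kern = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(y[i], y[j]))
                 for j in range(n)] for i in range(n)]
        z = sum(map(sum, kern))
        grad = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # Gradient of KL(P || Q): 4 * (p_ij - q_ij) * kern_ij * (y_i - y_j).
                coeff = 4.0 * (p[i][j] - kern[i][j] / z) * kern[i][j]
                for d in range(dims):
                    grad[i][d] += coeff * (y[i][d] - y[j][d])
        y = [[y[i][d] - lr * grad[i][d] for d in range(dims)]
             for i in range(n)]
    return y


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's k-means on the embedded points; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(pt, centers[c]))
                  for pt in points]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus with two topics (finance and football).
docs = [
    "stock market shares fall",
    "market investors sell shares",
    "stock investors buy shares",
    "football team wins match",
    "team scores in football match",
    "coach praises football team",
]
embedding = tsne(tfidf_vectors(docs))  # stage 1: embed into 2-D
labels = kmeans(embedding, k=2)        # stage 2: cluster the mapped points
print(labels)
```

Because the clustering step only sees the low-dimensional coordinates, any other conventional algorithm could in principle replace `kmeans` here. For real corpora, a library implementation of t-SNE with perplexity calibration would likely be preferable to this quadratic-time sketch.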