一种基于抽样的谱聚类集成算法

南京大学学报(自然科学版) ›› 2016, Vol. 52 ›› Issue (6): 1090–.

一种基于抽样的谱聚类集成算法

孟　娜1，梁吉业1，2*，庞天杰1

出版日期:2016-11-21 发布日期:2016-11-21
作者简介: 1.太原师范学院计算机科学与技术系，晋中，030619；
2.山西大学计算智能与中文信息处理教育部重点实验室，太原，030006
基金资助:
基金项目：国家自然科学基金(61273294)，山西省回国留学人员科研项目(2013－101)
收稿日期：2016－10－12
*通讯联系人，Email：ljy@sxu.edu.cn

An ensemble algorithm of spectral clustering based on sampling

Meng Na1，Liang Jiye1，2*，Pang Tianjie1

Online:2016-11-21 Published:2016-11-21
About author: 1.Department of Computer Science and Technology，Taiyuan Normal University，Jinzhong，030619，China；
2.Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education，
Shanxi University，Taiyuan，030006，China

摘要/Abstract

摘要： 谱聚类是利用样本数据集的相似性矩阵中特征向量的性质对样本数据集进行聚类．而随着数据规模的增加，谱聚类算法所耗时间会因为大规模的特征分解而明显增大．采用抽样方法可以有效降低算法所耗时间，但是简单随机抽样子集之间关联性太弱，通常无法准确反映数据集的分布特征．基于此，设计了一种新的抽样策略，利用该方法进行多次抽样，生成多个既具有关联性又具有差异性的数据子集．在每个数据子集上分别利用NJW算法(由Ng A Y、Jordom M I和Weiss Y提出)进行谱聚类，并根据最近邻原则将聚类结果映射到全体数据集，生成若干基聚类，最后，将聚类结果集成，得到最终的聚类划分．实验证明，该方法与传统NJW算法以及简单抽样集成算法相比，算法的效率及有效性有了一定的提高.

Abstract: Spectral clustering algorithm is an important one among clustering algorithms，and it uses the feature vectors of the similarity matrix calculated from the sample data set to cluster the sample data.However，the computational complexity and time consumption will increase markedly because of the largescale calculation of eigendecomposition when the spectral clustering is applied to large scale data sets.The use of sampling methods can effectively reduce the time consumed by the spectral clustering algorithm，but the relationship between the data subsets extracted by simple randomly sampling is too weak，which usually cannot reflect the distribution characteristics of the sample data sets accurately.Based on this and aimed at the computing characteristics of spectral clustering algorithm，a new sampling strategy different from the simple random sampling is designed and multiply used to generate multiple data subsets that can reflect the distribution characteristics of the sample data sets more accurately because of their coexisting relevance and otherness.Then each data subset is spectral clustered by NJW algorithm(the most classical spectral clustering algorithm，proposed by Ng A Y，Jordom M I and Weiss Y)and every clustering results can be mapped to the whole sample data set according to the nearest neighbor principle，generating a number of component clusters which both have relevance and otherness.Finally，the clustering results of the whole sample data set are integrated to get the final unified clustering partition.Experimental results show that applying the proposed sampling method to the spectral clustering algorithm is effective compared with the traditional NJW algorithm and efficient compared with the ensemble algorithm of spectral clustering based on simple sampling.

孟　娜1，梁吉业1，2*，庞天杰1. 一种基于抽样的谱聚类集成算法
[J]. 南京大学学报(自然科学版), 2016, 52(6): 1090–.

Meng Na1，Liang Jiye1，2*，Pang Tianjie1. An ensemble algorithm of spectral clustering based on sampling
[J]. Journal of Nanjing University(Natural Sciences), 2016, 52(6): 1090–.

参考文献

[1]　金　冉．面向大规模数据的聚类算法研究及应用．博士学位论文．上海：东华大学，2015.(Jin R．Research and application of clustering algorithm for large scale data.Ph.D.Dissertation.Shanghai：Donghua University，2015.)
[2] 蔡晓妍，戴冠中，杨黎斌．谱聚类算法综述．计算机科学，2008，35(7)：14－18.(Cai X Y，Dai G Z，Yang L B．Summary of spectral clustering algorithms.Computer Science，2008，35(7)：14－18.)
[3] Fiedler M．Algebraic connectivity of graphs.Czechoslovak Mathematical Journal，1973，23(98)：298－305.
[4] Fowlkes C，Belongie S，Chung F，et al.Spectral grouping using the Nyström method.IEEE Transactions on Pattern Analysis and Machine Intelligence，2004，26(2)：214－225.
[5] Shinnou H，Sasaki M．Spectral clustering for a large data set by reducing the similarity matrix size.In：Proceedings of the 6th International Conference on Language Resources and Evaluation.Marrakech，Morocco，2008，24(1)：201－204.
[6] Sakai T，Imiya A．Fast spectral clustering with random projection and sampling.In：Proceedings of the 6th International Conference on Machine Learning and Data Mining.Berlin：Springer Press，2009：372－384.
[7] 李　想．大规模数据聚类中的权重抽样方法研究．硕士学位论文．广州：华南理工大学，2014.(Li X．Research on weight sampling method in large scale data clustering.Master Dissertation.South China University of Technology，2014.)
[8] 韩　蕾，孙徐湛，吴志川等．MapReduce上基于抽样的数据划分最优化研究．计算机研究与发展，2013，S2：77－84.(Han L，Sun X Z，Wu Z C，et al.Research on the optimization of data partitioning based on sampling in MapReduce.Computer Research and Development，2013，S2：77－84.)
[9] 黄发良，黄名选，元昌安等．网络重叠社区发现的谱聚类集成算法．控制与决策，2014，4：713－718.(Huang F L，Huang M X，Yuan C A，et al.Spectral clustering ensemble algorithm for network overlapping community discovery.Control and Decision，2014，4：713－718.)
[10] Shi J B，Malik J．Normalized cuts and image segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence，2000，22(8)：888－905.
[11] Wgner D，Wagner F．Between min cut and graph bisection.International Symposium on Mathematical Foundations of Computer Science.Berlin：Springer Press，1993，711：744－750.
[12] Lütkepohl H．Handbook of matrices.New York：John Wiley and Sons，1996，134.
[13] Ng A Y，Jordan M I，Weiss Y．On spectral clustering：Analysis and an algorithm.In：Proceedings of Advances in Neural Information Processing Systems.London：MIT Press，2002，14：849－856.
[14] Strehl A，Ghosh J．Cluster ensemblesA knowledge reuse framework for combining multiple partitions.Journal of Machine Learning Research，2002，3(3)：583－617.
[15] Gionis A，Mannila H，Tsaparas P．Clustering aggregation.ACM Transactions on Knowledge Discovery from Data，2007，1(1)：341－352.
[16] 罗会兰．聚类集成关键技术研究．博士学位论文．浙江：浙江大学，2007.(Luo H L．Research on key technology of clustering ensemble.Ph.D.Dissertation.Zhejiang：Zhejiang University，2007.)
[17] 罗会兰，孔繁胜，李一啸．聚类集成中的差异度量研究．计算机学报，2007，30(8)：1315－1324.(Luo H L，Kong F S，Li Y X．Research on the difference measurement of clustering ensemble.Computer Science，2007，30(8)：1315－1324.)
[18] Zhou Z H，Tang W．Clusterer ensemble.KnowledgeBased Systems，2006，19(1)：77－83.
[19] Liang J Y，Zhao X W，Li D Y，et al.Determining the number of clusters using information entropy for mixed data.Pattern Recognition，2012，45(6)：2251－2265.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed