南京大学学报(自然科学版) ›› 2016, Vol. 52 ›› Issue (6): 1090.
孟 娜1,梁吉业1,2*,庞天杰1
Meng Na1,Liang Jiye1,2*,Pang Tianjie1
摘要: 谱聚类是利用样本数据集的相似性矩阵中特征向量的性质对样本数据集进行聚类.而随着数据规模的增加,谱聚类算法所耗时间会因为大规模的特征分解而明显增大.采用抽样方法可以有效降低算法所耗时间,但是简单随机抽样子集之间关联性太弱,通常无法准确反映数据集的分布特征.基于此,设计了一种新的抽样策略,利用该方法进行多次抽样,生成多个既具有关联性又具有差异性的数据子集.在每个数据子集上分别利用NJW算法(由Ng A Y、Jordom M I和Weiss Y提出)进行谱聚类,并根据最近邻原则将聚类结果映射到全体数据集,生成若干基聚类,最后,将聚类结果集成,得到最终的聚类划分.实验证明,该方法与传统NJW算法以及简单抽样集成算法相比,算法的效率及有效性有了一定的提高.
[1] 金 冉.面向大规模数据的聚类算法研究及应用.博士学位论文.上海:东华大学,2015.(Jin R.Research and application of clustering algorithm for large scale data.Ph.D.Dissertation.Shanghai:Donghua University,2015.) [2] 蔡晓妍,戴冠中,杨黎斌.谱聚类算法综述.计算机科学,2008,35(7):14-18.(Cai X Y,Dai G Z,Yang L B.Summary of spectral clustering algorithms.Computer Science,2008,35(7):14-18.) [3] Fiedler M.Algebraic connectivity of graphs.Czechoslovak Mathematical Journal,1973,23(98):298-305. [4] Fowlkes C,Belongie S,Chung F,et al.Spectral grouping using the Nyström method.IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(2):214-225. [5] Shinnou H,Sasaki M.Spectral clustering for a large data set by reducing the similarity matrix size.In:Proceedings of the 6th International Conference on Language Resources and Evaluation.Marrakech,Morocco,2008,24(1):201-204. [6] Sakai T,Imiya A.Fast spectral clustering with random projection and sampling.In:Proceedings of the 6th International Conference on Machine Learning and Data Mining.Berlin:Springer Press,2009:372-384. [7] 李 想.大规模数据聚类中的权重抽样方法研究.硕士学位论文.广州:华南理工大学,2014.(Li X.Research on weight sampling method in large scale data clustering.Master Dissertation.South China University of Technology,2014.) [8] 韩 蕾,孙徐湛,吴志川等.MapReduce上基于抽样的数据划分最优化研究.计算机研究与发展,2013,S2:77-84.(Han L,Sun X Z,Wu Z C,et al.Research on the optimization of data partitioning based on sampling in MapReduce.Computer Research and Development,2013,S2:77-84.) [9] 黄发良,黄名选,元昌安等.网络重叠社区发现的谱聚类集成算法.控制与决策,2014,4:713-718.(Huang F L,Huang M X,Yuan C A,et al.Spectral clustering ensemble algorithm for network overlapping community discovery.Control and Decision,2014,4:713-718.) [10] Shi J B,Malik J.Normalized cuts and image segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905. [11] Wgner D,Wagner F.Between min cut and graph bisection.International Symposium on Mathematical Foundations of Computer Science.Berlin:Springer Press,1993,711:744-750. [12] Lütkepohl H.Handbook of matrices.New York:John Wiley and Sons,1996,134. [13] Ng A Y,Jordan M I,Weiss Y.On spectral clustering:Analysis and an algorithm.In:Proceedings of Advances in Neural Information Processing Systems.London:MIT Press,2002,14:849-856. [14] Strehl A,Ghosh J.Cluster ensemblesA knowledge reuse framework for combining multiple partitions.Journal of Machine Learning Research,2002,3(3):583-617. [15] Gionis A,Mannila H,Tsaparas P.Clustering aggregation.ACM Transactions on Knowledge Discovery from Data,2007,1(1):341-352. [16] 罗会兰.聚类集成关键技术研究.博士学位论文.浙江:浙江大学,2007.(Luo H L.Research on key technology of clustering ensemble.Ph.D.Dissertation.Zhejiang:Zhejiang University,2007.) [17] 罗会兰,孔繁胜,李一啸.聚类集成中的差异度量研究.计算机学报,2007,30(8):1315-1324.(Luo H L,Kong F S,Li Y X.Research on the difference measurement of clustering ensemble.Computer Science,2007,30(8):1315-1324.) [18] Zhou Z H,Tang W.Clusterer ensemble.KnowledgeBased Systems,2006,19(1):77-83. [19] Liang J Y,Zhao X W,Li D Y,et al.Determining the number of clusters using information entropy for mixed data.Pattern Recognition,2012,45(6):2251-2265. |
No related articles found! |
|