基于分位数半径的动态K－means算法

南京大学学报(自然科学版) ›› 2018, Vol. 54 ›› Issue (1): 48–.

基于分位数半径的动态K－means算法

程明畅1，刘友波2，张程嘉2，马铁丰1*

出版日期:2018-01-31 发布日期:2018-01-31
作者简介: 1.西南财经大学统计学院，成都，611130；2.四川大学电气信息学院，成都，610065
基金资助:
基金项目：国家自然科学基金(11471264，11401148，11571282，51437003)
收稿日期：2017－12－15
*通讯联系人，E－mail：matiefeng@swufe.edu.cn

Dynamic K－means algorithm based on quantile radius

Cheng Mingchang1，Liu Youbo2，Zhang Chengjia2，Ma Tiefeng1*

Online:2018-01-31 Published:2018-01-31
About author:1.School of Statistics，Southwestern University of Finance and Economics，Chengdu，611130，China；
2．School of Electrical Engineering and Information，Sichuan University，Chengdu，610065，China

摘要/Abstract

摘要： K－means算法是应用最广泛的聚类算法之一，但存在明显缺陷：对初始值敏感，还需给定类的数目．层次K－means算法提出将多次k取固定值的K－means运算所得到的中心点作为类的代表，并通过对这些中心点进行层次聚类来得到更好的初始聚类中心，然而在中心的融合过程中并没有有效利用类的几何信息．从类的几何特征入手，提出一种基于类的分位数半径的动态K－means算法(QRD K－means)．此算法在层次K－means的基础上令每次K－means运算的k值变动起来，且又引入了分位数半径的概念，用样本点到类中心距离的分位数作为类的半径，将样本点间的关系简化为各个类的分位数半径与类中心的关系．通过中心点间距离与分位数半径大小的比较对中心点进行融合形成新类，从而快速给出良好的聚类结果，同时也确定了类的数目．在仿真实验中，通过与不同算法在时间和分类精确度上的比较分析，也证明该方法快速有效．

Abstract: There are obvious deficiencies in the well－known K－means algorithm：one is that it is sensitive to initial values，and the other is that the number of clusters needs to be given artificially.Many related works have improved the performance of K－means，but most of them only focus on single aspect rather than both on speed and identification of the number of clusters.The Hierarchical K－means algorithm(H K－means) was proposed to generate a center set C by carrying out K－means repeatedly with a constant k．After that，it generates the initial cluster centers of K－means by executing Hierarchical clustering algorithm on the center set C．This paper focuses on the geometric features of clusters，proposing a dynamic K－means algorithm based on quantile radius(QRD K－means)．This algorithm is on the basis of Hierarchical K－means and repeats K－means with varying k values.After executing K－means multiple times，a series of cluster centers are obtained to be stored in set C．Moreover，QRD K－means additionally a new concept called quantile radius，which is a certain quantile of distance between the center and the sample points which belong to the corresponding cluster.Then it merges the centers in set C into a new cluster by comparing the pair－wise distance of centers with the sum of the corresponding quantile radiuses.In this way，a clustering result of centers can be obtained adaptively.At last，QRD K－means uses the centers of these new clusters as the initial centers of K－means to get the finial clustering result.The change in k value makes the center set C cover more areas than the previous Hierarchical K－means and can lead to more representative initial cluster centers.The way to get initial cluster centers by means of quantile radius is also faster.In the simulation experiments，QRD K－means was compared with Hierarchical K－means，DP－means and X－means both on the time consumption and accuracy.Results show that QRD K－means method is faster and more efficient.

程明畅1，刘友波2，张程嘉2，马铁丰1*. 基于分位数半径的动态K－means算法[J]. 南京大学学报(自然科学版), 2018, 54(1): 48–.

Cheng Mingchang1，Liu Youbo2，Zhang Chengjia2，Ma Tiefeng1*. Dynamic K－means algorithm based on quantile radius[J]. Journal of Nanjing University(Natural Sciences), 2018, 54(1): 48–.

参考文献

[1]　金建国．聚类方法综述．计算机科学，2014，41(11A)：288－293.(Jin J G．Review of clustering method.Computer Science，2014，41(11A)：288－293.)　
[2] Huang Z X．Extensions to the k－means algorithm for clustering large data sets with categorical values.Data Mining and Knowledge Discovery，1998，2(3)：283－304.
[3] 苏锦旗，薛惠锋，詹海亮．基于划分的K－均值初始聚类中心优化算法．微电子学与计算机，2009，26(1)：8－11.(Su J Q，Xue H F，Zhan H L．K－means initial clustering center optimal algorithm based on partitioning.Microelectronics & Computer，2009，26(1)：8－11.)
[4] Li J H，Song S J，Zhang L Y，et al.Robust K－median and k－means clustering algorithms for incomplete data.Mathematical Problems in Engineering，2017，Article ID：4321928，doi：10.1155/2016/4321928.
[5] Munir M U，Javed M Y，Khan S A．A hierarchical k－means clustering based fingerprint quality classification.Neurocomputing，2012，85：62－67.
[6] Liao K Y，Liu G Z，Xiao L，et al.A sample－based hierarchical adaptive K－means clustering method for large－scale video retrieval.Knowledge－Based Systems，2013，49：123－133.
[7] MacQueen J B．Some methods for classification and analysis of multivariate observations ∥ Proceeding of the 5th Berkeley Symposium on Mathematical Statistics and Probability.Berkeley，CA，USA：University of California Press，1967：281－297.
[8] Ruspini E H．A new approach to clustering.Information and Control，1969，15(1)：22－32.
[9] Arthur D，Vassilvitskii S．k－means＋＋：The advantages of careful seeding ∥ Proceedings of the 18th Annual ACM－SIAM Symposium on Discrete Algorithms.Pennsylvania，PA，USA：SIAM，2007：1027－1035.
[10] Elkan C．Using the triangle inequality to accelerate k－means ∥ Proceedings of the 12th International Conference on International Conference on Machine Learning.Menlo Park，CA，USA：AAAI Press，2003：147－153.
[11] Hamerly G．Making k－means even faster ∥ Proceedings of the 2010 SIAM International Conference on Data Mining.Pennsylvania，PA，USA：SIAM，2010：130－140.
[12] Likas A，Vlassis N，Verbeek J J．The global k－means clustering algorithm.Pattern Recognition，2003，36(2)：451－461.
[13] Bhatia S K．Adaptive k－means clustering ∥ Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference.Menlo Park，CA，USA：AAAI Press，2004：695－699.
[14] Mashor M Y．Hybrid training algorithm for RBF network.International Journal of the Computer，the Internet and Management，2000，8(2)：50－65.
[15] Pelleg D，Moore A W．X－means：Extending K－means with efficient estimation of the number of clusters ∥ Proceedings of the 17th Interna－tional Conference on Machine Learning.San Francisco，CA，USA：Morgan Kaufmann，2000：727－734.
[16] Kulis B，Jordan M I．Revisiting k－means：New algorithms via bayesian nonparametrics ∥ Proceedings of the 29th International Conference on Machine Learning.Edinburgh，United Kingdom：ICML，2012：513－520.
[17] Arai K，Barakbah A R．Hierarchical k－means：An algorithm for centroids initialization for K－means.Reports of the Faculty of Science and Engineering，Saga University，2007，36(1)：25－31.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed