An improved clustering algorithm by fast search and find of density peaks based on boundary samples

 Jia Peiling¹, Fan Jiancong¹,²*, Peng Yanjun¹,²

Journal of Nanjing University (Natural Sciences), 2017, Vol. 53, Issue 2: 368.


Abstract

In the data mining community, clustering is one of the most important research topics because data are complex and unlabeled. A great many techniques have been devoted to data clustering algorithms. The paper "Clustering by fast search and find of density peaks" (DPC), published in Science, proposed a density-based clustering method. Compared with other clustering algorithms, DPC uses fewer parameters yet obtains better clustering results. However, when a cluster contains multiple density peaks, its clustering results are unsatisfactory. To address this, a boundary-partition-based DPC algorithm, B-DPC, is proposed. B-DPC improves the standard DPC in two respects: a criterion for cleaning noisy data and a two-round clustering process. The new criterion for judging whether a data instance is noise is defined by calculating the distances among all data instances: a data instance is regarded as noise if the distances between it and all noisy data instances in the noisy dataset are less than a predetermined threshold. Such noisy instances are first removed from the dataset, and then B-DPC carries out the two-round process. The first round applies the standard DPC to choose candidate cluster centers, which yields initial clusters and the decision graph. The second round merges similar clusters so that the number of clusters approaches the actual one; this is done by finding the boundary data instances, counting them, and computing the ratio of boundary instances to the neighboring clusters. To test B-DPC, classical artificial datasets and real-world datasets are used in the experiments, and several well-performing clustering algorithms, such as DPC, DBSCAN, and K-means, serve as comparison methods. Experimental results show that B-DPC effectively solves the multiple-density-peaks problem and also discovers clusters with arbitrary shapes.
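
The two rounds described in the abstract can be illustrated with a minimal, dependency-free Python sketch. The first function follows the standard DPC procedure of Rodriguez and Laio (cutoff-kernel density, distance to the nearest denser point, centers ranked by their product). The second function is only an assumption about how a boundary-based merge score might look: the abstract does not give the exact B-DPC criterion, so the names `dpc_first_round` and `boundary_ratio`, the cutoff `d_c`, and the ratio definition are hypothetical, and the noise-cleaning step is omitted.

```python
import math

def dpc_first_round(points, d_c, n_centers):
    """Standard DPC: return one cluster label per point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Cutoff-kernel density: neighbours strictly closer than d_c, self excluded.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Indices by decreasing density; sorted() is stable, so ties keep index order.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n  # nearest point that precedes i in the density order
    delta[order[0]] = max(dist[order[0]])  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Candidate centers: largest gamma = rho * delta. In practice these are
    # read off the decision graph; taking the top n_centers is a simplification.
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:n_centers]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Every other point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def boundary_ratio(points, labels, a, b, d_c):
    """Hypothetical merge score (not the paper's exact criterion): the share of
    points in clusters a and b lying within d_c of the other cluster.
    Assumes both clusters are non-empty."""
    in_a = [p for p, l in zip(points, labels) if l == a]
    in_b = [p for p, l in zip(points, labels) if l == b]
    near_a = sum(1 for p in in_a if any(math.dist(p, q) < d_c for q in in_b))
    near_b = sum(1 for q in in_b if any(math.dist(p, q) < d_c for p in in_a))
    return (near_a + near_b) / (len(in_a) + len(in_b))

# Two well-separated square blobs, each with a dense middle point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = dpc_first_round(points, d_c=1.2, n_centers=2)
```

Under this sketch, a second round would merge two first-round clusters when their `boundary_ratio` exceeds some chosen threshold; a high ratio suggests the two "clusters" are really two density peaks of the same cluster, which is the failure case of standard DPC that the abstract targets.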

Cite this article

 Jia Peiling, Fan Jiancong, Peng Yanjun. An improved clustering algorithm by fast search and find of density peaks based on boundary samples[J]. Journal of Nanjing University (Natural Sciences), 2017, 53(2): 368.
