A Clustering Algorithm by Fast Search of Density Peaks Based on Cluster Boundaries

Journal of Nanjing University (Natural Sciences) ›› 2017, Vol. 53 ›› Issue (2): 368–.

Jia Peiling1, Fan Jiancong1,2*, Peng Yanjun1,2

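Publication date: 2017-03-27 Release date: 2017-03-27
Author information: 1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, 266590; 2. Provincial Key Laboratory for Information Technology of Wisdom Mining of Shandong Province, Qingdao, 266590
Funding: National Natural Science Foundation of China (61203305, 61433012), Key Research and Development Program of Shandong Province (2016GSF120012), Natural Science Foundation of Shandong Province (ZR2015FM013), "Taishan Scholar" Climbing Program of Shandong Province
Received: 2016-11-03
*Corresponding author, E-mail: fanjiancong@sdust.edu.cn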

An improved clustering algorithm by fast search and find of density peaks based on boundary samples

Jia Peiling1，Fan Jiancong1，2*，Peng Yanjun1，2

Online:2017-03-27 Published:2017-03-27
About the authors: 1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, 266590, China; 2. Provincial Key Laboratory for Information Technology of Wisdom Mining of Shandong Province, Shandong University of Science and Technology, Qingdao, 266590, China

Abstract

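Summary: Compared with other clustering algorithms, clustering by fast search and find of density peaks (DPC) achieves good clustering results with few parameters. However, when a single cluster contains several density peaks, its results are unsatisfactory. To address this problem, a DPC algorithm based on cluster-boundary partitioning, B-DPC, is proposed. The improved algorithm first cleans the dataset using a new noise-removal criterion and then calls DPC for an initial clustering. Finally, it searches for the boundary samples of neighboring clusters and, according to the number and proportion of those boundary samples, performs a second round of clustering on the initial result. Experiments show that B-DPC handles the multiple-density-peak clustering problem well, can discover clusters of arbitrary shape, and is insensitive to noise.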
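The baseline DPC step that B-DPC calls for its first clustering round can be sketched as follows. This is a minimal illustration of the standard cutoff-kernel formulation of DPC, not the authors' code. The function name `dpc_cluster`, the cutoff parameter `d_c`, and the choice of taking the `n_centers` points with the largest product of density and distance as centers are assumptions made for this example; in practice the centers are usually picked by inspecting the decision graph.

```python
import math


def dpc_cluster(points, d_c, n_centers):
    """Baseline DPC sketch. Assumes n_centers >= 1 and points are numeric tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density rho: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank = {i: r for r, i in enumerate(order)}
    delta = [0.0] * n      # distance to the nearest point of higher density
    parent = [-1] * n      # index of that nearest denser point
    for r, i in enumerate(order):
        if r == 0:
            # Convention for the densest point: its largest distance.
            delta[i] = max(dist[i])
        else:
            parent[i] = min(order[:r], key=lambda k: dist[i][k])
            delta[i] = dist[i][parent[i]]
    # Hypothetical center selection: largest rho * delta (assumption).
    centers = sorted(range(n),
                     key=lambda i: (-rho[i] * delta[i], rank[i]))[:n_centers]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta
```

Because every non-center point simply follows its nearest denser neighbor, a cluster that contains several density peaks tends to be split into several pieces, which is the failure case that motivates the second clustering round.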

Abstract: In the data mining community, clustering is one of the most important research topics because data are complex and unlabeled, and a great deal of work has been devoted to data clustering algorithms. The paper "Clustering by fast search and find of density peaks" (DPC), published in Science, focuses on density-based clustering. Compared with other clustering algorithms, DPC uses fewer parameters yet can obtain better clustering results. However, when a cluster contains multiple density peaks, the clustering results are unsatisfactory. For this reason, a boundary-partition-based DPC algorithm, B-DPC, is proposed. B-DPC improves the standard DPC in two respects: a criterion for cleaning noisy data, and a clustering process with two rounds. A new criterion for judging whether a data instance is noise is defined by calculating the distances among all data instances. An instance is treated as noise if the distances between it and all noisy instances in the noisy dataset are less than a predetermined threshold. Such noisy instances are first removed from the dataset, and B-DPC then carries out a two-round process. The first round applies the standard DPC to choose candidate cluster centers, from which initial clusters are obtained and the decision graph is built. The second round merges similar clusters so that the number of clusters better matches the actual number; this is done by finding the boundary instances and using both their count and the ratio of boundary instances to the neighboring clusters. To test B-DPC, several classical artificial datasets and real-world datasets are used in the experiments, and several well-performing clustering algorithms, such as DPC, DBSCAN, and K-means, are used for comparison. Experimental results show that B-DPC can solve the multiple-density-peaks problem effectively and can also discover clusters with arbitrary shapes.
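The second round, which merges neighboring clusters according to the count and ratio of their boundary instances, might look roughly like the sketch below. The abstract does not give the exact definitions, so everything specific here is an assumption: the function name `merge_by_boundary`, the rule that a boundary instance is one lying within `d_c` of an instance from the other cluster, the ratio taken against the smaller cluster's size, and the thresholds `min_count` and `min_ratio` are all hypothetical.

```python
import math
from itertools import combinations


def merge_by_boundary(points, labels, d_c, min_count=2, min_ratio=0.1):
    """Merge clusters that share enough boundary instances (illustrative rule)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    # Union-find over cluster labels so that merges are transitive.
    root = {lab: lab for lab in clusters}

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for a, b in combinations(sorted(clusters), 2):
        # Assumed definition: an instance is a boundary instance of the
        # pair (a, b) if an instance of the other cluster lies within d_c.
        boundary = set()
        for i in clusters[a]:
            for j in clusters[b]:
                if math.dist(points[i], points[j]) < d_c:
                    boundary.update((i, j))
        smaller = min(len(clusters[a]), len(clusters[b]))
        # Assumed merge test: enough boundary instances in absolute number
        # and relative to the smaller of the two clusters.
        if len(boundary) >= min_count and len(boundary) / smaller >= min_ratio:
            root[find(b)] = find(a)
    return [find(lab) for lab in labels]
```

With points on a line at x = 0..5 split into labels 0 and 1, plus a distant pair labeled 2, calling the function with `d_c=1.5` merges the first two clusters and leaves the distant one untouched.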

Jia Peiling, Fan Jiancong, Peng Yanjun. An improved clustering algorithm by fast search and find of density peaks based on boundary samples[J]. Journal of Nanjing University (Natural Sciences), 2017, 53(2): 368–.
