南京大学学报(自然科学版), 2021, Vol. 57, Issue (5): 775–784. doi: 10.13232/j.cnki.jnju.2021.05.007


一种基于样本点距离突变的聚类方法

李娜, 段友祥, 孙歧峰, 沈楠

  1. 中国石油大学(华东)计算机科学与技术学院,青岛,266580
  • 收稿日期:2021-05-28 出版日期:2021-09-29 发布日期:2021-09-29
  • 通讯作者: 段友祥 E-mail:yxduan@upc.edu.cn
  • 基金资助:
    中石油重大科技项目(ZD2019-183-006);中央高校基本科研业务费专项资金(20CX05017A)

A clustering method based on mutation of sample point distance

Na Li, Youxiang Duan, Qifeng Sun, Nan Shen

  1. School of Computer Science and Technology,China University of Petroleum (East China), Qingdao,266580,China
  • Received:2021-05-28 Online:2021-09-29 Published:2021-09-29
  • Contact: Youxiang Duan E-mail:yxduan@upc.edu.cn

摘要:

针对聚类算法常见的难以确定参数、难以适应各种形状的数据集、在提高算法普适性时时间复杂度增大的问题,提出一种新的聚类算法:结合数据集全局和局部的特征寻找样本点距离的突变位置,通过计算样本点的簇内最小距离实现凸球型数据集的聚类;在此基础上提出子簇连结性强弱的概念,依据两个容易确定的参数进行子簇合并来适应各种形状的数据集.将该算法与DBSCAN (Density-Based Spatial Clustering of Applications with Noise)等多种聚类算法在四种经典数据集上比较,结果表明,该算法适用于类簇形状复杂的数据集,在同等聚类能力的算法中计算速度更快,且具有参数少、易确定的优点,在综合性能上表现优秀.

关键词: 聚类,距离突变,子簇合并,核密度估计

Abstract:

Clustering algorithms commonly suffer from parameters that are difficult to determine and from poor adaptability to datasets of various shapes, and their time complexity increases as their universality improves. To solve these problems, we propose a new clustering algorithm which finds the mutation position of the sample point distance by combining the global and local features of the dataset, and realizes the clustering of convex spherical datasets by calculating the minimum intra-cluster distance of each sample point. On this basis, we propose the concept of sub-cluster connectivity, which adapts to datasets of various shapes by merging sub-clusters based on two easily determined parameters. Comparing this algorithm with DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and other clustering algorithms on four classic datasets, the experimental results show that the proposed algorithm can be applied to datasets with complex cluster shapes and is faster than algorithms with the same clustering ability. Its further advantages are fewer, more easily determined parameters and excellent comprehensive performance.

Key words: clustering, distance mutation, sub-cluster merging, kernel density estimation
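As a rough illustration of the first stage described in the abstract, the sketch below locates a "mutation" as the largest jump in a sorted k-nearest-neighbour distance curve and grows compact sub-clusters from the resulting threshold. The function names, the use of k-nearest-neighbour distances and the choice of k are placeholders for exposition only, not the authors' implementation, which is defined in the full paper.

```python
# Hypothetical sketch of the first stage described in the abstract: find a
# "mutation" (largest jump) in a sorted distance curve and use it as the
# intra-cluster distance threshold for growing compact sub-clusters.
# This is an illustration, not the authors' implementation.
import numpy as np
from scipy.spatial.distance import cdist

def find_mutation_distance(X, k=4):
    """Distance threshold taken at the largest jump of the sorted
    k-nearest-neighbour distances (k is a placeholder choice)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, k - 1]      # k-th nearest-neighbour distance per point
    s = np.sort(knn)
    return s[np.argmax(np.diff(s))]         # value just before the largest jump

def grow_subclusters(X, threshold):
    """Connect points whose pairwise distance is below the threshold,
    yielding compact (convex, roughly spherical) sub-clusters."""
    D = cdist(X, X)
    labels = -np.ones(len(X), dtype=int)
    current = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:                         # iterative expansion of the sub-cluster
            p = stack.pop()
            reachable = np.where((D[p] <= threshold) & (labels == -1))[0]
            labels[reachable] = current
            stack.extend(reachable.tolist())
        current += 1
    return labels

# Usage: sub_labels = grow_subclusters(X, find_mutation_distance(X))
```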

CLC number: TP391

Fig. 1  Distribution of the sample points of the Aggregation dataset

Fig. 2  The L57 curve

Fig. 3  Slope changes of the L57 curve

Fig. 4  KDE curve of the Aggregation dataset
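Fig. 4 shows a KDE curve, and kernel density estimation appears among the keywords. The snippet below only illustrates how such a one-dimensional density curve can be drawn, here over nearest-neighbour distances of a synthetic stand-in for the Aggregation data; which quantity the paper actually smooths and which bandwidth rule it uses (for example the diffusion estimator of Ref. [19]) are specified in the full text.

```python
# Illustrative KDE curve over nearest-neighbour distances; the quantity being
# smoothed and the bandwidth rule in the paper may differ from this sketch.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # synthetic stand-in for the Aggregation data

D = cdist(X, X)
np.fill_diagonal(D, np.inf)
nn_dist = D.min(axis=1)                       # nearest-neighbour distance of each point

kde = gaussian_kde(nn_dist)                   # Gaussian kernel, Scott's rule bandwidth
xs = np.linspace(nn_dist.min(), nn_dist.max(), 200)
plt.plot(xs, kde(xs))
plt.xlabel("nearest-neighbour distance")
plt.ylabel("estimated density")
plt.show()
```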

Fig. 5  Flow chart of the proposed algorithm

Fig. 6  Clustering result on the Aggregation dataset

Fig. 7  Sub-cluster set S of the Twomoons dataset

Fig. 8  Flow chart of the distance-mutation clustering algorithm with sub-cluster merging
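Fig. 8 summarises the sub-cluster merging stage, and Table 2 below lists a threshold on the number of nearest samples between sub-clusters as its key parameter. The sketch that follows is only a loose reading of the "sub-cluster connectivity" idea, not the authors' code: two sub-clusters are merged when the number of close cross-cluster point pairs reaches a threshold. The names d_link and min_links are hypothetical stand-ins for the paper's two easily determined parameters.

```python
# Loose illustration of merging sub-clusters with strong mutual connectivity:
# count cross-cluster point pairs closer than a linking distance and merge
# the two sub-clusters when the count reaches a threshold.
import numpy as np
from scipy.spatial.distance import cdist

def merge_subclusters(X, labels, d_link, min_links):
    """d_link and min_links are hypothetical names standing in for the
    paper's two easily determined parameters."""
    ids = list(np.unique(labels))
    parent = {c: c for c in ids}

    def find(c):                                  # union-find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            links = int((cdist(X[labels == a], X[labels == b]) < d_link).sum())
            if links >= min_links:                # strong connectivity -> merge
                parent[find(a)] = find(b)

    merged = np.array([find(c) for c in labels])
    return np.unique(merged, return_inverse=True)[1]   # relabel to 0..k-1

# Usage: final_labels = merge_subclusters(X, sub_labels, d_link=0.5, min_links=5)
```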

Fig. 9  Clustering result on the Twomoons dataset

Fig. 10  Scatter plots of the datasets

Fig. 11  Comparison of the clustering results of the algorithms

Table 1  Performance comparison of the algorithms

Dataset | Metric | DBSCAN[8] | K-Means[9] | BIRCH[10] | GMM[12] | SDSC[14] | CFSFDP[15] | SOM[11] | DEC[16] | Ours
Aggregation | F | 0.988 | 0.739 | 0.717 | 0.694 | 0.985 | 0.857 | 0.622 | 0.337 | 0.997
Aggregation | ACC | 0.994 | 0.571 | 0.789 | 0.791 | 0.973 | 0.736 | 0.439 | 0.344 | 0.999
Aggregation | NMI | 0.982 | 0.749 | 0.860 | 0.861 | 0.971 | 0.850 | 0.340 | - | 0.998
Aggregation | ARI | 0.988 | 0.506 | 0.746 | 0.734 | 0.972 | 0.693 | 0.176 | - | 0.998
Aggregation | time (s) | 11.9 | 2.9 | 0.9 | 0.1 | 9.8 | 35.0 | 21.7 | 5.8 | 1.6
Spiral | F | 1.000 | 0.597 | 0.570 | 0.623 | 1.000 | 1.000 | 0.665 | 0.677 | 1.000
Spiral | ACC | 1.000 | 0.597 | 0.514 | 0.623 | 1.000 | 1.000 | 0.665 | 0.647 | 1.000
Spiral | NMI | 1.000 | 0.028 | - | - | 1.000 | 1.000 | - | - | 1.000
Spiral | ARI | 1.000 | 0.037 | - | - | 1.000 | 1.000 | 0.108 | - | 1.000
Spiral | time (s) | 15.9 | 2.1 | 0.7 | 0.1 | 23.7 | 6.5 | 7.7 | 3.4 | 2.9
Twomoons | F | 1.000 | 0.778 | 0.647 | 0.634 | 1.000 | 0.801 | 0.919 | 0.658 | 0.999
Twomoons | ACC | 1.000 | 0.734 | 0.666 | 0.666 | 1.000 | 0.762 | 0.891 | 0.660 | 0.999
Twomoons | NMI | 1.000 | 0.177 | 0.226 | 0.253 | 1.000 | 0.234 | 0.470 | 0.253 | 0.996
Twomoons | ARI | 1.000 | 0.218 | 0.062 | - | 1.000 | 0.275 | 0.607 | 0.164 | 0.999
Twomoons | time (s) | 40.2 | 3.2 | 1.1 | 0.1 | 549.0 | 14.9 | 12.9 | 5.7 | 5.9
ThreeCircles | F | 0.999 | 0.420 | 0.511 | 0.417 | 1.000 | 0.565 | 0.526 | 0.368 | 0.999
ThreeCircles | ACC | 0.999 | 0.555 | 0.555 | 0.335 | 1.000 | 0.593 | 0.555 | 0.356 | 0.999
ThreeCircles | NMI | 0.999 | - | - | - | 1.000 | 0.197 | - | - | 0.994
ThreeCircles | ARI | 0.999 | - | - | - | 1.000 | - | - | - | 0.997
ThreeCircles | time (s) | 230.7 | 13.2 | 4.7 | 0.1 | 3385.2 | 77.4 | 30.1 | 10.7 | 28.1

Table 2  Discussion and comparison of the algorithms' parameters

Algorithm | Main parameters | Parameter tuning | Impact on performance
DBSCAN | neighbourhood radius ε; number of samples in the ε-neighbourhood | parameter-sensitive | affects the clustering result
K-means | number of clusters | parameter-sensitive | the number of clusters is hard to determine automatically
BIRCH | maximum CF entries of an internal node; maximum CF entries of a leaf node | tuning is complex | affects the clustering result, especially for high-dimensional data
GMM | initial cluster centres | parameter-sensitive | many iterations are needed to determine the parameters, consuming resources
SDSC | number of clusters | parameter-sensitive | the number of clusters is hard to determine automatically
CFSFDP | cutoff distance dc | easy to tune | affects the selection of cluster centres
SOM | number of centroids | parameter-sensitive | affects the clustering result
DEC | number of cluster centres | tuning is complex and must be monitored via metrics | affects the clustering result
Ours | threshold on the number of nearest samples between sub-clusters | easy to tune | affects the sub-cluster merging result
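Table 2 lists the main parameters of each algorithm. Purely for orientation, the snippet below shows where those parameters appear in the scikit-learn implementations of the classical baselines (SDSC, CFSFDP, SOM and DEC have no standard scikit-learn estimator); the numeric values are placeholders, not the settings used in the paper.

```python
# Where Table 2's parameters appear in scikit-learn's classical baselines;
# the numeric values are placeholders, not the paper's settings.
from sklearn.cluster import DBSCAN, KMeans, Birch
from sklearn.mixture import GaussianMixture

baselines = {
    # eps = neighbourhood radius, min_samples = samples in the eps-neighbourhood
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    # n_clusters must be supplied by hand, matching Table 2's remark
    "K-means": KMeans(n_clusters=7, n_init=10),
    # branching_factor bounds CF entries per node, threshold the subcluster radius
    "BIRCH": Birch(n_clusters=7, branching_factor=50, threshold=0.5),
    # centres are initialised from n_init k-means runs before the EM iterations
    "GMM": GaussianMixture(n_components=7, n_init=5),
}
```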

Fig. 12  Performance comparison of the clustering algorithms

Table 3  Comparison of clustering metrics of the algorithms on UCI datasets

Dataset | Algorithm | ACC | NMI | ARI
Iris | DBSCAN | 0.674 | 0.535 | 0.515
Iris | GMM | 0.667 | 0.589 | 0.419
Iris | CFSFDP | 0.780 | 0.685 | 0.575
Iris | K-means | 0.667 | 0.589 | 0.419
Iris | Ours | 0.893 | 0.750 | 0.729
Sym | DBSCAN | 0.754 | 0.427 | 0.414
Sym | GMM | 0.780 | 0.603 | 0.571
Sym | CFSFDP | 0.851 | 0.756 | 0.722
Sym | K-means | 0.780 | 0.603 | 0.571
Sym | Ours | 0.794 | 0.775 | 0.674
Heart | DBSCAN | 0.465 | 0.382 | 0.385
Heart | GMM | 0.716 | 0.146 | 0.184
Heart | CFSFDP | 0.693 | 0.132 | 0.145
Heart | K-means | 0.716 | 0.146 | 0.184
Heart | Ours | 0.733 | 0.143 | 0.104
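Table 3 evaluates the algorithms on the UCI Iris, Sym and Heart datasets. The run below shows how one of them (Iris, which ships with scikit-learn) can be clustered with the classical baselines and scored with NMI and ARI; the preprocessing and parameter choices here are guesses, so the numbers need not match the table.

```python
# One illustrative run on UCI Iris with classical baselines; preprocessing
# and parameters are guesses, so scores need not match Table 3.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)          # standardise the four features

for name, model in {
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "K-means": KMeans(n_clusters=3, n_init=10),
    "GMM": GaussianMixture(n_components=3, random_state=0),
}.items():
    pred = model.fit_predict(X)
    print(f"{name}: NMI={normalized_mutual_info_score(y, pred):.3f} "
          f"ARI={adjusted_rand_score(y, pred):.3f}")
```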

Fig. 13  Clustering results of the proposed algorithm on the "cat" and "dog" image classes ("—" marks images that were not correctly classified)

1 Xu D K,Tian Y J. A comprehensive survey of clustering algorithms. Annals of Data Science,2015,2(2):165-193.
2 Zhao L H,Li Y. Study on urban road network traffic district division based on clustering analysis∥2018 Chinese Automation Congress. Xi'an,China:IEEE,2018:3556-3560.
3 Liu J,Chen Y C,Zou D P,et al. Study on segmented pricing method of electric vehicles charging service fee based on clustering algorithms∥2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics. Hangzhou,China:IEEE,2019:76-79.
4 Ramírez-Rozo T J,García-Álvarez J C,Castellanos-Domínguez C G. Infrared thermal image segmentation using expectation-maximization-based clustering∥2012 17th Symposium of Image,Signal Processing and Artificial Vision. Medellin,Colombia:IEEE,2012:223-226.
5 Bruse J L,Zuluaga M A,Khushnood A,et al. Detecting clinically meaningful shape clusters in medical image data:Metrics analysis for hierarchical clustering applied to healthy and pathological aortic arches. IEEE Transactions on Biomedical Engineering,2017,64(10):2373-2383.
6 Wan C Y,Ye M Q,Yao C W,et al. Brain MR image segmentation based on Gaussian filtering and improved FCM clustering algorithm∥2017 10th International Congress on Image and Signal Processing,BioMedical Engineering and Informatics. Shanghai,China:IEEE,2017:1-5.
7 许进文. 数据挖掘中聚类分析算法及应用研究. 计算机光盘软件与应用,2013,16(6):176-177.
Xu J W. Research on clustering analysis algorithm and application in data mining. Computer CD Software and Application,2013,16(6):176-177.
8 Ester M,Kriegel H P,Sander J,et al. A density-based algorithm for discovering clusters in large spatial databases with noise∥Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland,OR,USA:AAAI Press,1996:226-231.
9 周爱武,于亚飞. K-Means聚类算法的研究. 计算机技术与发展,2011,21(2):62-65.
Zhou A W,Yu Y F. The research about clustering algorithm of K-Means. Computer Technology and Development,2011,21(2):62-65.
10 Zhang T,Ramakrishnan R,Livny M. BIRCH:An efficient data clustering method for very large databases. ACM SIGMOD Record,1996,25(2):103-114.
11 杨黎刚,苏宏业,张英等. 基于SOM聚类的数据挖掘方法及其应用研究. 计算机工程与科学,2007,29(8):133-136.
Yang L G,Su H Y,Zhang Y,et al. A method of data mining based on SOM clustering and its application. Computer Engineering and Science,2007,29(8):133-136.
12 王凯南,金立左. 基于高斯混合模型的EM算法改进与优化. 工业控制计算机,2017,30(5):115-116,118.
Wang K N,Jin L Z. Improvement and optimization of EM algorithm based on Gaussian mixture model. Industrial Control Computer,2017,30(5):115-116,118.
13 Zelnik-Manor L,Perona P. Self-tuning spectral clustering∥Proceedings of the 17th International Conference on Neural Information Processing Systems. Cambridge,MA,USA:MIT Press,2004:1601-1608.
14 Xie J Y,Zhou Y,Ding L J. Local standard deviation spectral clustering∥2018 IEEE International Conference on Big Data and Smart Computing. Shanghai,China:IEEE,2018:242-250.
15 Rodriguez A,Laio A. Clustering by fast search and find of density peaks. Science,2014,344(6191):1492-1496.
16 Xie J Y,Girshick R,Farhadi A. Unsupervised deep embedding for clustering analysis∥Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York,NY,USA:ACM,2016(48):478-487.
17 张亚平. 谱聚类算法及其应用研究. 硕士学位论文. 太原:中北大学,2014.
Zhang Y P. Spectral clustering algorithm and its application research. Master Dissertation. Taiyuan:North University of China,2014.
18 Gionis A,Mannila H,Tsaparas P. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data,2007,1(1):4.
19 Botev Z I,Grotowski J F,Kroese D P. Kernel density estimation via diffusion. Annals of Statistics,2010,38(5):2916-2957.
20 董晓君,程春玲. 基于核密度估计的K-CFSFDP聚类算法. 计算机科学,2018,45(11):244-248.
Dong X J,Cheng C L. K-CFSFDP clustering algorithm based on kernel density estimation. Computer Science,2018,45(11):244-248.
21 王光,林国宇. 改进的自适应参数DBSCAN聚类算法. 计算机工程与应用,2020,56(14):45-51.
Wang G,Lin G Y. Improved adaptive parameter DBSCAN clustering algorithm. Computer Engineering and Applications,2020,56(14):45-51.
22 Tung C W. Prediction of pupylation sites using the composition of k-spaced amino acid pairs. Journal of Theoretical Biology,2013,336:11-17.
23 Yang X H,Zhu Q P,Huang Y J,et al. Parameter-free Laplacian centrality peaks clustering. Pattern Recognition Letters,2017,100:167-173.
24 McDaid A F,Greene D,Hurley N. Normalized mutual information to evaluate overlapping community finding algorithms. 2013,arXiv:1110.2515.
25 Steinley D. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods,2004,9(3):386-396.
26 任彧,顾成成. 基于HOG特征和SVM的手势识别. 科技通报,2011,27(2):211-214.
Ren Y,Gu C C. Hand gesture recognition based on HOG characters and SVM. Bulletin of Science and Technology,2011,27(2):211-214.