南京大学学报(自然科学版), 2021, Vol. 57, Issue (5): 775–784. doi: 10.13232/j.cnki.jnju.2021.05.007


一种基于样本点距离突变的聚类方法

李娜, 段友祥, 孙歧峰, 沈楠

  1. 中国石油大学(华东)计算机科学与技术学院,青岛,266580
  • 收稿日期:2021-05-28 出版日期:2021-09-29 发布日期:2021-09-29
  • 通讯作者: 段友祥 E-mail:yxduan@upc.edu.cn
  • 基金资助:
    中石油重大科技项目(ZD2019-183-006);中央高校基本科研业务费专项资金(20CX05017A)

A clustering method based on mutation of sample point distance

Na Li, Youxiang Duan, Qifeng Sun, Nan Shen

  1. School of Computer Science and Technology,China University of Petroleum (East China), Qingdao,266580,China
  • Received:2021-05-28 Online:2021-09-29 Published:2021-09-29
  • Contact: Youxiang Duan E-mail:yxduan@upc.edu.cn

摘要:

针对聚类算法常见的难以确定参数、难以适应各种形状的数据集、在提高算法普适性时时间复杂度增大的问题,提出一种新的聚类算法:结合数据集全局和局部的特征寻找样本点距离的突变位置,通过计算样本点的簇内最小距离实现凸球型数据集的聚类;在此基础上提出子簇连结性强弱的概念,依据两个容易确定的参数进行子簇合并来适应各种形状的数据集.将该算法与DBSCAN (Density-Based Spatial Clustering of Applications with Noise)等多种聚类算法在四种经典数据集上比较,结果表明,该算法适用于类簇形状复杂的数据集,在同等聚类能力的算法中计算速度更快,且具有参数少、易确定的优点,在综合性能上表现优秀.

关键词: 聚类,距离突变,子簇合并,核密度估计

Abstract:

Clustering algorithms commonly suffer from parameters that are difficult to determine and from poor adaptability to datasets of various shapes, and their time complexity increases as their universality improves. To solve these problems, we propose a new clustering algorithm which finds the mutation position of the sample point distance by combining the global and local features of the dataset, and realizes the clustering of convex spherical datasets by calculating the minimum intra-cluster distance of each sample point. On this basis, we propose the concept of sub-cluster connectivity, which adapts to datasets of various shapes by merging sub-clusters based on two easily determined parameters. Comparing this algorithm with DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and other clustering algorithms on four classic datasets, the experimental results show that the proposed algorithm can be applied to datasets with complex cluster shapes and is faster than algorithms with the same clustering ability. Its further advantages are fewer, more easily determined parameters and excellent comprehensive performance.

Key words: clustering, distance mutation, sub-cluster merging, kernel density estimation
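As a rough illustration of the first stage described in the abstract, the sketch below locates a "mutation" as the largest jump in a sorted k-nearest-neighbour distance curve and grows compact sub-clusters from the resulting threshold. The function names, the use of k-nearest-neighbour distances and the choice of k are placeholders for exposition only, not the authors' implementation, which is defined in the full paper.

```python
# Hypothetical sketch of the first stage described in the abstract: find a
# "mutation" (largest jump) in a sorted distance curve and use it as the
# intra-cluster distance threshold for growing compact sub-clusters.
# This is an illustration, not the authors' implementation.
import numpy as np
from scipy.spatial.distance import cdist

def find_mutation_distance(X, k=4):
    """Distance threshold taken at the largest jump of the sorted
    k-nearest-neighbour distances (k is a placeholder choice)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, k - 1]      # k-th nearest-neighbour distance per point
    s = np.sort(knn)
    return s[np.argmax(np.diff(s))]         # value just before the largest jump

def grow_subclusters(X, threshold):
    """Connect points whose pairwise distance is below the threshold,
    yielding compact (convex, roughly spherical) sub-clusters."""
    D = cdist(X, X)
    labels = -np.ones(len(X), dtype=int)
    current = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:                         # iterative expansion of the sub-cluster
            p = stack.pop()
            reachable = np.where((D[p] <= threshold) & (labels == -1))[0]
            labels[reachable] = current
            stack.extend(reachable.tolist())
        current += 1
    return labels

# Usage: sub_labels = grow_subclusters(X, find_mutation_distance(X))
```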

CLC number: TP391

Fig. 1  Distribution of the sample points of the Aggregation dataset

Fig. 2  The L57 curve

Fig. 3  Slope changes of the L57 curve

Fig. 4  KDE curve of the Aggregation dataset
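Fig. 4 shows a KDE curve, and kernel density estimation appears among the keywords. The snippet below only illustrates how such a one-dimensional density curve can be drawn, here over nearest-neighbour distances of a synthetic stand-in for the Aggregation data; which quantity the paper actually smooths and which bandwidth rule it uses (for example the diffusion estimator of Ref. [19]) are specified in the full text.

```python
# Illustrative KDE curve over nearest-neighbour distances; the quantity being
# smoothed and the bandwidth rule in the paper may differ from this sketch.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # synthetic stand-in for the Aggregation data

D = cdist(X, X)
np.fill_diagonal(D, np.inf)
nn_dist = D.min(axis=1)                       # nearest-neighbour distance of each point

kde = gaussian_kde(nn_dist)                   # Gaussian kernel, Scott's rule bandwidth
xs = np.linspace(nn_dist.min(), nn_dist.max(), 200)
plt.plot(xs, kde(xs))
plt.xlabel("nearest-neighbour distance")
plt.ylabel("estimated density")
plt.show()
```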

Fig. 5  Flow chart of the proposed algorithm

Fig. 6  Clustering result on the Aggregation dataset

Fig. 7  Sub-cluster set S of the Twomoons dataset

Fig. 8  Flow chart of the distance-mutation clustering algorithm with sub-cluster merging
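Fig. 8 summarises the sub-cluster merging stage, and Table 2 below lists a threshold on the number of nearest samples between sub-clusters as its key parameter. The sketch that follows is only a loose reading of the "sub-cluster connectivity" idea, not the authors' code: two sub-clusters are merged when the number of close cross-cluster point pairs reaches a threshold. The names d_link and min_links are hypothetical stand-ins for the paper's two easily determined parameters.

```python
# Loose illustration of merging sub-clusters with strong mutual connectivity:
# count cross-cluster point pairs closer than a linking distance and merge
# the two sub-clusters when the count reaches a threshold.
import numpy as np
from scipy.spatial.distance import cdist

def merge_subclusters(X, labels, d_link, min_links):
    """d_link and min_links are hypothetical names standing in for the
    paper's two easily determined parameters."""
    ids = list(np.unique(labels))
    parent = {c: c for c in ids}

    def find(c):                                  # union-find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            links = int((cdist(X[labels == a], X[labels == b]) < d_link).sum())
            if links >= min_links:                # strong connectivity -> merge
                parent[find(a)] = find(b)

    merged = np.array([find(c) for c in labels])
    return np.unique(merged, return_inverse=True)[1]   # relabel to 0..k-1

# Usage: final_labels = merge_subclusters(X, sub_labels, d_link=0.5, min_links=5)
```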

Fig. 9  Clustering result on the Twomoons dataset

Fig. 10  Scatter plots of the datasets

Fig. 11  Comparison of the clustering results of the algorithms

Table 1  Performance comparison of the algorithms

Dataset | Metric | DBSCAN[8] | K-Means[9] | BIRCH[10] | GMM[12] | SDSC[14] | CFSFDP[15] | SOM[11] | DEC[16] | Ours
Aggregation | F | 0.988 | 0.739 | 0.717 | 0.694 | 0.985 | 0.857 | 0.622 | 0.337 | 0.997
Aggregation | ACC | 0.994 | 0.571 | 0.789 | 0.791 | 0.973 | 0.736 | 0.439 | 0.344 | 0.999
Aggregation | NMI | 0.982 | 0.749 | 0.860 | 0.861 | 0.971 | 0.850 | 0.340 | - | 0.998
Aggregation | ARI | 0.988 | 0.506 | 0.746 | 0.734 | 0.972 | 0.693 | 0.176 | - | 0.998
Aggregation | time (s) | 11.9 | 2.9 | 0.9 | 0.1 | 9.8 | 35.0 | 21.7 | 5.8 | 1.6
Spiral | F | 1.000 | 0.597 | 0.570 | 0.623 | 1.000 | 1.000 | 0.665 | 0.677 | 1.000
Spiral | ACC | 1.000 | 0.597 | 0.514 | 0.623 | 1.000 | 1.000 | 0.665 | 0.647 | 1.000
Spiral | NMI | 1.000 | 0.028 | - | - | 1.000 | 1.000 | - | - | 1.000
Spiral | ARI | 1.000 | 0.037 | - | - | 1.000 | 1.000 | 0.108 | - | 1.000
Spiral | time (s) | 15.9 | 2.1 | 0.7 | 0.1 | 23.7 | 6.5 | 7.7 | 3.4 | 2.9
Twomoons | F | 1.000 | 0.778 | 0.647 | 0.634 | 1.000 | 0.801 | 0.919 | 0.658 | 0.999
Twomoons | ACC | 1.000 | 0.734 | 0.666 | 0.666 | 1.000 | 0.762 | 0.891 | 0.660 | 0.999
Twomoons | NMI | 1.000 | 0.177 | 0.226 | 0.253 | 1.000 | 0.234 | 0.470 | 0.253 | 0.996
Twomoons | ARI | 1.000 | 0.218 | 0.062 | - | 1.000 | 0.275 | 0.607 | 0.164 | 0.999
Twomoons | time (s) | 40.2 | 3.2 | 1.1 | 0.1 | 549.0 | 14.9 | 12.9 | 5.7 | 5.9
ThreeCircles | F | 0.999 | 0.420 | 0.511 | 0.417 | 1.000 | 0.565 | 0.526 | 0.368 | 0.999
ThreeCircles | ACC | 0.999 | 0.555 | 0.555 | 0.335 | 1.000 | 0.593 | 0.555 | 0.356 | 0.999
ThreeCircles | NMI | 0.999 | - | - | - | 1.000 | 0.197 | - | - | 0.994
ThreeCircles | ARI | 0.999 | - | - | - | 1.000 | - | - | - | 0.997
ThreeCircles | time (s) | 230.7 | 13.2 | 4.7 | 0.1 | 3385.2 | 77.4 | 30.1 | 10.7 | 28.1

Table 2  Discussion and comparison of the algorithms' parameters

Algorithm | Main parameters | Parameter tuning | Impact on performance
DBSCAN | neighbourhood radius ε; number of samples in the ε-neighbourhood | parameter-sensitive | affects the clustering result
K-means | number of clusters | parameter-sensitive | the number of clusters is hard to determine automatically
BIRCH | maximum CF entries of an internal node; maximum CF entries of a leaf node | tuning is complex | affects the clustering result, especially for high-dimensional data
GMM | initial cluster centres | parameter-sensitive | many iterations are needed to determine the parameters, consuming resources
SDSC | number of clusters | parameter-sensitive | the number of clusters is hard to determine automatically
CFSFDP | cutoff distance dc | easy to tune | affects the selection of cluster centres
SOM | number of centroids | parameter-sensitive | affects the clustering result
DEC | number of cluster centres | tuning is complex and must be monitored via metrics | affects the clustering result
Ours | threshold on the number of nearest samples between sub-clusters | easy to tune | affects the sub-cluster merging result
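Table 2 lists the main parameters of each algorithm. Purely for orientation, the snippet below shows where those parameters appear in the scikit-learn implementations of the classical baselines (SDSC, CFSFDP, SOM and DEC have no standard scikit-learn estimator); the numeric values are placeholders, not the settings used in the paper.

```python
# Where Table 2's parameters appear in scikit-learn's classical baselines;
# the numeric values are placeholders, not the paper's settings.
from sklearn.cluster import DBSCAN, KMeans, Birch
from sklearn.mixture import GaussianMixture

baselines = {
    # eps = neighbourhood radius, min_samples = samples in the eps-neighbourhood
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    # n_clusters must be supplied by hand, matching Table 2's remark
    "K-means": KMeans(n_clusters=7, n_init=10),
    # branching_factor bounds CF entries per node, threshold the subcluster radius
    "BIRCH": Birch(n_clusters=7, branching_factor=50, threshold=0.5),
    # centres are initialised from n_init k-means runs before the EM iterations
    "GMM": GaussianMixture(n_components=7, n_init=5),
}
```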

Fig. 12  Performance comparison of the clustering algorithms

Table 3  Comparison of clustering metrics of the algorithms on UCI datasets

Dataset | Algorithm | ACC | NMI | ARI
Iris | DBSCAN | 0.674 | 0.535 | 0.515
Iris | GMM | 0.667 | 0.589 | 0.419
Iris | CFSFDP | 0.780 | 0.685 | 0.575
Iris | K-means | 0.667 | 0.589 | 0.419
Iris | Ours | 0.893 | 0.750 | 0.729
Sym | DBSCAN | 0.754 | 0.427 | 0.414
Sym | GMM | 0.780 | 0.603 | 0.571
Sym | CFSFDP | 0.851 | 0.756 | 0.722
Sym | K-means | 0.780 | 0.603 | 0.571
Sym | Ours | 0.794 | 0.775 | 0.674
Heart | DBSCAN | 0.465 | 0.382 | 0.385
Heart | GMM | 0.716 | 0.146 | 0.184
Heart | CFSFDP | 0.693 | 0.132 | 0.145
Heart | K-means | 0.716 | 0.146 | 0.184
Heart | Ours | 0.733 | 0.143 | 0.104
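Table 3 evaluates the algorithms on the UCI Iris, Sym and Heart datasets. The run below shows how one of them (Iris, which ships with scikit-learn) can be clustered with the classical baselines and scored with NMI and ARI; the preprocessing and parameter choices here are guesses, so the numbers need not match the table.

```python
# One illustrative run on UCI Iris with classical baselines; preprocessing
# and parameters are guesses, so scores need not match Table 3.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)          # standardise the four features

for name, model in {
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "K-means": KMeans(n_clusters=3, n_init=10),
    "GMM": GaussianMixture(n_components=3, random_state=0),
}.items():
    pred = model.fit_predict(X)
    print(f"{name}: NMI={normalized_mutual_info_score(y, pred):.3f} "
          f"ARI={adjusted_rand_score(y, pred):.3f}")
```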

Fig. 13  Clustering results of the proposed algorithm on the "cat" and "dog" image classes ("—" marks images that were not correctly classified)

1 Xu D K,Tian Y J. A comprehensive survey of clustering algorithms. Annals of Data Science,2015,2(2):165-193.
2 Zhao L H,Li Y. Study on urban road network traffic district division based on clustering analysis∥2018 Chinese Automation Congress. Xi'an,China:IEEE,2018:3556-3560.
3 Liu J,Chen Y C,Zou D P,et al. Study on segmented pricing method of electric vehicles charging service fee based on clustering algorithms∥2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics. Hangzhou,China:IEEE,2019:76-79.
4 Ramírez-Rozo T J,García-Álvarez J C,Castellanos-Domínguez C G. Infrared thermal image segmentation using expectation-maximization-based clustering∥2012 17th Symposium of Image,Signal Processing and Artificial Vision. Medellin,Colombia:IEEE,2012:223-226.
5 Bruse J L,Zuluaga M A,Khushnood A,et al. Detecting clinically meaningful shape clusters in medical image data:Metrics analysis for hierarchical clustering applied to healthy and pathological aortic arches. IEEE Transactions on Biomedical Engineering,2017,64(10):2373-2383.
6 Wan C Y,Ye M Q,Yao C W,et al. Brain MR image segmentation based on Gaussian filtering and improved FCM clustering algorithm∥2017 10th International Congress on Image and Signal Processing,BioMedical Engineering and Informatics. Shanghai,China:IEEE,2017:1-5.
7 许进文. 数据挖掘中聚类分析算法及应用研究. 计算机光盘软件与应用,2013,16(6):176-177.
Xu J W. Research on clustering analysis algorithm and application in data mining. Computer CD Software and Application,2013,16(6):176-177.
8 Ester M,Kriegel H P,Sander J,et al. A density-based algorithm for discovering clusters in large spatial databases with noise∥Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland,OR,USA:AAAI Press,1996:226-231.
9 周爱武,于亚飞. K-Means聚类算法的研究. 计算机技术与发展,2011,21(2):62-65.
Zhou A W,Yu Y F. The research about clustering algorithm of K-Means. Computer Technology and Development,2011,21(2):62-65.
10 Zhang T,Ramakrishnan R,Livny M. BIRCH:An efficient data clustering method for very large databases. ACM SIGMOD Record,1996,25(2):103-114.
11 杨黎刚,苏宏业,张英等. 基于SOM聚类的数据挖掘方法及其应用研究. 计算机工程与科学,2007,29(8):133-136.
Yang L G,Su H Y,Zhang Y,et al. A method of data mining based on SOM clustering and its application. Computer Engineering and Science,2007,29(8):133-136.
12 王凯南,金立左. 基于高斯混合模型的EM算法改进与优化. 工业控制计算机,2017,30(5):115-116,118.
Wang K N,Jin L Z. Improvement and optimization of EM algorithm based on Gaussian mixture model. Industrial Control Computer,2017,30(5):115-116,118.
13 Zelnik-Manor L,Perona P. Self-tuning spectral clustering∥Proceedings of the 17th International Conference on Neural Information Processing Systems. Cambridge,MA,USA:MIT Press,2004:1601-1608.
14 Xie J Y,Zhou Y,Ding L J. Local standard deviation spectral clustering∥2018 IEEE International Conference on Big Data and Smart Computing. Shanghai,China:IEEE,2018:242-250.
15 Rodriguez A,Laio A. Clustering by fast search and find of density peaks. Science,2014,344(6191):1492-1496.
16 Xie J Y,Girshick R,Farhadi A. Unsupervised deep embedding for clustering analysis∥Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York,NY,USA:ACM,2016(48):478-487.
17 张亚平. 谱聚类算法及其应用研究. 硕士学位论文. 太原:中北大学,2014.
Zhang Y P. Spectral clustering algorithm and its application research. Master Dissertation. Taiyuan:North University of China,2014.
18 Gionis A,Mannila H,Tsaparas P. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data,2007,1(1):4.
19 Botev Z I,Grotowski J F,Kroese D P. Kernel density estimation via diffusion. Annals of Statistics,2010,38(5):2916-2957.
20 董晓君,程春玲. 基于核密度估计的K-CFSFDP聚类算法. 计算机科学,2018,45(11):244-248.
Dong X J,Cheng C L. K-CFSFDP clustering algorithm based on kernel density estimation. Computer Science,2018,45(11):244-248.
21 王光,林国宇. 改进的自适应参数DBSCAN聚类算法. 计算机工程与应用,2020,56(14):45-51.
Wang G,Lin G Y. Improved adaptive parameter DBSCAN clustering algorithm. Computer Engineering and Applications,2020,56(14):45-51.
22 Tung C W. Prediction of pupylation sites using the composition of k-spaced amino acid pairs. Journal of Theoretical Biology,2013,336:11-17.
23 Yang X H,Zhu Q P,Huang Y J,et al. Parameter-free Laplacian centrality peaks clustering. Pattern Recognition Letters,2017,100:167-173.
24 McDaid A F,Greene D,Hurley N. Normalized mutual information to evaluate overlapping community finding algorithms. 2013,arXiv:1110.2515.
25 Steinley D. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods,2004,9(3):386-396.
26 任彧,顾成成. 基于HOG特征和SVM的手势识别. 科技通报,2011,27(2):211-214.
Ren Y,Gu C C. Hand gesture recognition based on HOG characters and SVM. Bulletin of Science and Technology,2011,27(2):211-214.