南京大学学报(自然科学版) ›› 2019, Vol. 55 ›› Issue (4): 592–600.doi: 10.13232/j.cnki.jnju.2019.04.009

所属专题: 测试专题

• • 上一篇    下一篇

基于样本对加权共协关系矩阵的聚类集成算法

王彤1,魏巍1,2(),王锋1,2   

  1. 1. 山西大学计算机与信息技术学院,太原,030006
    2. 山西大学计算智能与中文信息处理教育部重点实验室,太原,030006
  • 收稿日期:2019-05-14 出版日期:2019-07-30 发布日期:2019-07-23
  • 通讯作者: 魏巍 E-mail:weiwei@sxu.edu.cn
  • 基金资助:
    山西省高等教育机构科技创新项目(2016111)

Sample pairwise weighting co⁃association matrix based ensemble clustering algorithm

Tong Wang1,Wei Wei1,2(),Feng Wang1,2   

  1. 1. School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, China
    2. Key Laboratory of Computation Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan, 030006, China
  • Received:2019-05-14 Online:2019-07-30 Published:2019-07-23
  • Contact: Wei Wei E-mail:weiwei@sxu.edu.cn

摘要:

聚类集成的目标是通过集成多个聚类结果来提高聚类算法的稳定性、鲁棒性以及精度.近些年,聚类集成受到了越来越多的关注.现有的集成聚类通常平等地对待所有基聚类,而不考虑它们的重要度.虽然学者们已经在这一方面做出了一些努力,例如使用加权策略来改进共协关系矩阵,但无论是给基聚类加权还是对类重要度评价时都忽略了样本对于其所在类贡献的差异.为此,提出了基于样本对加权共协关系矩阵的聚类集成算法,该算法利用k?means算法产生多个基聚类结果,然后对于其中的每个类再利用k?means算法产生多个小类,并计算去掉样本对所在的小类后类的不确定性变化的程度来评价该样本对的重要度,最后通过层次聚类算法得到聚类结果.在六个UCI数据集上的实验结果表明,基于样本对加权共协关系矩阵的聚类集成算法的性能优于三种经典的基于共协关系矩阵的聚类集成算法.

关键词: 聚类, 聚类集成, 共协矩阵, 加权策略

Abstract:

The goal of clustering ensemble is to improve the stability,robustness and accuracy of the final clustering results by integrating multiple clustering results. In recent years,clustering ensemble has attracted more and more attention. One limitation of most existing clustering ensemble methods is that they generally treat all base clustering equally,regardless of their importance. Although scholars have made some efforts in this aspect,for example,the weighted strategy is used to improve the co?association matrix. However,they ignore the difference in the contribution of samples to the classes they belong to when either weighting the base clustering or evaluating the class importance. Therefore,sample pairwise weighting co?association matrix based ensemble clustering algorithm is proposed. The algorithm firstly uses the k?means algorithm to generate multiple base partition results and multiple small classes for each class. The importance of the sample to the class is evaluated by calculating the change degree of uncertainty of the class after removing the subclass of the sample pairwise. Finally,the final clustering result can be obtained through the hierarchical clustering algorithm. Experimental results on six UCI data sets show that the performance of sample pairwise weighting co?association matrix based clustering ensemble algorithm is superior to the three classical clustering ensemble algorithms based on co?association matrix.

Key words: clustering, clustering ensemble, co?association matrix, weighted strategy

中图分类号: 

  • TP391

图1

两个基聚类的四种表示方法"

图2

对于不同参数,权值与样本对不稳定程度之间的相关性"

表1

数据集"

DatasetsObjectsDimensionsClasses
Iris15043
Wine178133
Column_3C31063
Glass21496
Musk4761672
Seeds21073

表2

对于不同的参数OWEC算法的平均性能(ARI)"

Datasetsθ
0.10.20.30.40.50.60.70.80.91.02.04.0
Iris0.62490.61410.61830.61670.61070.59850.60030.61700.61240.60230.59370.5976
Wine0.32380.36800.37210.37960.37940.37030.34980.35110.35950.33720.32010.3394
Column_3C0.36160.40570.44030.42340.43200.41960.40660.38000.44950.42060.41860.4226
Glass0.27320.30050.33570.31150.30940.35030.31450.33880.30380.33880.30380.7303
Musk0.16870.29140.25920.26390.25110.26440.29490.28760.29490.23550.20860.2444
Seeds0.36310.42830.45730.45750.46150.48610.47890.43520.54610.51530.48910.4681

表3

对于不同的参数OWEC算法的平均性能(CA)"

Datasetsθ
0.10.20.30.40.50.60.70.80.91.02.04.0
Iris0.85670.84900.86030.85430.84800.84100.84430.84670.85600.84770.84630.8430
Wine0.66460.68200.67130.69040.64830.67890.66460.66120.68400.66210.65900.6671
Column_3C0.71520.74710.76900.75630.75710.74260.74500.74100.76870.75190.75180.7556
Glass0.69930.70080.70980.71780.71360.72030.72310.73760.72100.73760.72100.3318
Musk0.68180.75690.74200.72910.72320.72550.74500.75610.75610.72430.71430.7255
Seeds0.71400.75930.77310.77430.77020.78640.78070.76240.81880.79930.78810.7793

表4

不同的算法在不同数据集上的平均性能(ARI,CA)"

DatasetsARICA
OWECEACLWEALWGPOWECEACLWEALWGP
Iris0.63240.43940.46340.44000.85270.65330.75270.7413
Wine0.36320.30450.33310.32950.66070.63480.63540.6399
Column_3C0.43200.30540.39630.36750.75710.69350.73260.6944
Glass0.35810.20920.27010.25200.72900.67820.64490.6729
Musk0.28760.02910.17850.16870.75610.59240.68930.6890
Seeds0.50160.13170.25220.23850.79430.52860.59520.5886

图3

在不同集成规模下不同的算法在各个数据集上的平均性能(ARI)"

图4

在不同集成规模下不同的算法在各个数据集上的平均性能(CA)"

1 Van HamF. Using Multilevel call matrices in large software projects∥IEEE Symposium on Information Visualization 2003. Seattle,WA,USA:IEEE,2003:227-232.
2 FurnasG F. Generalized fisheye views∥Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems. New York,NY,USA:ACM,1986:16-23.
3 SchafferD,ZaoZ P,GreenbergS,et al. Navigating hierarchically clustered networks through fisheye and full?zoom methods. ACM Transaction on Computer?Human Interaction,1996,3(2):162-188.
4 贾云得,吕宏静,刘万春. 鱼眼变形立体图像恢复稠密深度图的方法. 计算机学报,2000,23(12):1332-1336.
Jia Y D,Lu H J,Liu W C.Fish?eye lens camera stereo vision for dense depth map recovery. Chinese Journal of Computers,2000,23(12):1332-1336.
5 Van HeeK,SidorovaN,VoorhoeveM. Soundness and separability of workflow nets in the stepwise refinement approach∥Proceedings of the 24th International Conference on Application and Theory of Petri Nets. Springer Berlin Heidelberg,2003,2679:337-356.
6 StrehlA,GhoshJ. Cluster ensembles?a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research,2003,3:583-617.
7 Iam?OnN,BoongoenT,GarrettS,et al. A link?based approach to the cluster ensemble problem. IEEE Transactions on Pattern Analysis and Machine Intelligence,2011,33(12):2396-2409.
8 FredA L N,JainA K. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(6):835-850.
9 NgA Y,JordanM I,WeissY. On spectral clustering:analysis and an algorithm∥Proceedings of the 14th International Conference on Neural Information Processing Systems. Vancouver,Canada:MIT Press,2002:849-856.
10 FernX Z,BrodleyC E. Solving cluster ensemble problems by bipartite graph partitioning∥Proceedings of the 21st International Conference on Machine Learning. New York,NY,USA:ACM,2004:36-43.
11 KaufmanL,RousseeuwP J. Finding groups in data:an introduction to cluster analysis. Hoboken:John Wiley and Sons Inc,1990.
12 YangY,ChenK. Temporal data clustering via weighted clustering ensemble with different representations. IEEE Transactions on Know?ledge and Data Engineering,2010,23(2):307-320.
13 NandaA,PujariA K. Weighted co?clustering based clustering ensemble∥3rd National Conference on Computer Vision,Pattern Recognition,Image Processing and Graphics. Hubli,India:IEEE Press,2011:46-49.
14 ZhongC M,YueX D,ZhangZ H,et al. A clustering ensemble:Two?level?refined co?association matrix with path?based transformation. Pattern Recognition,2015,48(8):2699-2709.
15 HuangD,WangC D,LaiJ H. Locally weighted ensemble clustering. IEEE Transactions on Cybernetics,2018,48(5):1460-1473.
16 黄栋,王昌栋,赖剑煌等. 基于决策加权的聚类集成算法. 智能系统学报,2016,11(3):418-425.
Huang D,Wang C D,Lai J H,et al.Clustering ensemble by decision weightingCAAI Transactions on Intelligent Systems,2016,11(3):418-425.
17 BerikovV,PestunovI. Ensemble clustering based on weighted co?association matrices:error bound and convergence properties. Pattern Recognition,2017,63:427-436.
18 NazariA,DehghanA,NejatianS,et al. A comprehensive study of clustering ensemble weighting based on cluster quality and diversity. Pattern Analysis and Applications,2019,22(1):133-145.
19 RandW M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association,1971,66(336):846-850.
20 NguyenN,CaruanaR. Consensus clusterings∥Proceedings of the 7th IEEE International Conference on Data Mining. Omaha,NE,USA:IEEE,2007:607-612.
[1] 王丽娟,丁世飞,丁玲. 基于迁移学习的软子空间聚类算法[J]. 南京大学学报(自然科学版), 2020, 56(4): 515-523.
[2] 陈俊芬,赵佳成,韩洁,翟俊海. 基于深度特征表示的Softmax聚类算法[J]. 南京大学学报(自然科学版), 2020, 56(4): 533-540.
[3] 王露,王士同. 改进模糊聚类在医疗卫生数据的Takagi⁃Sugeno模糊模型[J]. 南京大学学报(自然科学版), 2020, 56(2): 186-196.
[4] 杨红鑫,杨绪兵,张福全,业巧林. 半监督平面聚类算法设计[J]. 南京大学学报(自然科学版), 2020, 56(1): 9-18.
[5] 柴变芳,魏春丽,曹欣雨,王建岭. 面向网络结构发现的批量主动学习算法[J]. 南京大学学报(自然科学版), 2019, 55(6): 1020-1029.
[6] 杨鑫, 施虹, 王平心, 徐刚. 基于稳定性的三支聚类[J]. 南京大学学报(自然科学版), 2019, 55(4): 546-552.
[7] 徐秀芳, 徐 森, 花小朋, 徐 静, 皋 军, 安 晶. 一种基于t-分布随机近邻嵌入的文本聚类方法[J]. 南京大学学报(自然科学版), 2019, 55(2): 264-271.
[8] 贾海宁, 王士同. 面向重尾噪声的模糊规则模型[J]. 南京大学学报(自然科学版), 2019, 55(1): 61-72.
[9] 陆慎涛, 葛洪伟, 周 竞. 自动确定聚类中心的移动时间势能聚类算法[J]. 南京大学学报(自然科学版), 2019, 55(1): 143-153.
[10] 朱庆峰1,2,葛洪伟1,2*. 快速特征映射优化的流形密度峰聚类[J]. 南京大学学报(自然科学版), 2018, 54(4): 838-.
[11] 底晓强1,2*,于力伟,刘 旭,Syed Umer. 一种基于演化博弈的低轨卫星切换算法研究[J]. 南京大学学报(自然科学版), 2018, 54(4): 855-.
[12] 帅 惠, 袁晓彤, 刘青山. 基于L0约束的稀疏子空间聚类[J]. 南京大学学报(自然科学版), 2018, 54(1): 23-.
[13] 刘 宏,申德荣*,寇 月,聂铁铮,于 戈. 基于实体演化的记录链接算法[J]. 南京大学学报(自然科学版), 2017, 53(6): 991-.
[14]  严丽宇1,魏 巍1,2*,郭鑫垚1,崔军彪1.  一种基于带核随机子空间的聚类集成算法[J]. 南京大学学报(自然科学版), 2017, 53(6): 1033-.
[15]  孙梦梦1,唐旭清1,2*.  基于粒度空间的最小生成树分类算法[J]. 南京大学学报(自然科学版), 2017, 53(5): 963-.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 程永林, 李德玉, 王素格. 基于极大相容块的邻域粗糙集模型[J]. 南京大学学报(自然科学版), 2019, 55(4): 529 -536 .
[2] 张勋, 石婉玲, 赵祝萱, 朱聪, 李维智, 贾叙东. 生物基聚醚胺型苯并噁嗪树脂的制备与性能研究[J]. 南京大学学报(自然科学版), 2019, 55(5): 832 -839 .