南京大学学报(自然科学版) ›› 2022, Vol. 58 ›› Issue (4): 570–583.doi: 10.13232/j.cnki.jnju.2022.04.002

• • 上一篇    下一篇

基于拓展约束投影的加权半监督聚类集成算法

张鼎, 杨有龙(), 孙丽芹   

  1. 西安电子科技大学数学与统计学院,西安,710126
  • 收稿日期:2022-04-19 出版日期:2022-07-30 发布日期:2022-08-01
  • 通讯作者: 杨有龙 E-mail:ylyang@mail.xidian.edu.cn
  • 基金资助:
    陕西省自然科学基金(2021JM-133)

Weighted semi⁃supervised clustering ensemble algorithm based on extended constraint projection

Ding Zhang, Youlong Yang(), Liqin Sun   

  1. School of Mathematics and Statistics,Xidian University,Xi'an,710126, China
  • Received:2022-04-19 Online:2022-07-30 Published:2022-08-01
  • Contact: Youlong Yang E-mail:ylyang@mail.xidian.edu.cn

摘要:

半监督聚类集成旨在利用成对约束提升聚类集成的精度,但在高维空间的聚类效果却显著降低,另外,当只有少量的成对约束可以利用时,聚类性能很难提升.针对这些问题,提出一种新颖的半监督聚类集成算法WSCEC (Weighted Semi?supervised Clustering Ensemble Algorithm Based on Extended Constraint Projection).首先,利用多种聚类算法对数据的特征空间进行聚类,再使用随机子空间进行降维,以减少冗余特征的影响;其次,根据每对约束的k个最近或最远的样本以及约束间的传递关系来扩展原有的约束集,通过约束投影技术将原始数据空间投影到低维空间以满足尽可能多的约束;最后,设计了一个聚类解的加权策略,为每一个聚类解分配一个适当的权重以降低低质量聚类解的影响.在多个数据集上的实验结果证明了提出算法的有效性.

关键词: 半监督聚类, 聚类集成, 随机子空间, 约束投影

Abstract:

Semi?supervised clustering ensemble aims at improving the accuracy of clustering ensemble by using pairwise constraints,but it achieves poor performance on high?dimensional datasets. In addition,clustering performance has little improvement when only a few pairwise constraints are available. To solve these problems,this paper proposes a novel semi?supervised clustering ensemble algorithm WSCEC (Weighted Semi?supervised Clustering ensemble algorithm based on Extended Constraint projection algorithm). Firstly,a variety of clustering algorithms are exploited to cluster the feature space of data,and then the random subspace is utilized for the dimension reduction to reduce the impact of redundant features. Secondly,the original constraint set is expanded according to the k nearest or farthest samples of constraints and the transitive relationship between constraints,and the original data space is projected into a low?dimensional space by constraint projection technique to satisfy as many constraints as possible. Finally,a weighting strategy of clustering solutions is designed,which assigns an appropriate weight to each clustering solution to reduce the impact of low?quality clustering solutions. Experimental results on several datasets prove the effectiveness of the proposed algorithm.

Key words: semi?supervised clustering, clustering ensemble, random subspace, constraint projection

中图分类号: 

  • TP311.13

图1

WSCEC算法的框架"

图2

二维数据空间的拓展约束投影示例过程:(a)带有约束的实例;(b)选择k个最近或最远的样本;(c)约束的扩展;(d)约束投影结果"

表1

实验中使用的数据集"

数据集样本量特征数类别
Segmentation2319197
Semeion159325610
Optdigit56206410
ISOLET155961726
Cardiotocography21262310
Sat6435366
COIL201440102420
Yale165102415
MNIST4000400078410
ORL400102440
11_Tumors1741253311
Leukemia272112253

表2

WSCEC与其他算法在12个数据集上的平均NMI对比"

数据集COPE⁃kmeansCNPE⁃kmeansE2CPEWECRMCLASECUSENCWSCEC
Segmentation0.5130.5470.6340.6230.5390.5620.6380.662
Semeion0.5250.5540.5990.5900.5780.5510.6490.633
Optdigit0.8150.7630.9160.7930.8390.6830.8340.914
ISOLET0.6630.7070.6790.7110.6750.6870.7090.755
Cardiotocography0.4710.4790.4640.5270.5110.4500.5440.609
Sat0.5290.5330.5200.5440.5110.5410.5780.625
COIL200.7600.7760.7930.8010.7440.7230.7990.841
Yale0.4170.4770.4380.4130.3510.4140.4170.464
MNIST40000.5040.5430.5950.5840.5510.3630.5880.632
ORL0.6080.6140.5810.6450.4450.6660.6900.736
11_Tumors0.5440.5690.5540.5430.5150.5770.5240.670
Leukemia20.5270.5350.3630.5930.4900.3880.4560.627

表3

WSCEC与其他算法在12个数据集上的平均ARI对比"

数据集COPE⁃kmeansCNPE⁃kmeansE2CPEWECRMCLASECUSENCWSCEC
Segmentation0.3770.4560.5270.5260.4250.3430.5160.567
Semeion0.3910.4540.4870.4990.4640.4570.5350.522
Optdigit0.7760.7230.9210.7610.8060.6710.7800.914
ISOLET0.4150.4820.4400.4880.3860.3910.4540.507
Cardiotocography0.2540.2620.2360.3030.2780.2350.3040.385
Sat0.4390.4490.4340.4650.4210.3530.4860.531
COIL200.6130.6380.6450.6470.5960.5710.5980.681
Yale0.3420.4000.3680.3590.2960.3430.3470.392
MNIST40000.3860.4520.4900.4750.4290.2860.4690.535
ORL0.2170.1860.2350.3110.0630.2770.3270.390
11_Tumors0.3320.3720.3320.3310.3040.3650.3430.536
Leukemia20.5350.5380.3410.5960.4610.320.3970.633

表4

WSCEC与其他算法在12个数据集上的平均ACC对比"

数据集COPE⁃kmeansCNPE⁃kmeansE2CPEWECRMCLASECUSENCWSCEC
Segmentation0.5430.5630.650.6620.5740.5730.6450.668
Semeion0.6240.6690.5950.6710.6610.6170.7240.68
Optdigit0.8110.860.9440.8710.8510.790.910.939
ISOLET0.5320.5670.5410.5520.4320.4470.5450.565
Cardiotocography0.4340.4310.4070.4030.4380.3840.470.533
Sat0.6030.6420.6160.6530.640.4040.70.73
COIL200.6930.7050.7320.7170.6340.5530.6170.751
Yale0.3720.4190.380.3480.3120.3590.3550.397
MNIST40000.550.6210.6660.5830.6010.3340.60.687
ORL0.4120.380.3910.5280.240.4390.4520.553
11_Tumors0.5130.5220.5090.5390.5080.4890.5120.677
Leukemia20.7290.7910.7040.8360.6650.6480.5120.677

图3

不同算法在不同的集成大小下的平均NMI"

图4

不同算法在不同的集成大小下的平均ARI"

表5

WSCEC与WSCEC_not在不同数据集上的NMI"

DatasetsWSCECWSCEC_not
Segmentation0.6500.615
Semeion0.6210.644
Optdigit0.9280.900
ISOLET0.7520.728
Cardiotocography0.6150.508
Sat0.6250.560
COIL200.8320.806
Yale0.4590.401
MNIST40000.6380.603
ORL0.7320.631
11_Tumors0.6830.573
Leukemia20.6230.585

图5

在低维数据集上参数ρ对WSCEC算法的影响"

图6

在高维数据集上参数ρ对WSCEC算法的影响"

1 Jain A K. Data clustering:50 years beyond K?means. Pattern Recognition Letters201031(8):651-666.
2 Wu J J, Liu H F, Xiong H,et al. K?means?based consensus clustering:A unified view. IEEE Transactions on Knowledge and Data Engineering201527(1):155-169.
3 Yu Z W, Luo P N, Liu J M,et al. Semi?supervised ensemble clustering based on selected constraint projection. IEEE Transactions on Knowledge and Data Engineering201830(12):2394-2407.
4 Fern X Z, Brodley C E. Random projection for high dimensional data clustering:A cluster ensemble approach∥Proceedings of the 20th International Conference on Machine Learning. Washington,DC,USA:ACM,2003:186-193.
5 Fred A L N, Jain A K. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence200527(6):835-850.
6 Iam?On N, Boongoen T, Garrett S,et al. A link?based approach to the cluster ensemble problem. IEEE Transactions on Pattern Analysis and Machine Intelligence201133(12):2396-2409.
7 Huang D, Wang C D, Lai J H,et al. Locally weighted ensemble clustering. IEEE Transactions on Cybernetics201848(5):1460-1473.
8 Strehl A, Ghosh J. Cluster ensembles:A knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research20023(3):583-617.
9 Mimaroglu S, Erdil E. Combining multiple clusterings using similarity graph. Pattern Recognition201144(3):694-703.
10 Li T, Ding C, Jordan M I. Solving consensus and semi?supervised clustering problems using nonnegative matrix factorization∥Proceeding of the 21th International Conference on Machine Learning. Omaha,NE,USA:IEEE,2007:577-582.
11 Huang D, Lai J H, Wang C D. Ensemble clustering using factor graph. Pattern Recognition2016(50):131-142.
12 Tian J L, Ren Y Z, Cheng X. Stratified feature sampling for semi?supervised ensemble clustering. IEEE Access2019(7):128669-128675. DOI:10.1109/ACCESS.2019.2939581 .
13 Yu Z W, Luo P N, You J E,et al. Incremental semi?supervised clustering ensemble for high dimensional data clustering. IEEE Transactions on Knowledge and Data Engineering201628(3):701-714.
14 Lai Y X, He S Y, Lin Z J,et al. An adaptive robust semi?supervised clustering framework using weighted consensus of random K?means ensemble. IEEE Transactions on Knowledge and Data Engineering202133(5):1877-1890. DOI:10.1109/TKDE. 2019.2952596 .
15 Iqbal A M, Moh'd A, Khan Z. Semi?supervised clustering ensemble by voting. 2009,arXiv:1208. 4138.
16 Wei S T, Li Z X, Zhang C L. Combined constraint?based with metric?based in semi?supervised clustering ensemble. International Journal of Machine Learning and Cybernetics20189(7):1085-1100.
17 Yang Y, Jiang J M. Bi?weighted ensemble via HMM?based approaches for temporal data clustering. Pattern Recognition2018(76):391-403.
18 Yu Z W, Chen H S, You J,et al. Double selection based semi?supervised clustering ensemble for tumor clustering from gene expression profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics201411(4):727-740.
19 Yu Z W, Kuang Z Q, Liu J M,et al. Adaptive ensembling of semi?supervised clustering solutions. IEEE Transactions on Knowledge and Data Engineering201729(8):1577-1590.
20 Wang H J, Qi J H, Zheng W F,et al. Semi?supervised cluster ensemble based on binary similarity matrix∥2010 2nd IEEE International Conference on Information Management and Engineering. Chengdu,China:IEEE,2010:251-254.
21 Yang F, Li T, Zhou Q F,et al. Cluster ensemble selection with constraints. Neurocomputing2017(235):59-70.
22 Xiao W C, Yang Y, Wang H J,et al. Semi?supervised hierarchical clustering ensemble and its application. Neurocomputing2016(173):1362-1376.
23 Ho T K. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence199820(8):832-844.
24 Zhang D Q, Chen S C, Zhou Z H,et al. Constraint projections for ensemble learning∥Proceeding of the 23rd National Conference on Artificial Intelligence. Chicago,IL,USA:AAAI Press,2008:758-763.
25 Lu Z W, Peng Y X. Exhaustive and efficient constraint propagation:A graph?based learning approach and its applications. International Journal of Computer Vision2013103(3):306-325.
26 Wagstaff K, Cardie C, Rogers S,et al. Constrained k?means clustering with background knowledge∥Proceedings of the 18th International Conference on Machine Learning. San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,2001:557-584.
27 Wang H J, Li T, Li T R,et al. Constraint neighborhood projections for semi?supervised clustering. IEEE Transactions on Cybernetics201444(5):636-643.
28 Liu H F, Wu J J, Liu T L,et al. Spectral ensemble clustering via weighted K?means:Theoretical and practical evidence. IEEE Transactions on Knowledge and Data Engineering201729(5):1129-1143.
29 Huang D, Wang C D, Wu J S,et al. Ultra?scalable spectral clustering and ensemble clustering. IEEE Transactions on Knowledge and Data Engineering202032(6):1212-1226.
30 Xiong S C, Azimi J, Fern X Z. Active learning of constraints for semi?supervised clustering. IEEE Transactions on Knowledge and Data Engineering201426(1):43-54.
[1] 邵长龙, 孙统风, 丁世飞. 基于信息熵加权的聚类集成算法[J]. 南京大学学报(自然科学版), 2021, 57(2): 189-196.
[2] 杨红鑫,杨绪兵,张福全,业巧林. 半监督平面聚类算法设计[J]. 南京大学学报(自然科学版), 2020, 56(1): 9-18.
[3] 柴变芳,魏春丽,曹欣雨,王建岭. 面向网络结构发现的批量主动学习算法[J]. 南京大学学报(自然科学版), 2019, 55(6): 1020-1029.
[4] 杨鑫, 施虹, 王平心, 徐刚. 基于稳定性的三支聚类[J]. 南京大学学报(自然科学版), 2019, 55(4): 546-552.
[5] 王彤, 魏巍, 王锋. 基于样本对加权共协关系矩阵的聚类集成算法[J]. 南京大学学报(自然科学版), 2019, 55(4): 592-600.
[6]  严丽宇1,魏 巍1,2*,郭鑫垚1,崔军彪1.  一种基于带核随机子空间的聚类集成算法[J]. 南京大学学报(自然科学版), 2017, 53(6): 1033-.
[7]  孟 娜1,梁吉业1,2*,庞天杰1. 一种基于抽样的谱聚类集成算法
[J]. 南京大学学报(自然科学版), 2016, 52(6): 1090-.
[8]  常瑜1.2** ,梁吉业1, 2,高嘉伟1,2,杨静1·2
.  一种基于Seeds集和成对约束的半监督聚类算法*[J]. 南京大学学报(自然科学版), 2012, 48(4): 405-411.
[9]  申 彦**,宋顺林,朱玉全
.  一种基于半监督的大规模数据集聚类算法*
[J]. 南京大学学报(自然科学版), 2011, 47(4): 372-382.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!