南京大学学报(自然科学版) ›› 2021, Vol. 57 ›› Issue (2): 189–196. doi: 10.13232/j.cnki.jnju.2021.02.003


基于信息熵加权的聚类集成算法

邵长龙, 孙统风, 丁世飞

  1. 中国矿业大学计算机科学与技术学院,徐州,221116
  • 收稿日期:2021-01-11 出版日期:2021-03-23 发布日期:2021-03-23
  • 通讯作者: 丁世飞 E-mail:dingsf@cumt.edu.cn
  • 基金资助:
    国家自然科学基金(61976216)

Ensemble clustering based on information entropy weighting

Changlong Shao, Tongfeng Sun, Shifei Ding

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116,China
  • Received:2021-01-11 Online:2021-03-23 Published:2021-03-23
  • Contact: Shifei Ding E-mail:dingsf@cumt.edu.cn

摘要:

聚类集成的目的是通过集成多个不同的基聚类来生成一个更好的聚类结果,近年来研究者已经提出多个聚类集成算法,但是目前仍存在的局限性是这些算法大多把每个基聚类和每个簇都视为同等重要,使聚类结果很容易受到低质量基聚类和簇的影响.为解决这个问题,研究者提出一些给基聚类加权的方法,但大多把基聚类看作一个整体而忽视其中每个簇的差异.受到信息熵的启发,提出一种基于信息熵加权的聚类集成算法.算法首先对每个簇的不稳定性进行衡量,然后提出一种基于信息熵的簇评价指标,进而从簇层面进行加权,在对加权矩阵进行划分后得到最终的聚类结果.该算法有两个主要优点:第一,提出了一个有效的簇评价指标;第二,从比基聚类层面更细化的簇层面进行加权.一系列的实验证明了该算法的有效性和鲁棒性.

关键词: 聚类集成, 聚类, 簇层面加权, 信息熵

Abstract:

The purpose of ensemble clustering is to generate a better clustering result by integrating multiple different base clusterings. In recent years, researchers have proposed many ensemble clustering algorithms, but most of them regard every base clustering and every cluster as equally important, which makes the consensus result susceptible to low-quality base clusterings and clusters. To address this problem, several methods have been proposed to weight the base clusterings, but most of them treat each base clustering as a whole and ignore the differences among its clusters. Inspired by information entropy, we propose an ensemble clustering algorithm based on information entropy weighting. The algorithm first measures the uncertainty of each cluster, then derives an entropy-based cluster evaluation index and assigns weights at the cluster level. After partitioning the weighted matrix, the final clustering result is obtained. The proposed algorithm has two main advantages. First, it provides an effective cluster evaluation index. Second, it computes weights at the cluster level, which is finer-grained than weighting at the base-clustering level. A series of experiments demonstrate the effectiveness and robustness of the proposed algorithm.
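The cluster-level weighting idea described in the abstract can be sketched in code. The following is a minimal illustration, not the paper's exact formulation: it assumes each cluster's weight decays exponentially with the average entropy of how its members are scattered across the other base clusterings, and accumulates those weights into a co-association-style matrix. The function name and the exp(-H) weighting are assumptions for illustration only.

```python
import numpy as np

def entropy_weighted_coassociation(base_labels):
    """Build a cluster-weighted co-association matrix from base clusterings.

    base_labels: list of 1-D integer label arrays, one per base clustering.
    A cluster whose members stay together in the other partitions has low
    entropy and weight near 1; an unstable cluster is down-weighted.
    """
    base_labels = [np.asarray(lab) for lab in base_labels]
    n = len(base_labels[0])
    m = len(base_labels)
    A = np.zeros((n, n))
    for p, labels in enumerate(base_labels):
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            # average entropy of this cluster's members over the other partitions
            h = 0.0
            for q, other in enumerate(base_labels):
                if q == p:
                    continue
                counts = np.bincount(other[members])
                probs = counts[counts > 0] / len(members)
                h -= (probs * np.log2(probs)).sum()
            h /= max(m - 1, 1)
            w = np.exp(-h)  # stable cluster (h ~ 0) -> weight ~ 1
            A[np.ix_(members, members)] += w
    return A / m
```

The resulting matrix can then be partitioned by any consensus step, e.g. average-linkage hierarchical clustering on 1 - A, to obtain the final clustering.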

Key words: ensemble clustering, clustering, cluster-level weighting, information entropy

中图分类号: TP399

图1 聚类集成流程示意图

表1 实验所用UCI数据集的属性

数据集 | 样本数 | 类别数 | 特征数
Balancescale | 625 | 3 | 4
Blood | 748 | 2 | 4
Cardiotocograph (CTG) | 2126 | 10 | 21
Dermatology | 336 | 6 | 33
Ionosphere | 351 | 2 | 34
Seeds | 210 | 3 | 7
Wine | 178 | 3 | 13
Zoo | 101 | 7 | 16

表2 不同算法的ARI指标比较

数据集 | EAC | HBGF | WCT | WEAC | LWEA | IEWEC
Balancescale | 0.133 (0.060) | 0.122 (0.054) | 0.125 (0.054) | 0.131 (0.066) | 0.136 (0.103) | 0.138 (0.052)
Blood | 0.069 (0.016) | 0.077 (0.005) | 0.066 (0.005) | 0.038 (0.016) | 0.031 (0.017) | 0.071 (0.025)
CTG | 0.130 (0.006) | 0.137 (0.008) | 0.125 (0.004) | 0.126 (0.006) | 0.137 (0.007) | 0.142 (0.008)
Dermatology | 0.723 (0.040) | 0.721 (0.052) | 0.725 (0.071) | 0.719 (0.014) | 0.722 (0.103) | 0.730 (0.054)
Ionosphere | 0.150 (0.007) | 0.164 (0.010) | 0.168 (0.010) | 0.149 (0.006) | 0.169 (0.003) | 0.173 (0.008)
Seeds | 0.664 (0.071) | 0.653 (0.055) | 0.658 (0.055) | 0.588 (0.046) | 0.628 (0.055) | 0.696 (0.042)
Wine | 0.364 (0.025) | 0.353 (0.051) | 0.360 (0.001) | 0.362 (0.024) | 0.336 (0.079) | 0.374 (0.002)
Zoo | 0.677 (0.017) | 0.703 (0.011) | 0.704 (0.005) | 0.683 (0.020) | 0.700 (0.011) | 0.709 (0.008)

表3 不同算法的NMI指标比较

数据集 | EAC | HBGF | WCT | WEAC | LWEA | IEWEC
Balancescale | 0.102 (0.046) | 0.107 (0.045) | 0.101 (0.047) | 0.105 (0.051) | 0.108 (0.088) | 0.111 (0.042)
Blood | 0.014 (0.002) | 0.016 (0.002) | 0.018 (0.004) | 0.012 (0.001) | 0.020 (0.002) | 0.025 (0.006)
CTG | 0.262 (0.005) | 0.280 (0.007) | 0.265 (0.004) | 0.261 (0.005) | 0.277 (0.008) | 0.268 (0.006)
Dermatology | 0.820 (0.001) | 0.874 (0.011) | 0.872 (0.015) | 0.874 (0.014) | 0.873 (0.024) | 0.876 (0.029)
Ionosphere | 0.115 (0.004) | 0.125 (0.006) | 0.127 (0.005) | 0.114 (0.003) | 0.127 (0.003) | 0.129 (0.004)
Seeds | 0.671 (0.041) | 0.669 (0.034) | 0.670 (0.035) | 0.628 (0.028) | 0.653 (0.030) | 0.689 (0.031)
Wine | 0.409 (0.030) | 0.416 (0.036) | 0.420 (0.008) | 0.412 (0.030) | 0.400 (0.057) | 0.430 (0.003)
Zoo | 0.756 (0.007) | 0.757 (0.006) | 0.759 (0.003) | 0.756 (0.007) | 0.758 (0.006) | 0.761 (0.004)
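The ARI and NMI scores reported in 表2 and 表3 are standard external clustering-quality metrics. As a minimal sketch (using scikit-learn and hypothetical label vectors, not the paper's data), they can be computed as follows:

```python
# ARI and NMI compare a predicted partition against ground-truth classes.
# Both are invariant to label permutation; ARI is chance-adjusted (can be
# negative for worse-than-random partitions), NMI lies in [0, 1].
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1]   # hypothetical ground-truth classes
pred_labels = [1, 1, 1, 0, 0, 2]   # hypothetical consensus clustering

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(ari, nmi)
```

Higher values in both tables therefore indicate closer agreement with the ground-truth labels of each UCI dataset.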

图2 不同聚类集成规模下IEWEC算法的ARI值

图3 不同聚类集成规模下IEWEC算法的NMI值

1 Ding S F,Jia H J,Du M J,et al. A semi-supervised approximate spectral clustering algorithm based on HMRF model. Information Sciences,2018,429:215-228.
2 Cong L,Ding S F,Wang L J,et al. Image segmentation algorithm based on superpixel clustering. IET Image Processing,2018,12(11):2030-2035.
3 Saini N,Saha S,Bhattacharyya P. Automatic scientific document clustering using self-organized multi-objective differential evolution. Cognitive Computation,2019,11(2):271-293,doi:10.1007/s12559-018-9611-8.
4 Thanh N D,Ali M,Son L H. A novel clustering algorithm in a neutrosophic recommender system for medical diagnosis. Cognitive Computation,2017,9(4):526-544.
5 Fred A L N,Jain A K. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(6):835-850.
6 Li E R,Li Q Y,Geng Y A,et al. Ensemble clustering using maximum relative density path∥2018 IEEE International Conference on Big Data and Smart Computing. Shanghai,China:IEEE,2018,doi:10.1109/BigComp.2018.00036.
7 Rathore P,Bezdek J C,Erfani S M,et al. Ensemble fuzzy clustering using cumulative aggregation on random projections. IEEE Transactions on Fuzzy Systems,2018,26(3):1510-1524.
8 Zhong C M,Hu L Y,Yue X D,et al. Ensemble clustering based on evidence extracted from the co-association matrix. Pattern Recognition,2019,92:93-106.
9 Strehl A,Ghosh J. Cluster ensembles:a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research,2003,3:583-617.
10 Fern X Z,Brodley C E. Solving cluster ensemble problems by bipartite graph partitioning∥Proceedings of the 21st International Conference on Machine Learning. New York,NY,USA:ACM,2004,doi:10.1145/1015330.1015414.
11 Shi J B,Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905.
12 Huang D,Wang C D,Wu J S,et al. Ultra-scalable spectral clustering and ensemble clustering. IEEE Transactions on Knowledge and Data Engineering,2019,32(6):1212-1226,doi:10.1109/TKDE.2019.2903410.
13 Topchy A,Jain A K,Punch W. Clustering ensembles:models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(12):1866-1881.
14 Cristofor D,Simovici D. Finding median partitions using information-theoretical-based genetic algorithms. Journal of Universal Computer Science,2002,8(2):153-172.
15 Wu J J,Liu H F,Xiong H,et al. K-means-based consensus clustering:a unified view. IEEE Transactions on Knowledge and Data Engineering,2014,27(1):155-169.
16 Huang D,Lai J H,Wang C D. Ensemble clustering using factor graph. Pattern Recognition,2016,50:131-142.
17 Yu Z W,Li L,Gao Y J,et al. Hybrid clustering solution selection strategy. Pattern Recognition,2014,47(10):3362-3375.
18 Huang D,Wang C D,Lai J H. Locally weighted ensemble clustering. IEEE Transactions on Cybernetics,2017,48(5):1460-1473.
19 Li Z G,Wu X M,Chang S F. Segmentation using superpixels:a bipartite graph partitioning approach∥2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence,RI,USA:IEEE,2012:789-796.
20 Huang D,Wang C D,Lai J H. LWMC:a locally weighted meta-clustering algorithm for ensemble clustering∥International Conference on Neural Information Processing. Springer Berlin Heidelberg,2017:167-176.
21 He N N,Huang D. Meta-cluster based consensus clustering with local weighting and random walking∥International Conference on Intelligent Science and Big Data Engineering. Springer Berlin Heidelberg,2019:266-277.
22 Iam-On N,Boongoen T,Garrett S M,et al. A link-based approach to the cluster ensemble problem. IEEE Transactions on Pattern Analysis and Machine Intelligence,2011,33(12):2396-2409.
23 Huang D,Lai J H,Wang C D. Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis. Neurocomputing,2015,170:240-250.
24 Santos J M,Embrechts M. On the use of the adjusted rand index as a metric for evaluating supervised classification∥International Conference on Artificial Neural Networks. Springer Berlin Heidelberg,2009:175-184.
25 Vinh L T,Lee S,Park Y T,et al. A novel feature selection method based on normalized mutual information. Applied Intelligence,2012,37(1):100-120.