Journal of Nanjing University (Natural Science) ›› 2017, Vol. 53 ›› Issue (6): 1033–.


 Clustering ensemble algorithm based on random subspace with core

 Yan Liyu1, Wei Wei1,2*, Guo Xinyao1, Cui Junbiao1

  • Online: 2017-11-26 Published: 2017-11-26
  • About author: 1. School of Computer & Information Technology, Shanxi University, Taiyuan, 030006, China;
    2. Key Laboratory of Computation Intelligence & Chinese Information Processing, Ministry of Education, Taiyuan, 030006, China
  • Supported by:
     National Natural Science Foundation of China (61303008, 61202018); Natural Science Foundation of Shanxi Province (2013021018-1)
    Received: 2017-08-17
    *Corresponding author, E-mail: weiwei@sxu.edu.cn

Abstract:  Clustering analysis has a wide range of applications and is an important part of data mining. Clustering algorithms now face large-scale, high-dimensional data, yet traditional algorithms cluster sparse high-dimensional data poorly. Subspace clustering, which targets clustering problems in high-dimensional data, is an emerging and important branch of clustering analysis: as an extension of traditional clustering, it plays a vital role in clustering high-dimensional data effectively. Clustering ensembles, in turn, integrate many clustering results of the original data set into a partition that better reflects the inherent structure of the data, which can substantially improve clustering quality. A random subspace-based clustering ensemble algorithm generates attribute subspaces by randomly sampling attributes, then combines the base clustering results obtained on these subspaces into the final clustering result. The randomness of subspace generation gives the ensemble a high degree of diversity, but it cannot guarantee the validity of the base clusterings, because a randomly generated subspace may contain very few important attributes, which ultimately degrades the ensemble result. To address this problem, we propose a core-containing random subspace generation strategy. First, based on complementary mutual information in rough set theory, we select a subset of attributes that is crucial for describing the overall information of the data set and use it as the core of every attribute subspace. We then randomly select a certain number of attributes from the remaining attributes and combine them with the core to form a random subspace with core. This strategy preserves diversity among subspaces while improving each subspace's ability to represent the overall information of the data, which leads to a better clustering ensemble result. Experiments on data sets from the UCI (University of California, Irvine) repository show that the clustering ensemble based on random subspaces with core outperforms the one based on completely random subspaces on the majority of the data sets.
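
The subspace generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters `core_size`, `n_random`, and `n_subspaces`, and the toy data are all assumptions made for illustration. In particular, the abstract does not give the exact complementary mutual information criterion, so the sketch ranks attributes by a stand-in score, the complement entropy of the partition each attribute induces on the data.

```python
import random
from collections import Counter


def complement_entropy(column):
    """Complement entropy of the partition induced by one categorical attribute.

    Stand-in importance score (an assumption, not the paper's criterion):
    sum over equivalence classes of (|X|/|U|) * (1 - |X|/|U|).
    """
    n = len(column)
    return sum((c / n) * (1 - c / n) for c in Counter(column).values())


def core_random_subspaces(scores, core_size, n_random, n_subspaces, seed=0):
    """Build attribute subspaces that all share the same top-scoring core.

    scores      -- one importance score per attribute (higher = more important)
    core_size   -- number of attributes kept in every subspace (hypothetical name)
    n_random    -- number of extra attributes sampled per subspace (hypothetical name)
    n_subspaces -- number of subspaces to generate (hypothetical name)
    """
    rng = random.Random(seed)
    # Rank attribute indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
    core = sorted(ranked[:core_size])
    rest = ranked[core_size:]
    subspaces = []
    for _ in range(n_subspaces):
        # Sample the non-core part without replacement, independently per subspace.
        extra = rng.sample(rest, n_random)
        subspaces.append(core + sorted(extra))
    return core, subspaces


# Toy categorical data set: 4 objects, 3 attributes (illustrative only).
rows = [
    ("red", "s", "x"),
    ("red", "m", "x"),
    ("blue", "l", "x"),
    ("blue", "xl", "y"),
]
columns = list(zip(*rows))
scores = [complement_entropy(col) for col in columns]
core, subspaces = core_random_subspaces(scores, core_size=1, n_random=1, n_subspaces=5)
```

Each generated subspace would then be passed to a base clustering algorithm, and the resulting partitions combined by a consensus function. Neither step is shown here, because the abstract does not specify which base clusterer or consensus method the authors use.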

 

