Journal of Nanjing University (Natural Sciences) ›› 2016, Vol. 52 ›› Issue (5): 908.
Dai Ming1*, Zhong Caiming2, Pang Yongming1, Cheng Kai1
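Abstract: The No Free Lunch theorem implies that no single clustering algorithm can solve every problem perfectly. Algorithm recommendation is an effective way to address this, and its core is measuring the similarity between data sets. This paper therefore proposes a new method for computing data set similarity: it extracts several properties that reveal the intrinsic distribution and structure of a data set, then computes the distances between the corresponding properties of two data sets to obtain a similarity measure. First, a statistical feature vector and a binarized vector are selected. The data set is then partitioned, and the distances from the points in each partition to its center, together with the robust path-based distances between point pairs, are computed to obtain the compactness and connectivity of the data set. A BP network is then trained to obtain the parameters of the four properties, which yields the similarity measure. Eight synthetic data sets and eight UCI data sets are used to build the data set repository, and seven representative clustering algorithms form the algorithm repository. Experiments on a selection of UCI data sets show that the proposed method performs well.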
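As a rough illustration of two of the steps in the abstract, the sketch below computes a compactness value from a given partition (the mean distance from each point to the center of its part) and combines per-property differences into a single weighted distance. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, each property is reduced to one scalar, the combination is a weighted sum of absolute differences, and the weights are fixed by hand rather than learned by a BP network.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def compactness(partition):
    """Mean Euclidean distance from each point to the center of its own part.

    `partition` is a list of parts; each part is a list of points.
    Assumption: this averaging is illustrative, the abstract does not
    give the exact formula.
    """
    total = 0.0
    count = 0
    for part in partition:
        center = centroid(part)
        total += sum(math.dist(p, center) for p in part)
        count += len(part)
    return total / count


def weighted_distance(props_a, props_b, weights):
    """Weighted sum of absolute differences between scalar property values.

    Assumption: the weights here are hand-picked placeholders; in the
    abstract they are obtained by training a BP network.
    """
    return sum(w * abs(a - b) for a, b, w in zip(props_a, props_b, weights))


if __name__ == "__main__":
    partition = [[(0, 0), (2, 0)], [(10, 10), (10, 12)]]
    print(compactness(partition))
    print(weighted_distance([1.0, 2.0], [0.5, 4.0], [0.5, 0.25]))
```

A smaller weighted distance would indicate more similar data sets, so the algorithm that worked best on the nearest data set in the repository could be recommended for a new one.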