Clustering algorithm recommendation based on dataset attributes similarity

Journal of Nanjing University (Natural Sciences), 2016, Vol. 52, Issue 5: 908–.


Dai Ming1*, Zhong Caiming2, Pang Yongming1, Cheng Kai1 (代明, 钟才明, 庞永明, 程凯)

Online: 2016-09-25 Published: 2016-09-25
Received: 2016-07-17
About authors: 1. College of Information Science and Engineering, Ningbo University, Ningbo, 315210, China; 2. College of Science and Technology, Ningbo University, Ningbo, 315210, China
Funding: National Natural Science Foundation of China (61175054)
*Corresponding author, Email: daiming573@sina.com

Abstract

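Abstract (translated from Chinese): By the No Free Lunch theorem, no single clustering algorithm can solve every problem perfectly. Algorithm recommendation is an effective way to address this, and its core is the measurement of similarity between datasets. We therefore propose a new method for computing dataset similarity: it extracts several attributes that reveal the intrinsic distribution and structure of a dataset, then computes the distances between these attributes across datasets to obtain a similarity measure. First, a statistical feature vector and a binarized vector are selected. The dataset is then partitioned, and the distances from the points in each partition to its center point, together with the robust path-based distances between pairs of points, give the compactness and connectivity of the dataset. The parameters of the four attributes are then obtained by training a BP network, which yields the dataset similarity measure. Eight artificial datasets and eight UCI datasets are selected to build the dataset library, and seven representative clustering algorithms form the algorithm library. Experiments on some UCI datasets show that the proposed method performs well.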
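The compactness and connectivity attributes mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the averaging choices below are assumptions, and the function names (`compactness`, `minimax_path_distances`, `connectivity`) are hypothetical. The sketch also uses the plain minimax path-based distance (the smallest achievable largest edge over all paths between two points) rather than the robust variant named in the abstract, whose edge weighting is omitted here.

```python
import math
from collections import defaultdict


def group_by_label(points, labels):
    """Group points into partition cells keyed by their label."""
    cells = defaultdict(list)
    for point, label in zip(points, labels):
        cells[label].append(point)
    return cells


def compactness(points, labels):
    """Mean distance from each point to the center of its own cell (assumed definition)."""
    total = 0.0
    for cell in group_by_label(points, labels).values():
        center = [sum(coord) / len(cell) for coord in zip(*cell)]
        total += sum(math.dist(p, center) for p in cell)
    return total / len(points)


def minimax_path_distances(points):
    """Plain path-based distance: for each pair, the minimum over paths of the largest edge."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Floyd-Warshall-style relaxation with (min, max) in place of (min, +).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


def connectivity(points, labels):
    """Mean path-based distance over all point pairs sharing a cell (assumed summary)."""
    values = []
    for cell in group_by_label(points, labels).values():
        d = minimax_path_distances(cell)
        values.extend(d[i][j] for i in range(len(cell)) for j in range(i + 1, len(cell)))
    return sum(values) / len(values) if values else 0.0


# Toy data: three evenly spaced points in one cell and one distant point in another.
toy_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
toy_labels = [0, 0, 0, 1]
print(compactness(toy_points, toy_labels), connectivity(toy_points, toy_labels))
```

The cubic-time relaxation is only practical for small datasets; for larger ones the same distances can be read off a minimum spanning tree, which is left out here to keep the sketch short.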

Abstract: According to the No Free Lunch theorem, no clustering algorithm can solve all problems, and it is difficult for users to select a suitable algorithm when many clustering algorithms are available. An algorithm recommendation system is a potential solution. In this paper, we propose a framework for clustering algorithm recommendation. First, a dataset library and an algorithm library are constructed, and the mapping between datasets and algorithms is established by evaluating the performance of the algorithms on the datasets. Then we devise a dataset similarity measure by computing the statistical characteristics, binary vector, compactness, and connectedness attributes of the datasets and weighting these attributes with a BP network. For an input dataset, we find the most similar dataset in the dataset library using this similarity measure. Finally, the recommended clustering algorithm is obtained from the mapping between datasets and algorithms. In the proposed framework, eight artificial datasets and eight real UCI datasets are selected to construct the dataset library, and seven representative clustering algorithms form the algorithm library. Experiments on some UCI datasets demonstrate that the proposed recommendation framework achieves satisfactory performance.
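The recommendation step of the framework (find the most similar library dataset, then return the algorithm mapped to it) can be sketched as a weighted nearest-neighbor lookup. Everything concrete below is an assumption made for illustration: the per-attribute distances (Euclidean, normalized Hamming, absolute difference), the fixed weights standing in for the BP-network-trained parameters, the library entries, their attribute values, and the algorithm names are all hypothetical and not taken from the paper.

```python
import math


def attribute_distances(a, b):
    """Distances between two dataset descriptions, one per attribute (assumed forms)."""
    stat = math.dist(a["stats"], b["stats"])  # statistical feature vector
    binary = sum(x != y for x, y in zip(a["binary"], b["binary"])) / len(a["binary"])
    compact = abs(a["compactness"] - b["compactness"])
    connected = abs(a["connectedness"] - b["connectedness"])
    return [stat, binary, compact, connected]


def dataset_distance(a, b, weights):
    """Weighted sum of the four attribute distances; smaller means more similar."""
    return sum(w * d for w, d in zip(weights, attribute_distances(a, b)))


def recommend(query, library, weights):
    """Return the most similar library dataset and the algorithm mapped to it."""
    name, entry = min(
        library.items(),
        key=lambda item: dataset_distance(query, item[1]["attributes"], weights),
    )
    return name, entry["best_algorithm"]


# Hypothetical library: attribute values and algorithm labels are placeholders.
library = {
    "toy_blobs": {
        "attributes": {"stats": [0.0, 1.0], "binary": [1, 0, 1],
                       "compactness": 0.2, "connectedness": 0.9},
        "best_algorithm": "k-means",
    },
    "toy_rings": {
        "attributes": {"stats": [0.5, 2.0], "binary": [0, 0, 1],
                       "compactness": 0.8, "connectedness": 0.1},
        "best_algorithm": "spectral clustering",
    },
}
weights = [0.25, 0.25, 0.25, 0.25]  # placeholder for the BP-trained attribute parameters
query = {"stats": [0.4, 1.9], "binary": [0, 0, 1],
         "compactness": 0.7, "connectedness": 0.2}

result = recommend(query, library, weights)
print(result)
```

In this sketch the query dataset lies closest to the hypothetical `toy_rings` entry, so the algorithm mapped to that entry is returned. In the framework described above, the mapping would instead come from evaluating each library algorithm on each library dataset.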

Dai Ming, Zhong Caiming, Pang Yongming, Cheng Kai. Clustering algorithm recommendation based on dataset attributes similarity[J]. Journal of Nanjing University (Natural Sciences), 2016, 52(5): 908–.

